Extracting nodes from namespaced XML
While most developers deride XML, opting instead for a subset of the second-most scolded language on the planet, sitemaps and Java tools still depend on this pointy format.
As part of getting my many pages indexed by the gargantuan gatekeepers to the
light web, I produce a sitemap using @astrojs/sitemap. And thanks to sitemaps
predating jQuery, they’re encoded in XML.
A sitemap contains a set of url nodes with loc children. Each loc node’s text
corresponds to the URL of a page you want crawled.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
<loc>https://jamesconroyfinn.com/</loc>
</url>
</urlset>
Or at least, to the modern JSON-aficionado, these entries appear to be urls
and locs. The reality, however, is “stranger and friction” because XML has
namespaces and tools like xmllint might not behave as one would expect.
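Finding the textual content of these <loc> nodes takes a little trick! Because
urlset declares a default namespace, a plain //loc matches nothing, so the
query has to match on the local name instead.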
xmllint --xpath '//*[local-name()="loc"]/text()' [...]
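For anyone who would rather see the same behaviour outside xmllint, here is a minimal sketch using Python's standard library. It is not the approach from this post, just an analogue: ElementTree has no local-name() function, so the sketch assumes Python 3.8 or later and uses the {*} wildcard, which plays roughly the same role. The sample document is a trimmed copy of the sitemap above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sitemap shown earlier, keeping only the default namespace.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://jamesconroyfinn.com/</loc>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# An unqualified name only matches elements in no namespace,
# so this finds nothing in a namespaced sitemap.
plain = root.findall(".//loc")

# "{*}loc" matches loc in any namespace (Python 3.8+),
# which is roughly what local-name()="loc" does in XPath.
any_ns = [el.text for el in root.findall(".//{*}loc")]

print(plain)
print(any_ns)
```

The empty first result is the same surprise xmllint produces with a bare //loc: the element is really named loc inside the sitemap namespace, not plain loc.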
If ignoring namespaces causes problems, one can include the namespace-uri()
function to narrow results further.
//*[local-name()="loc" and namespace-uri()="http://www.sitemaps.org/schemas/sitemap/0.9"]
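One case where narrowing matters is a sitemap that uses the image extension, which defines its own loc element. The sketch below shows the difference, again with Python's standard library rather than xmllint. The example.com URLs and the sm prefix are made up for illustration; only the two namespace URIs come from the sitemap above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Hypothetical sitemap entry with both a page loc and an image loc.
SITEMAP = f"""<urlset xmlns="{SITEMAP_NS}" xmlns:image="{IMAGE_NS}">
  <url>
    <loc>https://example.com/</loc>
    <image:image>
      <image:loc>https://example.com/cover.png</image:loc>
    </image:image>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# Ignoring namespaces picks up the image URL as well as the page URL.
every_loc = [el.text for el in root.findall(".//{*}loc")]

# Binding a prefix (the name "sm" is arbitrary) to the sitemap namespace
# keeps only page URLs, much like the namespace-uri() check does.
page_locs = [el.text for el in root.findall(".//sm:loc", {"sm": SITEMAP_NS})]

print(every_loc)
print(page_locs)
```

The prefix in the query does not have to match anything in the document. Only the namespace URI it maps to is compared.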