Extracting nodes from namespaced XML

While most developers deride XML, opting instead for a subset of the second-most scolded language on the planet, sitemaps and Java tools still depend on this pointy format.

As part of getting my many pages indexed by the gargantuan gatekeepers to the light web, I produce a sitemap using @astrojs/sitemap. And thanks to sitemaps predating jQuery, they’re encoded in XML.

A sitemap contains a set of url nodes with loc children. Each loc node’s text corresponds to the URL of a page you want crawled.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://jamesconroyfinn.com/</loc>
  </url>
</urlset>

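Or at least, to the modern JSON aficionado, these entries appear to be urls and locs. The reality, however, is “stranger and friction” because XML has namespaces. Every element in a sitemap lives in the http://www.sitemaps.org/schemas/sitemap/0.9 namespace, and an unprefixed XPath name such as //loc only matches elements in no namespace at all, so tools like xmllint come back with an empty set.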
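To see the namespace hiding in plain sight, here is a minimal sketch in Python. It is not part of my sitemap workflow; it assumes the standard library's xml.etree.ElementTree and a trimmed copy of the sitemap above, and the variable names are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sitemap above (only the default namespace is kept).
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://jamesconroyfinn.com/</loc>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# ElementTree stores the namespace URI as part of each tag, in {uri}name form.
print(root.tag)

# A bare "loc" only matches elements in no namespace, so nothing is found.
naive_matches = root.findall(".//loc")
print(naive_matches)
```

The root tag prints with the namespace URI wrapped in braces, and the naive search returns an empty list, which mirrors what an unqualified //loc does in XPath.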

Finding the textual content of these <loc> nodes takes a little trick!

xmllint --xpath '//*[local-name()="loc"]/text()' [...]
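For comparison, a rough Python counterpart to the local-name() trick might look like the sketch below. It assumes Python 3.8 or later, where ElementTree accepts a {*} wildcard for "any namespace", and it reuses a trimmed copy of the sitemap rather than reading a real file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sitemap above (only the default namespace is kept).
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://jamesconroyfinn.com/</loc>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# "{*}loc" matches every element whose local name is "loc",
# regardless of which namespace it belongs to (Python 3.8+).
urls = [loc.text for loc in root.iterfind(".//{*}loc")]
print(urls)
```

Like the XPath version, this ignores namespaces entirely, which is convenient for a quick look but matches any element that happens to be called loc.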

If ignoring namespaces causes problems, say because an extension element such as image:loc shares the same local name, one can add the namespace-uri() function to narrow results further.

//*[local-name()="loc" and namespace-uri()="http://www.sitemaps.org/schemas/sitemap/0.9"]
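The difference between the two queries can be sketched in Python, again assuming xml.etree.ElementTree. The image entry and its example.com URL are hypothetical additions for illustration, and the sm prefix is an arbitrary name I chose for the namespace mapping.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# The <image:image> entry is a made-up example; it is not in my real sitemap.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://jamesconroyfinn.com/</loc>
    <image:image>
      <image:loc>https://example.com/photo.jpg</image:loc>
    </image:image>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# Ignoring namespaces picks up both the page URL and the image URL.
any_loc = [el.text for el in root.iterfind(".//{*}loc")]

# Binding a prefix to the sitemap namespace keeps only the page URL.
page_locs = [el.text for el in root.iterfind(".//sm:loc", {"sm": SITEMAP_NS})]

print(any_loc)
print(page_locs)
```

The namespace-qualified search plays the same role as the namespace-uri() check: it drops the image location and keeps only the locs that belong to the sitemap schema.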