XML XPath

by Slatian

XPath is a way to select tags from an XML document. It works a bit like a file path and a bit like a CSS selector.

Example Document

For the Examples on this page the following XML-Document is assumed.

<?xml version="1.0" encoding="UTF-8"?>
<catalog>
	<title>An Overview of Dataformats.</title>
	<use>Describes Dataformats</use>
	<item>
		<name>JSON</name>
		<use>Data Serialization</use>
		<link type="website" href="https://www.json.org" />
		<link type="wikipedia" href="https://en.wikipedia.org/wiki/JSON" />
	</item>
	<item>
		<name>XML</name>
		<use>Documents</use>
		<link type="website" href="https://www.xml.com/" />
		<link type="wikipedia" href="https://en.wikipedia.org/wiki/XML" />
	</item>
</catalog>


Querying Nodes

XPath works like a file-path.

Example: To address the title tag (node in XPath terminology).

/catalog/title

Note: If you are inside a context (e.g. one of the item elements) you can also use it like a relative file path: use, ./name, ../title

For wildcards on a single level one can use a * for the tag name.

Example: Same as previous but with a wildcard.

/*/title

By using // one can select all elements matching the following description no matter where in the document they are.

Example: Select all three use tags.

//use

Note: The // also works in the middle of a path: /catalog//use selects every use tag at any depth below catalog.
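The path examples above can be tried out from Python. What follows is a minimal sketch, assuming the standard-library xml.etree.ElementTree module, which implements only a subset of XPath. Queries there are evaluated relative to an element, so with the catalog root as the starting point /catalog/title is written as ./title. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The example document from this page (XML declaration omitted),
# embedded so the sketch is self-contained.
EXAMPLE_XML = """\
<catalog>
  <title>An Overview of Dataformats.</title>
  <use>Describes Dataformats</use>
  <item>
    <name>JSON</name>
    <use>Data Serialization</use>
    <link type="website" href="https://www.json.org" />
    <link type="wikipedia" href="https://en.wikipedia.org/wiki/JSON" />
  </item>
  <item>
    <name>XML</name>
    <use>Documents</use>
    <link type="website" href="https://www.xml.com/" />
    <link type="wikipedia" href="https://en.wikipedia.org/wiki/XML" />
  </item>
</catalog>
"""

root = ET.fromstring(EXAMPLE_XML)  # root is the <catalog> element

# Counterpart of /catalog/title, written relative to the root element.
title_text = root.find("./title").text
print(title_text)

# Wildcard on a single level: the name tag below any direct child of the root.
names = [el.text for el in root.findall("./*/name")]
print(names)

# Counterpart of //use: every use tag at any depth below the root.
uses = [el.text for el in root.findall(".//use")]
print(uses)
```

The last query returns three entries, one for the catalog itself and one per item.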

Querying Attributes

To get the attribute of a tag use the notation /path/to/tag/@attribute.

Example: To get the href attributes:

//link/@href

Dealing with Multiple Elements using Predicates

An XPath often yields multiple results, but one may not want every element that matches a tag.

Writing a number in square brackets works similarly to an array access in a programming language, except that counting starts at 1: tag[1]

Example: Select the second item from the catalog:

/catalog/item[2]

In place of the number one can also use expressions like last(), last()-1 or position()>1.

One can also compare against attributes here.

Example: Only select the type="wikipedia" links.

//link[@type='wikipedia']

Note on Quoting: XPath accepts both single and double quotes for strings. The examples here use single quotes so the expressions can easily be used inside double-quoted XML attributes.

Boolean Logic: Boolean logic is implemented with the and and or keywords and the not() function.

There are way more possibilities; the w3schools tutorial has a more complete predicate list.
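The positional and attribute predicates shown above can be sketched in Python as follows. This assumes the standard-library xml.etree.ElementTree module, whose XPath subset covers [n], [last()], [last()-n] and [@attr='value'], but not position()>1 or the and, or and not() logic. The paths are written relative to the catalog root.

```python
import xml.etree.ElementTree as ET

# The example document from this page (XML declaration omitted).
EXAMPLE_XML = """\
<catalog>
  <title>An Overview of Dataformats.</title>
  <use>Describes Dataformats</use>
  <item>
    <name>JSON</name>
    <use>Data Serialization</use>
    <link type="website" href="https://www.json.org" />
    <link type="wikipedia" href="https://en.wikipedia.org/wiki/JSON" />
  </item>
  <item>
    <name>XML</name>
    <use>Documents</use>
    <link type="website" href="https://www.xml.com/" />
    <link type="wikipedia" href="https://en.wikipedia.org/wiki/XML" />
  </item>
</catalog>
"""

root = ET.fromstring(EXAMPLE_XML)

# Counterpart of /catalog/item[2]: positions start at 1.
second_name = root.find("./item[2]").findtext("name")
print(second_name)

# last() and last()-1 count from the end of the matching siblings.
last_name = root.findtext("./item[last()]/name")
before_last_name = root.findtext("./item[last()-1]/name")
print(last_name, before_last_name)

# Counterpart of //link[@type='wikipedia']: filter on an attribute value.
wiki_links = root.findall(".//link[@type='wikipedia']")
wiki_hrefs = [link.get("href") for link in wiki_links]
print(wiki_hrefs)
```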

Getting Node Text

To get the text inside a node use /text() to select it.

Example: Get the text from the first item's name:

//item[1]/name/text()
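For comparison, here is a hedged Python sketch of the same lookup. It assumes the standard-library xml.etree.ElementTree module, which has no text() step, so the text is read from the selected element instead. The document below is an abbreviated copy of the example (links omitted) and the license tag is hypothetical, used only to show the fallback value.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example document: link tags are left out.
EXAMPLE_XML = """\
<catalog>
  <title>An Overview of Dataformats.</title>
  <use>Describes Dataformats</use>
  <item>
    <name>JSON</name>
    <use>Data Serialization</use>
  </item>
  <item>
    <name>XML</name>
    <use>Documents</use>
  </item>
</catalog>
"""

root = ET.fromstring(EXAMPLE_XML)

# Select the element, then read its text content.
first_name = root.find("./item[1]/name").text
print(first_name)

# findtext() combines the lookup and the text access in one call.
same_name = root.findtext("./item[1]/name")

# If nothing matches, findtext() returns the given default.
# The license tag is hypothetical and does not exist in the document.
missing = root.findtext("./item[1]/license", default="n/a")
print(same_name, missing)
```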

Getting Attribute Text

To get the text from an attribute one can prefix the attribute name with an @.

Example: To get all URLs from the link elements from the example document.

//link/@href

Note: In some contexts this selects the attribute as a key-value pair. To get only the value, wrap it like this: concat('',//link/@href) (keeps only the first element) or string-join(//link/@href,' ') (keeps all elements but requires XPath 2.0 support).

Example using xqilla to output all link URLs from the example document using newlines as delimiters. The printf is used to convert the \n to a real newline.

printf "string-join(//link/@href,'\n')" |
xqilla /dev/stdin -p -i xpath-example.xml
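A rough Python counterpart of the attribute query is sketched below. It assumes the standard-library xml.etree.ElementTree module, which cannot select attributes with a /@href step. The sketch therefore selects the link elements and reads the attribute from each one, which is a different technique from the XPath above but yields the same values.

```python
import xml.etree.ElementTree as ET

# The example document from this page (XML declaration omitted).
EXAMPLE_XML = """\
<catalog>
  <title>An Overview of Dataformats.</title>
  <use>Describes Dataformats</use>
  <item>
    <name>JSON</name>
    <use>Data Serialization</use>
    <link type="website" href="https://www.json.org" />
    <link type="wikipedia" href="https://en.wikipedia.org/wiki/JSON" />
  </item>
  <item>
    <name>XML</name>
    <use>Documents</use>
    <link type="website" href="https://www.xml.com/" />
    <link type="wikipedia" href="https://en.wikipedia.org/wiki/XML" />
  </item>
</catalog>
"""

root = ET.fromstring(EXAMPLE_XML)

# Select every link element, then read its href attribute as a plain string.
hrefs = [link.get("href") for link in root.findall(".//link")]

# Joining with newlines mirrors string-join(//link/@href, '\n').
joined = "\n".join(hrefs)
print(joined)
```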

Combine multiple XPaths

Sometimes the result of one XPath isn't enough and you want the combined results of multiple XPaths. (Like a SQL union)

To achieve this you join multiple XPaths using a pipe | character like this: /xpath1 | /xpath2
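Python's standard-library xml.etree.ElementTree module does not support the | operator, so the sketch below emulates it. It assumes that a union should contain each matching node once and in document order. The union helper is a hypothetical name introduced for this illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example document: link tags are left out.
EXAMPLE_XML = """\
<catalog>
  <title>An Overview of Dataformats.</title>
  <use>Describes Dataformats</use>
  <item>
    <name>JSON</name>
    <use>Data Serialization</use>
  </item>
  <item>
    <name>XML</name>
    <use>Documents</use>
  </item>
</catalog>
"""


def union(root, *paths):
    """Emulate path1 | path2: merge the matches, drop duplicates, keep document order."""
    wanted = set()
    for path in paths:
        wanted.update(id(el) for el in root.findall(path))
    # root.iter() walks the tree in document order.
    return [el for el in root.iter() if id(el) in wanted]


root = ET.fromstring(EXAMPLE_XML)

# The name elements are matched by two of the paths but appear only once.
combined = union(root, ".//name", "./title", "./item/name")
combined_texts = [el.text for el in combined]
print(combined_texts)
```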

XPath with Namespaces

Some of your nodes may be part of a namespace, declared either with the <namespace:tag> prefix syntax or with an xmlns attribute like <tag xmlns="https://example.org">.

Note: When using the xmlns attribute all children of that node are also in that namespace unless declared otherwise.

When a namespace is used you'll notice that your usual queries don't work for some strange reason.

A snippet of an Atom feed serves as the example document here:
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>slatecave.net</title>
    <!-- Rest of feed -->
</feed>

To get the title one may want to try the XPath /feed/title, but this will fail because of the namespace.

If it is possible to declare the namespace (e.g. in an XSLT-Sheet) then do that and select the element with the namespace prefix, e.g. /atom:feed/atom:title.

If it is not possible to declare the namespace you can work around that by using the local-name() function like this: *[local-name()='feed']/*[local-name()='title']
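Both approaches have a counterpart in Python. The sketch below assumes the standard-library xml.etree.ElementTree module: a prefix mapping passed to find() plays the role of the namespace declaration, and the {*} wildcard (available since Python 3.8) is similar in spirit to the local-name() workaround. The atom prefix is an arbitrary choice made in the query, not something the document defines.

```python
import xml.etree.ElementTree as ET

# The Atom feed snippet from this page.
FEED_XML = """\
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>slatecave.net</title>
    <!-- Rest of feed -->
</feed>
"""

feed = ET.fromstring(FEED_XML)  # feed is the root <feed> element

# The plain query finds nothing because title lives in the Atom namespace.
plain = feed.find("./title")
print(plain)

# Declare a prefix for the namespace and use it in the query.
ns = {"atom": "http://www.w3.org/2005/Atom"}
declared_title = feed.find("./atom:title", ns).text
print(declared_title)

# Match the tag name in any namespace (Python 3.8 or newer).
any_ns_title = feed.find("./{*}title").text
print(any_ns_title)
```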


Note: There is also the related RFC 9535: JSONPath: Query Expressions for JSON that might be interesting.

Playing with XPath

Using xmllint

The xmllint command line utility can do XPath.

xmllint <file.xml> --xpath "<xpath>"

Note: xmllint and XPath are the jq of XML. If they seem a bit clunky, that is because they have been around for a freaking long time.

Compatibility: xmllint only works with XPath 1.0. If you get an error stating that a function is unregistered, that's why. It mostly affects functions that somehow involve lists.

Using xqilla

The xqilla command line tool supports XPath 2.0 and a whole lot of other XML functionality. It usually reads commands from a file and applies them to an XML-Document.

echo "<xpath>" | xqilla /dev/stdin -p -i <file.xml>

Getting information from HTML

Example: Get the title of the HTML Document.

curl https://slatecave.net | \
xmllint - --html --xpath "/html/head/title/text()" 2>&-

Example: Get the social media preview description.

curl https://slatecave.net | \
xmllint - --html --xpath \
	"string(
	/html/head/meta[@property='og:description']/@content |
	/html/head/meta[@property='og:description']/text()
	)" 2>&-

Explanation: Both examples use - to tell xmllint to read from its standard input (that is the curl output here), enable the HTML parser with the --html option and discard any errors from xmllint using 2>&-, which tells the shell to close the standard error.