XPath is a way to select tags from an XML document. It works a bit like a file path and a bit like a CSS selector.
For the examples on this page, the following XML document is assumed.
An Overview of Dataformats.
XPath works like a file-path.
Example: To address the title tag (node in XPath terminology).
Note: If you are inside a context (e.g. one of the item elements), you can also use it like a relative file path:
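As a rough illustration of absolute versus relative paths, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module, which implements only a subset of XPath. The catalog document, its `id` attributes, and the title values are made up for this sketch and are not the document used on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document; element values and ids are assumptions.
CATALOG = """\
<catalog>
  <item id="a1"><title>First</title></item>
  <item id="b2"><title>Second</title></item>
  <item id="c3"><title>Third</title></item>
</catalog>"""

root = ET.fromstring(CATALOG)  # root is the <catalog> element

# ElementTree paths are relative to the element they are called on,
# so /catalog/item/title becomes ./item/title when starting at <catalog>.
titles = [t.text for t in root.findall("./item/title")]

# Inside a context node (here the first <item>), a relative path is enough.
first_item = root.find("./item")
relative_title = first_item.find("title").text

print(titles)          # ['First', 'Second', 'Third']
print(relative_title)  # First
```

Full XPath engines accept the absolute form `/catalog/item/title` directly; ElementTree is used here only because it ships with Python.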
For wildcards on a single level one can use a * for the tag name.
Example: Same as previous but with a wildcard.
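The single-level wildcard can be tried out with the sketch below. It assumes Python's standard `xml.etree.ElementTree` and a made-up document in which a hypothetical `special` element sits next to an `item`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <special> exists only to show what * matches.
CATALOG = """\
<catalog>
  <item><title>First</title></item>
  <special><title>Bonus</title></special>
</catalog>"""

root = ET.fromstring(CATALOG)

# Named step: only titles that are children of <item>.
item_titles = [t.text for t in root.findall("./item/title")]

# Wildcard step: titles below any direct child of <catalog>.
any_titles = [t.text for t in root.findall("./*/title")]

print(item_titles)  # ['First']
print(any_titles)   # ['First', 'Bonus']
```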
With // one can select all elements matching the following description, no matter where in the document they are.
Example: Select all three
Note: Using the
/catalog//use will also work.
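To see how // differs from a fixed path, consider this small sketch. It assumes Python's standard `xml.etree.ElementTree` and a made-up document with one title nested deeper inside a hypothetical `group` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with titles at two different depths.
CATALOG = """\
<catalog>
  <item><title>First</title></item>
  <group>
    <item><title>Nested</title></item>
  </group>
</catalog>"""

root = ET.fromstring(CATALOG)

# Fixed path: only titles exactly two levels below <catalog>.
direct = [t.text for t in root.findall("./item/title")]

# .//title searches every depth below the context node.
anywhere = [t.text for t in root.findall(".//title")]

print(direct)    # ['First']
print(anywhere)  # ['First', 'Nested']
```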
To get the attribute of a tag use the notation
Example: To get the
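Attribute access looks slightly different in Python's standard `xml.etree.ElementTree`, because its XPath subset cannot return attribute nodes. The sketch below therefore selects elements and reads the attribute with `.get()`. The `id` attribute and its values are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the id attribute is an assumption.
CATALOG = """\
<catalog>
  <item id="a1"><title>First</title></item>
  <item id="b2"><title>Second</title></item>
  <item><title>No id</title></item>
</catalog>"""

root = ET.fromstring(CATALOG)

# [@id] keeps only items that carry an id attribute;
# .get("id") then reads the attribute value from each element.
ids = [item.get("id") for item in root.findall("./item[@id]")]

print(ids)  # ['a1', 'b2']
```

A full XPath engine would return the attribute values directly for a path ending in `@id`.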
Dealing with Multiple Elements using Predicates
When an XPath yields multiple results, one may not want all of the elements that match a tag.
Writing a number in square brackets works similarly to an array access in a programming language, except that XPath positions start at 1:
Example: Select the second item from the catalog:
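Positional predicates can be checked with the following sketch, which assumes Python's standard `xml.etree.ElementTree` and a made-up three-item catalog. It also shows `last()`, one of the position expressions that ElementTree supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-item catalog; the titles are assumptions.
CATALOG = """\
<catalog>
  <item><title>First</title></item>
  <item><title>Second</title></item>
  <item><title>Third</title></item>
</catalog>"""

root = ET.fromstring(CATALOG)

# Positions are 1-based, so [2] is the second item, not the third.
second = root.find("./item[2]/title").text

# last() addresses the final matching element.
last = root.find("./item[last()]/title").text

print(second)  # Second
print(last)    # Third
```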
In place of the number one can also use expressions like
One can also compare against attributes here.
Example: Only select the
Note on quoting: XPath string literals may use single or double quotes. Using single quotes makes it easy to embed an XPath in a double-quoted XML attribute.
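Comparing against an attribute value, with the single-quoted literal described above, might look like the sketch below. It assumes Python's standard `xml.etree.ElementTree`; the `id` attribute and its values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the id values are assumptions.
CATALOG = """\
<catalog>
  <item id="a1"><title>First</title></item>
  <item id="b2"><title>Second</title></item>
</catalog>"""

root = ET.fromstring(CATALOG)

# The XPath literal uses single quotes, so the whole expression
# fits inside a double-quoted Python string (or XML attribute).
match = root.find("./item[@id='b2']/title").text

# A value that matches nothing yields None instead of raising.
missing = root.find("./item[@id='zz']")

print(match)    # Second
print(missing)  # None
```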
Boolean Logic: Boolean logic is implemented with the
or keywords and the
There are many more possibilities; the w3schools tutorial has a more complete predicate list.
Getting Node Text
To get the text inside a node, append /text() to the path.
Example: Get the text from the first item's name:
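Python's standard `xml.etree.ElementTree` does not implement the `text()` step, so the sketch below uses its `findtext()` method as the closest counterpart. The document and the choice of a `title` child are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names and values are assumptions.
CATALOG = """\
<catalog>
  <item><title>First</title></item>
  <item><title>Second</title></item>
</catalog>"""

root = ET.fromstring(CATALOG)

# findtext() returns the text of the first element matching the path,
# roughly what ./item[1]/title/text() yields in a full XPath engine.
first_text = root.findtext("./item[1]/title")

print(first_text)  # First
```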
Combine multiple XPaths
Sometimes the result of one XPath isn't enough and you want the combined results of multiple XPaths (like a SQL UNION).
To achieve this, join multiple XPaths using a pipe (|) character like this:
/xpath1 | /xpath2
XPath with Namespaces
Some of your nodes may be part of a namespace, i.e. they use the <namespace:tag> syntax or the <tag xmlns="https://example.org"> attribute.
Note: When using the xmlns attribute, all children of that node are also in that namespace unless declared otherwise.
When a namespace is used, your usual queries stop matching, because an unprefixed name in an XPath 1.0 expression only matches elements that are in no namespace.
To get the title one may want to try the XPath /feed/title, but this will fail because of the namespace.
If it is possible to declare the namespace (e.g. in an XSLT stylesheet), then do that and select the element with the namespace prefix, e.g.
If it is not possible to declare the namespace, you can work around that by using the local-name() function like this:
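Both approaches can be sketched with Python's standard `xml.etree.ElementTree`. The feed below is a made-up example that assumes the Atom namespace, and the `atom` prefix is an arbitrary choice. ElementTree has no `local-name()` function; its `{*}` wildcard (Python 3.8 and later) plays a similar role by matching a tag in any namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; the Atom namespace is an assumption for this sketch.
FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
</feed>"""

root = ET.fromstring(FEED)

# The unprefixed name looks for a title in no namespace and finds nothing.
plain = root.find("title")

# Declaring a prefix and using it in the path finds the element.
ns = {"atom": "http://www.w3.org/2005/Atom"}
prefixed = root.find("atom:title", ns).text

# Without a declaration, {*} matches the local name in any namespace.
any_ns = root.find("{*}title").text

print(plain)     # None
print(prefixed)  # Example Feed
print(any_ns)    # Example Feed
```

The prefix used in the query does not have to match the one used in the document; only the namespace URI has to be the same.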
All of the following lead to the w3schools page.
Playing with XPath
The xmllint command line utility can evaluate XPath.
xmllint and XPath are the jq of XML; if they seem a bit clunky, that is because they have been around for a freaking long time.
Getting information from HTML
Example: Get the title of the HTML Document.
Example: Get the social media preview description.
Explanation: Both examples use - to tell xmllint to read from its standard input (that is the curl output here), enable the HTML parser with the --html option, and discard any error output with 2>&-, which tells the shell to close standard error.