XML Paths in Programming Languages

I’ve finally finished my “Progress in Processing” talk for this year’s XML Summer School. It’s been really interesting looking at the different APIs developed for different programming languages in the last few years, all so much easier to use than the DOM. One of the themes is the use of path-based syntax to query XML.

Even with the simpler XML APIs, accessing nodes in an XML tree can be pretty laborious. For example, to get all the <room> elements in the first <floor> element of a house, you write (this is XLinq):

doc.Element("house").Element("floor").Elements("room")

Of course XPath does this pretty well:

/house/floor[1]/room

and many of the APIs that I looked at provided XPath access. For example (this is Ruby’s REXML):

doc.elements["/house/floor[1]/room"]

But using XPath is tricky for a couple of reasons:

  • A cognitive leap is required to switch from the usual object/method dot-notation syntax that you use in the surrounding language to the specialised XPath notation. In particular, it’s difficult mixing the one-based indexing in XPath with the zero-based indexing that’s used in most programming languages.

  • XPaths have to be passed as strings; there’s a temptation to construct the strings automatically, which leads to all sorts of headaches (such as remembering to put quotes around the strings that you concatenate into the XPath when you really want them to be interpreted as strings rather than element names). [A clean way of approaching this would be to use variables in the XPath and pass in a set of variable bindings when you use the XPath, but I don’t know any API that actually does this.]
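Both headaches are easy to reproduce. Here is a minimal sketch in Python, using the standard library's ElementTree (which supports only a small XPath subset) as a stand-in for the APIs above. The sample document and the `rooms_named` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
HOUSE = """
<house>
  <floor name="ground"><room name="kitchen"/><room name="hall"/></floor>
  <floor name="first"><room name="bedroom"/></floor>
</house>
"""

root = ET.fromstring(HOUSE)  # root is the <house> element

# Path positions are one-based: floor[1] is the first <floor> ...
first_by_path = root.find("floor[1]")
# ... but Python lists are zero-based: index 0 is the first <floor>.
first_by_index = root.findall("floor")[0]
same_floor = first_by_path is first_by_index


def rooms_named(house, name):
    """Hypothetical helper that builds a path by string concatenation.

    The quotes around the value are what make it a string literal;
    the helper still breaks if `name` itself contains an apostrophe.
    """
    return house.findall(".//room[@name='" + name + "']")


kitchens = rooms_named(root, "kitchen")
```

The same element is reached with index 1 in the path and index 0 in the list, which is exactly the kind of off-by-one that is easy to get wrong when the two notations sit side by side.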

Because of these issues, there’s been some effort to use the native dot-notation syntax to query XML within general-purpose programming languages. I knew about JAXB before I started looking, but didn’t know before about Uche Ogbuji’s Amara or the details of the VB.NET interface. Whereas with JAXB you have to compile a schema (an XML Schema schema, what’s more) into Java classes, with Amara and VB.NET there’s the kind of dynamic binding you get with XPath. In Amara, for example, you can do:

doc.house.floor.room

while in VB.NET you can use (I think):

doc.<house>.<floor>.First().<room>

(Don’t ask me how to get the rooms on the second floor in VB.NET; that, I couldn’t figure out. In Amara, it’s doc.house.floor[1].room.)

Path-based syntax in general-purpose programming languages is really neat: it exposes XML documents as if they were objects, which makes them “closer” to you as a programmer. They work particularly well for data-oriented XML in which elements contain either elements or text and not both.

There are two main areas where the path-based languages differ.

First, what they do with paths with intermediate steps that select more than one element. For example, in XPath, /house/floor/room gets you all the rooms in all the floors of the house, as does doc.<house>.<floor>.<room> in VB.NET: both provide an implicit iteration over the selected elements in the intermediate steps. In Amara, doc.house.floor.room gets you all the rooms in the *first* floor of the house, so you have to explicitly iterate over the <floor> elements if you want to collect all the rooms in the house.
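The two behaviours can be sketched side by side in Python with the standard library's ElementTree. This is not Amara or VB.NET, just an analogy: a path passed to findall() iterates implicitly over the intermediate steps, while find() stops at the first match, so collecting everything needs an explicit loop. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
HOUSE = """
<house>
  <floor><room name="kitchen"/><room name="hall"/></floor>
  <floor><room name="bedroom"/></floor>
</house>
"""

root = ET.fromstring(HOUSE)  # root is the <house> element

# Implicit iteration: the path visits every <floor>, then every <room>.
all_rooms = [r.get("name") for r in root.findall("floor/room")]

# First-match semantics: find() returns only the first <floor>.
first_floor_rooms = [r.get("name") for r in root.find("floor").findall("room")]

# Explicit iteration recovers the full set from first-match style steps.
collected = [
    r.get("name")
    for floor in root.findall("floor")
    for r in floor.findall("room")
]
```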

Second, how they handle namespaces. In XPath, you have to provide a set of namespace bindings whenever you evaluate an XPath expression, and the prefixes you use on element names are resolved against those namespace bindings. In XPath 1.0, element names with no prefix only match elements in no namespace; in XPath 2.0, you can also provide a default namespace that’s used for names with no prefix.

That works well when XPath is embedded in some XML (such as in XSLT, XForms, XProc and so on), because the namespace bindings from the XML environment can provide the namespace bindings for the XPath expression. But that can’t generally happen when XPaths are used in a programming language.

All the APIs that use XPaths allow you to specify the namespace bindings explicitly, but some, such as REXML, do an automatic namespace binding based on the namespace bindings from the source document. So if I have:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  ...
</xs:schema>

in the document I’m querying then I can use the xs prefix to mean the XML Schema namespace in the path that I use to query the document, such as /xs:schema/xs:element/@name to get the names of the global element declarations.

This makes paths nice and simple… right until you have to use them to process a document that uses different namespace bindings. For example, it’s not uncommon to find XML Schema documents that use the prefix xsd instead of xs, or even the default namespace; for those documents, the automatic binding won’t work and the path /xs:schema/xs:element/@name will give you an error. [REXML also provides XPath.match() and XPath.each(), to which you can provide an explicit set of namespace bindings; you’ll use these if you care about keeping the indirection between prefixes and namespaces.]
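To see why explicit bindings keep that indirection, here is a small Python sketch using the standard library's ElementTree rather than REXML. The three schema snippets and the `global_element_names` helper are invented for the example; the point is that the prefix in the path is resolved against the bindings you pass in, not against the prefixes in the document.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The same schema written with three different namespace bindings.
SCHEMAS = [
    '<xs:schema xmlns:xs="%s"><xs:element name="house"/></xs:schema>' % XSD_NS,
    '<xsd:schema xmlns:xsd="%s"><xsd:element name="house"/></xsd:schema>' % XSD_NS,
    '<schema xmlns="%s"><element name="house"/></schema>' % XSD_NS,
]

# Explicit bindings for the path; independent of the document's prefixes.
BINDINGS = {"xs": XSD_NS}


def global_element_names(schema_text):
    """Hypothetical helper: names of the top-level element declarations."""
    root = ET.fromstring(schema_text)  # root is the schema element
    return [el.get("name") for el in root.findall("xs:element", BINDINGS)]


names = [global_element_names(text) for text in SCHEMAS]
```

All three documents give the same answer, because the path is matched on namespace URI and local name rather than on the prefix the author happened to choose.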

In Amara (when using Pythonic paths), you can just forget about namespaces: the elements and attributes are selected purely based on their local name. The only time you’ll run into problems is if you actually have, in the same context, two elements from different namespaces with the same local name, which is an event that’s rarer than people using different prefixes for a given namespace. In the XML Schema example, you can use doc.schema.element.name (yes, attributes are picked up with the same syntax as elements), and will only have a problem if there’s an <element> element in some other namespace. [Amara also provides XPath-based querying, and you can supply explicit namespace bindings for that.]
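The local-name approach, and the one case where it bites, can be imitated in plain Python. This sketch does not use Amara; it uses the standard library's ElementTree, and `local_name` and `children_named` are made-up helpers that simply strip the namespace part of each tag before comparing.

```python
import xml.etree.ElementTree as ET

# Made-up document: two <element> children from different namespaces.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:other="urn:example:other">
  <xs:element name="house"/>
  <other:element name="not-a-declaration"/>
</xs:schema>
"""


def local_name(element):
    """Drop the '{namespace-uri}' part that ElementTree puts in front of tags."""
    return element.tag.rsplit("}", 1)[-1]


def children_named(parent, name):
    """Select child elements purely by local name, ignoring namespaces."""
    return [child for child in parent if local_name(child) == name]


root = ET.fromstring(SCHEMA)
matches = [el.get("name") for el in children_named(root, "element")]
```

Here the selection picks up both elements, including the one from the unrelated namespace, which is the clash described above.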

In VB.NET, the Imports directive is used to provide global namespace bindings, so it gets the benefits that you would have from using XPath in an XML context. What’s more, you can use a default namespace binding so that you don’t have to use prefixes in your paths. So you can do:

Imports <xmlns:xs="http://www.w3.org/2001/XMLSchema">

and then doc.<xs:schema>.<xs:element>.@name and it will work as planned, no matter what prefixes were actually used in the schema document. Or you can do:

Imports <xmlns="http://www.w3.org/2001/XMLSchema">

and doc.<schema>.<element>.@name. Overall, I think it’s pretty impressive that VB.NET is going to have support for querying XML documents built in at such a low level.

Using default namespaces in paths is a tricky issue, though. I’ll have to dedicate a different post to that; this one’s quite long enough already.

Comments

Re: XML Paths in Programming Languages

thanks

Re: XML Paths in Programming Languages

I came across a few of the issues you mention such as XPath variables and default namespaces on a rather extensive XPath tool home-project of mine.

The main need was to be able to fully test XPath expressions embedded in XSLT just by copying and pasting them into the tool’s XPath checker without modification. While inserting hidden ‘def:’ prefixes for default namespaces, I couldn’t resist colorizing the XPath too once I had a working parser; XPath looks a lot prettier when fully colorized, as the tools currently do with XSLT :-)

Other interesting but rather different problems I came across with XPath were related to XPath auto-generation, predicate-aware auto-completion, and an XPath tracer to allow you to step through an expression, including stepping into predicates.

With auto-completion the issue was with having to deal with numerous un-closed nested/non-nested predicates etc. Many of the problems here are similar to those encountered with an XPath tracer.

On XPath auto-generation from an XML source, it’s fairly routine if the root node is the context node, but I’m going to have to generate XPath up or down the tree from the set context node and I’m not quite sure how to do this yet in a useful way.

With XPath variables I use grouped hashtables and store/load them in an XML serialized object - the next step is to write the XSLT to map the XML to an XSLT source file, so as to auto-populate the variables.

Namespaces with XPath will probably never be ideal. The most recent XML recommendation restricts (I think) some of the more eccentric ways that namespaces can be declared. This may help an XPath solution that deals with overriding of locally declared namespaces, but I haven’t investigated this yet.

Back to the main theme, XPath is now used so pervasively that, though there may be more elegant solutions, I don’t see it being sidelined any time soon, at least, I hope it won’t be, because I quite like it now.

Phil Fearon

Re: XML Paths in Programming Languages

A clean way of approaching this would be to use variables in the XPath and pass in a set of variable bindings when you use the XPath, but I don’t know any API that actually does this.

Well, there’s always printf. In Java:

String xpath = "/cheese[@strength='%s']";
List<Element> smellyCheeses = doc.selectNodes(String.format(xpath, "smelly"));

Re: XML Paths in Programming Languages

That’s part way there; I was really after something that used the XPath syntax for variables, and passed in a hash/dictionary of mappings from variable names to native values that were automatically mapped onto XPath data types. For example:

smellyCheeses = XPath.match(doc,
                            '/cheese[@strength = $strength]',
                            ('strength' => 'smelly'))
interestingLunches = XPath.match(recipes,
                                 '/recipes/recipe/ingredient
                                    [@name = $smellyCheeses/name]',
                                 ('smellyCheeses' => smellyCheeses))

Re: XML Paths in Programming Languages

Regarding namespaces and Amara's XPath functionality, when you construct an object you can send a "prefixes" keyword argument that declares the namespace prefixes you want to use. For example if I had an Atom document where there were different prefixes such as "a" and "atom", I could normalize these to whatever I want. For example:


xml = """<entry xmlns="http://www.w3.org/2005/Atom"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:a="http://www.w3.org/2005/Atom">
  <title>atom title</title>
  <id>urn:some-id-here</id>
  <a:link rel="self" href="http://host.com/blog/this-entry.atom" />
  <atom:link rel="alternate" href="http://host.com/blog/this-entry" />
  <link rel="edit" href="http://host.com/blog/edit/13" />
  <content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml">some content</div>
  </content>
</entry>"""

atom_ns = u'http://www.w3.org/2005/Atom'
# Map each prefix used in the XPath to its namespace URI.
common_prefixes = {
    u'atom': atom_ns,
}
doc = amara.parse(xml, prefixes=common_prefixes)
links = doc.xml_xpath(u'/atom:entry/atom:link')

This can be helpful b/c if you have some elements with prefixes, some without and some with different prefixes, it all just works. It can also be helpful when cleaning up the namespaces to make them consistent.

I'm glad you looked at Amara. It truly has made working with XML a breeze.

Re: XML Paths in Programming Languages

The other big difference between XPaths and path-based syntax in languages is what happens when a particular element doesn’t exist. For example, consider the XPath “/doc/house/floor/room”. The return value is a null if any of <doc>, <house>, <floor>, or <room> doesn’t exist in the document (or if the path in question doesn’t occur in the document). The programming language construct “doc.house.floor.room” throws a null pointer exception if “doc” or “doc.house” or “doc.house.floor” evaluates to null.

That makes a very big difference to your exception handling logic/code. You can just catch the exception each time (at some CPU cost which may or may not be significant), which means wrapping a try/catch construct around every path statement. That makes the code ugly and hard to read, unfortunately. So it turns out to be a big difference in practice.

Cheers, Tony.
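The difference described in the comment above can be seen in a few lines of Python with the standard library's ElementTree, used here only as an analogy. A path query over a missing step quietly returns nothing, while stepping through the same path one call at a time fails as soon as an intermediate element is absent. The sample document and the <cellar> element are made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; it has no <cellar> element.
root = ET.fromstring("<house><floor><room name='kitchen'/></floor></house>")

# Path query: a missing intermediate step just yields an empty result.
missing_by_path = root.findall("cellar/room")

# Step-by-step navigation: find() returns None for the missing <cellar> ...
cellar = root.find("cellar")

# ... so the next step raises unless it is guarded or wrapped in try/except.
try:
    cellar.findall("room")
    raised = False
except AttributeError:
    raised = True
```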