Python and XPath

2008-08-06

Quite a while ago, I decided to experiment with writing unit tests for some Nevow-based application code using XPath instead of DOM. DOM-based assertions in unit tests tend to be quite verbose compared to the XPath equivalent. Beyond that, though, XPath seems able to express the intent of the test more accurately. It doesn’t rely on as many irrelevant details (like how many Text nodes a run of character data is broken up into) - or at least that’s the idea I wanted to try out.
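To make the verbosity point concrete, here is a minimal sketch using only the standard library. It is not the Nevow test code: the markup is invented for illustration, and xml.etree.ElementTree (which supports only a limited subset of XPath) stands in for a full XPath engine.

```python
from xml.dom import minidom
from xml.etree import ElementTree

# Hypothetical markup, invented for this illustration.
markup = "<html><body><div><span>hello</span></div></body></html>"

# DOM style: find the element by hand, then join however many
# Text nodes its content happens to be split into.
dom = minidom.parseString(markup)
span = dom.getElementsByTagName("span")[0]
domText = "".join(
    child.data for child in span.childNodes
    if child.nodeType == child.TEXT_NODE)

# Path style: a single expression names the node the test cares about.
# The path is relative to the root <html> element.
pathText = ElementTree.fromstring(markup).findtext("./body/div/span")

print(domText, pathText)
```

The DOM version has to say how to reach the text and how to reassemble it, while the path version only says which node is of interest.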

Unfortunately I only wrote a few tests like this and shortly afterwards I started focusing on non-XML tasks. As far as the experiment went, I think it went well, but it’s basically incomplete at this time.

That’s not what this post is about.

A while ago, Ubuntu’s Hardy Heron was released. In Hardy, the Python package xml.xpath has vanished. Apparently the package was unmaintained for some time and either Ubuntu or Debian or some combination of the two decided that it would no longer be supported. Sad for me, because now all my XPath-based tests are broken on Ubuntu, which happens to be my primary platform. It seems there are some other XPath libraries for Python available in Hardy, so today I tried to update my tests so that they will work on systems which still have xml.xpath but can also try to use a newer XPath library if xml.xpath is unavailable.

Unfortunately (again) this turned out to be more difficult than I expected. The fall-back library I selected is lxml.etree. The first problem I encountered is that the Python API for processing XPath in lxml.etree is incompatible with the API in xml.xpath. So instead of just changing a couple of imports, as I had hoped I would be able to do, I wrote a couple of thin wrappers to expose both APIs in the same way. The second problem I encountered is that the result objects returned by lxml.etree have a different API from the result objects returned by xml.xpath. So I changed my thin wrappers to be more specific, extracting the exact data my unit tests required, rather than being somewhat more general and letting each test grab what it needed. This meant widening the API from one wrapper to two.

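try:
    # Prefer the old xml.xpath API where it is still installed.
    from xml.xpath import Evaluate
except ImportError:
    # Fall back to lxml, which parses the document itself.
    from lxml.etree import XPath, fromstring
    def evaluateXPath(path, document):
        return XPath(path).evaluate(fromstring(document))
    def evaluateTextXPath(path, document):
        # lxml returns text results as strings already.
        return evaluateXPath(path, document)[0]
else:
    # xml.xpath evaluates against a DOM tree, so parse with minidom first.
    from xml.dom import minidom
    def evaluateXPath(path, document):
        return Evaluate(path, minidom.parseString(document))
    def evaluateTextXPath(path, document):
        # DOM Text nodes expose their content as wholeText.
        return evaluateXPath(path, document)[0].wholeText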

After solving these two problems, I had two unit tests passing on my Hardy system. Not a complete failure. I have more than two unit tests that use XPath, though. As it turns out, the next set of tests I looked at needs to examine the values of attributes of nodes. This is another area where xml.xpath and lxml.etree present different APIs. If I want to make these tests pass, I need to write another wrapper function to expose attributes in a uniform manner. Here’s where I give up on fixing this for the moment. It seems it’s not reasonable to expect to transparently switch between xml.xpath and lxml.etree.
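For the record, the missing attribute wrapper might look roughly like the sketch below. The names attributeValue and attributeValues are hypothetical, and the sketch rests on an assumption I have not verified against the failing tests: that xml.xpath hands back DOM Attr nodes (with the text in .value), while lxml.etree hands back attribute matches as string subclasses. The demonstration uses a minidom Attr node and a plain string as stand-ins for the two result shapes, so it runs without either XPath library.

```python
from xml.dom import minidom


def attributeValue(result):
    """Return the string value of one XPath attribute match (hypothetical helper)."""
    # Assumption: xml.xpath yields DOM Attr nodes, whose text is in .value.
    value = getattr(result, "value", None)
    if value is not None:
        return value
    # Assumption: lxml.etree yields string subclasses, usable as they are.
    return result


def attributeValues(results):
    """Normalize a whole XPath result list of attribute matches."""
    return [attributeValue(result) for result in results]


# Stand-ins for the two result shapes: a minidom Attr node and a plain string.
document = minidom.parseString('<a href="http://example.com/"/>')
attr = document.documentElement.getAttributeNode("href")
print(attributeValues([attr, "http://example.com/"]))
```

A third wrapper in the style of the other two would then just pass the result of evaluateXPath(path, document) through attributeValues. Whether that is enough for the real tests is exactly the part I have not checked.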

Perhaps there’s another XPath library out there I can use instead of lxml.etree. Anyone have any tips they’d like to share?