Wednesday, August 6, 2008

Python and XPath

Quite a while ago, I decided to experiment with writing unit tests for some Nevow-based application code using XPath instead of DOM. DOM-based assertions in unit tests tend to be quite verbose compared to the XPath equivalent. Beyond that, though, XPath seems able to express the intent of the test more accurately. It doesn't rely on as many irrelevant details (like how many Text nodes some CDATA is broken up into), or at least that's the idea I wanted to try out.
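
To make the comparison concrete, here is a minimal sketch rather than code from the actual test suite. It uses only the standard library: xml.dom.minidom for the DOM side, and the limited path subset in xml.etree.ElementTree as a stand-in for a real XPath engine. The markup and the names domText and pathText are made up for illustration.

```python
from xml.dom import minidom
from xml.etree import ElementTree

# Hypothetical markup: the paragraph text is split between plain text and CDATA.
markup = "<html><body><p>Hello, <![CDATA[world]]></p></body></html>"

# DOM style: find the node, then stitch its text pieces back together by hand.
dom = minidom.parseString(markup)
paragraph = dom.getElementsByTagName("p")[0]
domText = "".join(
    child.data
    for child in paragraph.childNodes
    if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
)

# Path style: one expression names the node and the text comes back whole.
tree = ElementTree.fromstring(markup)
pathText = tree.findtext("./body/p")
```

Both variables end up holding the same string, but only the DOM version had to care how the parser chose to split the text.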

Unfortunately I only wrote a few tests like this and shortly afterwards I started focusing on non-XML tasks. As far as the experiment went, I think it went well, but it's basically incomplete at this time.

That's not what this post is about.

A while ago, Ubuntu's Hardy Heron was released. In Hardy, the Python package xml.xpath has vanished. Apparently the package was unmaintained for some time and either Ubuntu or Debian or some combination of the two decided that it would no longer be supported. Sad for me, because now all my XPath-based tests are broken on Ubuntu, which happens to be my primary platform. It seems there are some other XPath libraries for Python available in Hardy, so today I tried to update my tests so that they will work on systems which still have xml.xpath but can also try to use a newer XPath library if xml.xpath is unavailable.

Unfortunately (again) this turned out to be more difficult than I expected. The fall-back library I selected is lxml.etree. The first problem I encountered is that the Python API for processing XPath in lxml.etree is incompatible with the API in xml.xpath. So instead of just changing a couple imports as I had hoped I would be able to do, I wrote a couple thin wrappers to expose both APIs in the same way. The second problem I encountered is that the result objects returned by lxml.etree have a different API than the result objects returned by xml.xpath. So I changed my thin wrappers to be more specific, extracting the exact data my unit tests required, rather than being somewhat more general and letting each test grab what it needed. This meant widening the API to two wrappers from one.


try:
    from xml.xpath import Evaluate
except ImportError:
    # xml.xpath is gone (as on Hardy), so fall back to lxml.etree.
    from lxml.etree import XPath, fromstring

    def evaluateXPath(path, document):
        return XPath(path).evaluate(fromstring(document))

    def evaluateTextXPath(path, document):
        # lxml.etree returns text results as strings already.
        return evaluateXPath(path, document)[0]
else:
    from xml.dom import minidom

    def evaluateXPath(path, document):
        return Evaluate(path, minidom.parseString(document))

    def evaluateTextXPath(path, document):
        # xml.xpath returns DOM Text nodes; pull the string out of the first.
        return evaluateXPath(path, document)[0].wholeText


After solving these two problems, I had two unit tests passing on my Hardy system. Not a complete failure. I have more than two unit tests that use XPath, though. As it turns out, the next set of tests I looked at needs to examine the values of attributes of nodes. This is another area where xml.xpath and lxml.etree present different APIs. If I want to make these tests pass, I need to write another wrapper function to expose attributes in a uniform manner. Here's where I give up on fixing this for the moment. It seems it's not reasonable to expect to transparently switch between xml.xpath and lxml.etree.
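
For illustration only, a third wrapper of that kind might look roughly like the sketch below. It assumes that xml.xpath hands back DOM Attr nodes (which carry their text in .value) for an attribute path, and that lxml.etree hands back string-like objects. The helper name attributeValues is hypothetical, and the two result shapes are simulated with the standard library so the sketch runs without either XPath package installed.

```python
from xml.dom import minidom


def attributeValues(results):
    """Normalize attribute results of either assumed shape to plain strings."""
    values = []
    for result in results:
        # DOM Attr nodes expose their text as .value; anything without a
        # .value attribute is assumed to be string-like already.
        values.append(str(getattr(result, "value", result)))
    return values


# Stand-ins for the two result shapes, built without any XPath library.
document = minidom.parseString('<a href="http://example.com/">link</a>')
domStyle = [document.documentElement.getAttributeNode("href")]
stringStyle = ["http://example.com/"]
```

With something like this, a test could compare attributeValues(...) against a list of expected strings regardless of which library produced the results. Each new kind of result would still need its own wrapper, which is exactly the problem.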

Perhaps there's another XPath library out there I can use instead of lxml.etree. Anyone have any tips they'd like to share?

10 comments:

  1. yay having multiple elementtree implementations, whose package names vary with python version! see our sweet import skills at http://openidenabled.com/files/python-openid/repos/2.x.x/openid/oidutil.py

  2. Well, I would just standardize on ElementTree API if I were you. I think lxml is the canonical implementation these days. There's also some XPath implementation that operates on ElementTree objects in Genshi, I think. At one time, I tried my own hand at a separate partial XPath implementation that would work on ElementTree elements, but I never really completed it. If you want it, I can send it to you, though.

  3. Amara has an XPath implementation. But it's probably Yet Another Object Form, and so it'd have all the same problems. I think it's pure-Python, so at least you can rely on it in some sense, though I think it's also big. Generally I'd be reluctant to use anything from the xml package in Python except xml.etree.

  4. I don't understand why you didn't use the XPath implementation that is already in Twisted? We use XPath for unit tests for our XMPP components at Chesspark, and it works great. The one feature we needed in the partial implementation that exists was very easy to add. Also, twisted's xpath implementation is faster than the C ones since it is not a full implementation.

  5. Mostly because it's not as well documented as other XPath implementations, it's incomplete (as you say), and I'm not interested in implementing the features I want that are missing. I'm looking for a library to use, not to maintain (I have enough of those already :). Performance is also not a concern in this case.

    And actually, like the DOM implementation in Twisted Words, I wish the XPath implementation didn't exist at all. There's plenty of useful stuff that Twisted Words does and could do that nothing else does - DOM and XPath don't really fall into that category.

  6. Sigh. You know you belong to another era when you keep stumbling upon people that seem to think it's a bad thing that you've designed an API that has been successfully implemented by multiple independent developers and that is available for all Python versions since 1.5.2...

  7. I don't think that it's a bad thing you've designed such an API. I just wish Python had more formal ways to say "this implements this API completely," and "get me something that implements this API," and that implementers remember exception types are part of the API too. ... otherwise the API user still has to be aware of all the different implementations, which is sorta different.

  8. I understand the XPath API in Twisted is not documented. It took me several hours to look for the parse(file.xml) method. Then, I found out about lxml.etree.

    Still, I would be curious to read the snippet that parses an xml file with the Twisted xml API.

  9. You might want to look at vtd-xml for the best possible XPath query performance.

    vtd-xml

  10. Thanks. I hadn't heard of vtd-xml before. It looks like it's not available for Python, though. For now, my use-cases are mostly convenience driven rather than performance driven. It's good to know there's a high-performance library out there, though.
