Monday, April 30, 2007

A quick Python recipe for a validating XML parser

ElementTree has been part of the Python standard library since version 2.5, but as good as it is, it has no support for XML Schema validation and only limited support for XPath. For those you need lxml, which builds on the foundation of ElementTree.

Here are a few lines of Python that use lxml to validate an XML document against a schema document:

from lxml import etree

# Parse the schema document
xsd = etree.ElementTree(file = 'schema.xsd')

# Build an XMLSchema object from the parsed document
xsv = etree.XMLSchema(xsd)

# Validate the document against the schema; validate() returns True or False
doc = etree.ElementTree(file = 'doc.xml')
valid = xsv.validate(doc)

And that's it!
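
To see why a document failed rather than just whether it failed, the validator's error log can be inspected after the call. The following is a minimal sketch, not part of the original recipe: the tiny schema and the two documents are made up for illustration and are parsed from memory so the example runs without any files on disk.

```python
from io import BytesIO
from lxml import etree

# Hypothetical schema for illustration: a single <amount> element holding an integer
XSD = b"""<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="amount" type="xs:integer"/>
</xs:schema>"""

schema = etree.XMLSchema(etree.parse(BytesIO(XSD)))

# Two made-up documents: one that fits the schema and one that does not
good = etree.parse(BytesIO(b"<amount>7</amount>"))
bad = etree.parse(BytesIO(b"<amount>seven</amount>"))

good_ok = schema.validate(good)
bad_ok = schema.validate(bad)

# error_log holds the problems reported by the most recent validation
messages = [(entry.line, entry.message) for entry in schema.error_log]
for line, message in messages:
    print("line %s: %s" % (line, message))
```

If an exception is more convenient than a boolean, `assertValid(doc)` raises `etree.DocumentInvalid` for an invalid document.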

If you also want to perform XPath operations, here are a few examples:

# continuing from above

# Find all nodes with the tag name amount
nodes = doc.xpath('//amount')

# Find all amount nodes whose value attribute equals 7
nodes = doc.xpath('//amount[@value=7]')

# Need a namespace? Supply a prefix-to-URI dictionary
nodes = doc.xpath('//cdf:amount', namespaces={'cdf' : 'http://uri.namespace.org/1.0'})
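
For something that can be pasted and run as-is, here is a small sketch of the same queries against an in-memory document. The sample XML is invented for illustration; only the cdf namespace URI is taken from the snippet above.

```python
from lxml import etree

# Made-up sample document using the cdf namespace from the example above
XML = b"""<cdf:report xmlns:cdf="http://uri.namespace.org/1.0">
  <cdf:amount value="7">first</cdf:amount>
  <cdf:amount value="9">second</cdf:amount>
</cdf:report>"""

root = etree.fromstring(XML)
ns = {'cdf': 'http://uri.namespace.org/1.0'}

# Every amount element in the namespace
all_amounts = root.xpath('//cdf:amount', namespaces=ns)

# Only the amount elements whose value attribute equals 7
sevens = root.xpath('//cdf:amount[@value=7]', namespaces=ns)

# XPath can also return text nodes instead of elements
texts = root.xpath('//cdf:amount/text()', namespaces=ns)
```

The prefix in the dictionary does not have to match the prefix used in the document; only the URI has to match.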


Later: lxml uses libxml2 under the hood to do its magic. Apparently, there are some bugs: when I tried to validate XCCDF documents, errors were generated. This forced me into actually using C++ to build a schema validator, which was kind of useful, seeing as that's what I was supposed to be doing in the first place.
