Parsing XML in Python

XML is a popular data format because of its flexibility and extensibility, and it is widely used to describe hierarchical data. Any data format needs at least two operations: reading and writing. XML data is essentially a tree. Once you have the tree in memory, you can generate XML by walking through all of its nodes. The harder problem is reading an XML tree from a text stream, text file, or text string. This process is called parsing.
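To make the "walk the tree to generate XML" idea concrete, here is a minimal sketch. The (tag, attributes, children) tuple layout and the to_xml name are assumptions made up for this illustration, not part of any library. Only the escaping helpers come from the standard library.

```python
from xml.sax.saxutils import escape, quoteattr


def to_xml(node):
    """Serialize a hypothetical (tag, attributes, children) tuple to XML text.

    Each child is either another such tuple or a plain text string.
    """
    tag, attrs, children = node
    # Render attributes as name="value" pairs with proper quoting.
    attr_text = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in attrs.items())
    # Walk the children: escape text, recurse into nested elements.
    body = "".join(
        escape(child) if isinstance(child, str) else to_xml(child)
        for child in children
    )
    return "<%s%s>%s</%s>" % (tag, attr_text, body, tag)


tree = ("note", {"lang": "en"}, [("to", {}, ["Ann"]), "a < b"])
xml_text = to_xml(tree)
```

Writing is the easy direction because the structure is already known. Parsing has to recover that structure from flat text.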

Technically, there are two standard APIs for working with XML: SAX and DOM. SAX is an event-driven parsing API designed for general XML. To use SAX, you handle events for start tags, end tags, character data, and other special constructs. DOM is a standard for describing hierarchical data in an object-oriented way. So you can use SAX alone, either to build your own structure by hand or to query data while walking through the XML stream, or you can use SAX to parse the XML and build a DOM tree. A DOM tree usually keeps all of the data in the XML document without loss, so you can generate XML from it whenever you want. The trade-off is memory: a DOM tree works well, but it holds the whole document in memory at once.
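For reference, the event-driven style looks roughly like the sketch below, which uses the standard xml.sax module. The EventRecorder class and the record_events helper are hypothetical names chosen for this illustration. The handler simply records each event instead of building a tree.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records every SAX event it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called once for each opening tag, with its attributes.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        # Called once for each closing tag.
        self.events.append(("end", name))

    def characters(self, content):
        # Text may arrive in several chunks; skip whitespace-only ones.
        if content.strip():
            self.events.append(("text", content.strip()))


def record_events(xml_text):
    """Parse xml_text and return the list of recorded events."""
    handler = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


events = record_events('<a x="1"><b>hi</b></a>')
```

Nothing is kept in memory except what the handler chooses to store, which is why SAX suits large documents better than DOM.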

Fortunately, Python supports both SAX and DOM. In addition, Python offers MiniDOM (xml.dom.minidom) as a lightweight alternative implementation of the DOM. In this post, I will not go into detail about implementing a parser with SAX. Instead, I will go straight to reading the whole XML document at once to create a DOM or MiniDOM tree using the standard APIs.

The first example is DOM.

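import sys
from xml.dom.ext.reader import Sax2  # part of the legacy PyXML package

reader = Sax2.Reader()
# Parse from a string s that holds the XML text ...
doc = reader.fromString(s)
# ... or parse from a stream such as standard input.
doc = reader.fromStream(sys.stdin)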

The second example is MiniDOM.

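import sys
from xml.dom import minidom

# Parse from a string s that holds the XML text ...
doc = minidom.parseString(s)
# ... or parse from a file-like object such as standard input.
doc = minidom.parse(sys.stdin)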

Both DOM and MiniDOM have very similar APIs, so you can switch from MiniDOM to DOM whenever you want with only slight modifications. I always use MiniDOM.
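Once you have a MiniDOM document, querying it and writing it back out might look like the sketch below. The SAMPLE document and the list_books and round_trip helpers are assumptions invented for this illustration. The minidom calls themselves (parseString, getElementsByTagName, getAttribute, toxml) are standard.

```python
from xml.dom import minidom

# Hypothetical sample document used only for this illustration.
SAMPLE = '<library><book id="1">Dune</book><book id="2">Emma</book></library>'


def list_books(xml_text):
    """Return (id, title) pairs for every <book> element in xml_text."""
    doc = minidom.parseString(xml_text)
    books = []
    for node in doc.getElementsByTagName("book"):
        # In this sample, firstChild is the text node holding the title.
        books.append((node.getAttribute("id"), node.firstChild.data))
    return books


def round_trip(xml_text):
    """Parse xml_text and serialize the root element back to a string."""
    doc = minidom.parseString(xml_text)
    return doc.documentElement.toxml()


books = list_books(SAMPLE)
```

The round_trip helper shows the point made earlier: because the tree keeps the document's data, you can turn it back into XML at any time.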
