Working with XML and namespaces in Python

XML is an important part of programming today. Many applications rely on it because they need to store hierarchical data in persistent storage, that is, in files. This is especially true of Java applications because, in my opinion, Java has the largest set of XML libraries, covering everything from top to bottom. In other words, one can do almost anything with XML in Java very easily. Unfortunately, I am not a good Java programmer by any measure. My favorite language is Python and my OS is Linux. XML support in Python is not as good as it is in Java. That's a problem.

My main work also involves XML, and the programming languages are Python and PHP. Python is the best in my opinion; however, PHP is easier for other people. Either way, the core code is still written in Python. To work with XML, I just need 2 functions:

  1. Parse or read XML into an in-memory data structure
  2. Generate or write in-memory data structures in XML format
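
The two functions above can be sketched with the standard library's ElementTree module. This is only a minimal illustration: the sample document and the names `parse_xml` and `generate_xml` are made up for this example, not taken from my real code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
SAMPLE = "<catalog><book id='1'><title>Dive In</title></book></catalog>"


def parse_xml(text):
    """Function 1: read XML text into an in-memory Element tree."""
    return ET.fromstring(text)


def generate_xml(root):
    """Function 2: write an in-memory Element tree back out as XML text."""
    return ET.tostring(root, encoding="unicode")


root = parse_xml(SAMPLE)
titles = [book.findtext("title") for book in root.findall("book")]

# Modify the tree in memory, then serialize it again.
new_book = ET.SubElement(root, "book", {"id": "2"})
ET.SubElement(new_book, "title").text = "Second"
output = generate_xml(root)
```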

That's all. There are several implementations of XML interfaces for Pythonistas like me. The most official one is PyXML, which is a heavyweight implementation. It is written almost entirely in pure Python, except for the low-level parser. Its APIs are very Pythonic. I like it! However, it consumes a lot of memory and is slow on big XML files. The next official one, as of Python 2.5, is ElementTree, which is also implemented in Python. ElementTree was not developed by extending PyXML; instead, it is a rewrite aimed at lightweight applications. It is a kind of de facto standard for XML in Python, since several modules are compatible with its APIs, including its C implementation (cElementTree). The highest performance can be obtained from lxml.

Another issue is namespaces. The original ElementTree APIs didn't cover everything needed to handle namespaces effectively. Namespaces are very important because they make it possible to embed elements from a specific vocabulary inside a main document. My own work doesn't need such complex namespace support, so I don't have a realistic XML document of my own to show yet. I will use someone else's XML as an example.

Look at GDACS: its main data structure is RSS, but each item embeds many custom data types. Then look at EDXL: it is a kind of container object that can wrap GDACS for transmission between systems. In practice, GDACS should embed into EDXL cleanly. The result is a GDACS to EDXL conversion. Take a closer look at the result: item uses the blank prefix to represent RSS, while the enclosing element uses the blank prefix to represent EDXL. This is handled by the scope of each element's namespace declaration.
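
To make the scoping concrete, here is a small sketch using the standard library's ElementTree. The document is a simplified stand-in, and the namespace URIs (`urn:example:...`) are placeholders I made up; they are not the real GDACS or EDXL namespace URIs. The point is only that the same blank prefix resolves to a different namespace depending on where it is declared.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in document. All namespace URIs are placeholders.
DOC = """
<EDXLDistribution xmlns="urn:example:edxl">
  <contentObject>
    <item xmlns="urn:example:rss" xmlns:gdacs="urn:example:gdacs">
      <title>Earthquake alert</title>
      <gdacs:alertlevel>Orange</gdacs:alertlevel>
    </item>
  </contentObject>
</EDXLDistribution>
"""

# Prefixes used for lookups are local to this script; they do not have
# to match the prefixes (or lack of prefixes) used in the document.
NS = {
    "edxl": "urn:example:edxl",
    "rss": "urn:example:rss",
    "gdacs": "urn:example:gdacs",
}

root = ET.fromstring(DOC)

# The outer blank prefix is in scope here, so the root is an EDXL element.
item = root.find("edxl:contentObject/rss:item", NS)

# Inside item the blank prefix has been redeclared, so title is RSS.
title = item.find("rss:title", NS)
level = item.findtext("gdacs:alertlevel", namespaces=NS)
```

ElementTree stores each name in `{uri}local` form, so `root.tag` and `item.tag` carry different URIs even though neither is written with a prefix in the source.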

To convert GDACS to EDXL, I need 2 things.

  1. GDACS Parser
  2. EDXL Generator
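
A rough sketch of those two pieces, using only the standard library. Everything here is assumed for illustration: the namespace URIs are placeholders, the feed is invented, the container layout is much simpler than real EDXL, and `parse_items` and `build_container` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, assumed for illustration only.
EDXL_NS = "urn:example:edxl"
RSS_NS = "urn:example:rss"

# Invented feed standing in for a GDACS RSS document.
FEED = """
<rss xmlns="urn:example:rss">
  <channel>
    <item><title>Flood alert</title></item>
    <item><title>Cyclone alert</title></item>
  </channel>
</rss>
"""


def parse_items(feed_text):
    """Parser half: return every item element found in the feed."""
    root = ET.fromstring(feed_text)
    return root.findall(".//{%s}item" % RSS_NS)


def build_container(items, distribution_id):
    """Generator half: wrap each item in a simplified container element."""
    root = ET.Element("{%s}EDXLDistribution" % EDXL_NS)
    ET.SubElement(root, "{%s}distributionID" % EDXL_NS).text = distribution_id
    for item in items:
        content = ET.SubElement(root, "{%s}contentObject" % EDXL_NS)
        content.append(item)
    return root


# Choose readable prefixes for serialization instead of ns0, ns1, ...
ET.register_namespace("edxl", EDXL_NS)
ET.register_namespace("rss", RSS_NS)

container = build_container(parse_items(FEED), "demo-1")
xml_text = ET.tostring(container, encoding="unicode")
```

The standard library serializer writes prefixed names here rather than redeclaring the blank prefix at each level. To a namespace-aware parser the two forms mean the same thing.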

One question occurred to me. Everyone reads XML and stores it in a DOM. What is the best approach to representing hierarchical data such as XML in memory?

  1. Generic Object or Element
  2. Specific Object inherited from Element with some specialized methods

After working with lxml, I found that its latest version supports both schemes. However, I prefer the second one because each element can then enforce the validity of the schema automatically. Of course, I don't intend to write all the specific classes by hand. It should be possible to generate these classes from XML Schema or RELAX NG. I hope to see something like this sooner or later.
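
To show what the second scheme might look like, here is a minimal sketch. Note that it uses the standard library's ElementTree with a custom element factory, not lxml's own class lookup mechanism, so treat it as an illustration of the idea only. The class `AlertItem`, its required children, and the helper `parse_typed` are all hypothetical.

```python
import xml.etree.ElementTree as ET


class AlertItem(ET.Element):
    """Hypothetical specialized element that knows its own rules."""

    # Assumed rule for this sketch: an item needs a title and a link.
    REQUIRED_CHILDREN = ("title", "link")

    def missing_children(self):
        """Return the names of required children that are absent."""
        return [name for name in self.REQUIRED_CHILDREN
                if self.find(name) is None]

    def is_valid(self):
        return not self.missing_children()


# Tags listed here get a specialized class; everything else stays generic.
CLASS_FOR_TAG = {"item": AlertItem}


def element_factory(tag, attrib):
    """Called by TreeBuilder for every start tag."""
    cls = CLASS_FOR_TAG.get(tag, ET.Element)
    return cls(tag, attrib)


def parse_typed(text):
    """Parse XML text, building specialized elements where configured."""
    builder = ET.TreeBuilder(element_factory=element_factory)
    parser = ET.XMLParser(target=builder)
    parser.feed(text)
    return parser.close()


root = parse_typed(
    "<rss><channel><item><title>Quake</title></item></channel></rss>"
)
item = root.find("channel/item")
problems = item.missing_children()
```

Here `root` is still a generic Element while `item` is an `AlertItem` that can report what it is missing. A class generator driven by XML Schema or RELAX NG would, in principle, emit classes like `AlertItem` instead of my writing them by hand.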
