
Feb 15, 2006

Looking for fast and memory-friendly Python XML processing?

You don't know how to optimize your Python-based XML application anymore? Are you tired of running out of RAM? You've got memory leaks all around?

I was in this situation for a year, until last week.

I spent the last week rewriting a customer application, originally written with cDomlette, to use cElementTree (note that cElementTree didn't exist when I first wrote it).

My experience with cDomlette on this project over the past year has been a real pain, for the reasons described above.

Don't get me wrong about cDomlette. It is a really well-documented library and much better than the DOM libraries available in the standard Python distribution, but you can't use it for applications such as the one I've been working on. It simply isn't enough.

The application I'm talking about is an application for financial auditors. It is based on CPS and makes heavy use of OpenOffice.org Calc documents: lots and lots of OOo Calc documents to process and modify according to complex financial rules on every transaction.

So two weeks ago I had problems with my application, which was being heavily used by several users. It was too slow, and the servers were overloaded processing XML documents. For those who know what I'm talking about, note that the bad performance wasn't only related to the XML processing. Not only...

I couldn't optimize the XML processing part any further, so I decided to recode from scratch all the XML processing modules I had written with cDomlette, using cElementTree this time. And oh dude, it rocks! It really, really rocks! It works amazingly well now. The code is much more readable and maintainable because of the elementtree API, it's fast, and it doesn't consume lots of memory. The same transactions now complete ten times faster than before!
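
To give an idea of what the elementtree API looks like, here is a minimal sketch. It is not the customer application: the `<sheet>`, `<row>` and `<cell>` elements and the `add_tax_cells` helper are made up for illustration and are far simpler than a real OOo Calc document. It uses `xml.etree.ElementTree`, the standard library module that exposes the same API in current Python versions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a spreadsheet document.
SAMPLE = """<sheet>
  <row><cell name="amount">100</cell><cell name="rate">0.2</cell></row>
  <row><cell name="amount">250</cell><cell name="rate">0.1</cell></row>
</sheet>"""


def add_tax_cells(xml_text):
    """Append a computed 'tax' cell to every row and return the new XML."""
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        # Collect the numeric cells of this row by name.
        values = {cell.get("name"): float(cell.text) for cell in row.findall("cell")}
        # Create the new cell directly under the row.
        tax = ET.SubElement(row, "cell", name="tax")
        tax.text = "%.2f" % (values["amount"] * values["rate"])
    return ET.tostring(root, encoding="unicode")


result = add_tax_cells(SAMPLE)
```

Elements behave like plain Python objects (attributes through `get()`, text through `.text`, children through iteration), which is a large part of why code written against this API tends to stay short.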

Thank you, Fredrik Lundh. You're the man!

I'm more than happy, for several reasons:
  • my application is working and my customer is happy.
  • having cElementTree working this way (meaning working so well)
    makes Python a first-choice language for XML processing. Java,
    for instance, doesn't seem to have such a module available.
  • elementtree will soon be available in the Python standard library.

What about lxml?

I considered using lxml in production, but this library is too young right now. These are the problems I ran into when trying to use it in production:
  • The dependency requirements are far too demanding (bleeding-edge
    revisions of libxml2 and libxslt). This makes it hard to use
    everywhere, and it seems to be the reason why Zope3 can't make it
    a dependency, for instance.
  • It is missing elementtree's iterparse() method for SAX-like processing (this one is a real killer feature in elementtree).

lxml should be pretty close to cElementTree in performance, though.
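
For readers who haven't used iterparse(), here is a minimal sketch of the SAX-like style it allows. The document layout (`<sheet>`, `<row>`, `<amount>`) and the `sum_amounts` helper are assumptions made for the example, and it uses `xml.etree.ElementTree` from the standard library, which provides the same `iterparse()` function.

```python
import io
import xml.etree.ElementTree as ET


def sum_amounts(source):
    """Stream through the document and total every <amount> element."""
    total = 0.0
    # "end" events fire once an element has been completely parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "amount":
            total += float(elem.text)
        elif elem.tag == "row":
            # Free the row's children once the row has been processed.
            # The emptied <row> element itself stays attached to the root.
            elem.clear()
    return total


# Hypothetical input: 100 rows with amounts 1..100.
SAMPLE = (
    b"<sheet>"
    + b"".join(b"<row><amount>%d</amount></row>" % n for n in range(1, 101))
    + b"</sheet>"
)
total = sum_amounts(io.BytesIO(SAMPLE))
```

The point is that the whole tree never has to be held in memory at once: each row is handled as soon as it is complete and then emptied, which is what makes this approach attractive for large documents.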

Of course, lxml has some nice features that elementtree does not provide yet, because lxml is based on libxml2 and libxslt. But for 90% of the applications I might need to write, elementtree does the job. And it does it really well.

(Post originally written by Julien Anguenot on the old Nuxeo blogs.)
