02/23/2006
ElementTree, serialization and namespace prefixes
The way ElementTree outputs namespaces in serialized output can be a problem with some applications.

Here is an example of such output:
>>> import cElementTree as etree
>>> stream = """<?xml version="1.0" encoding="UTF-8" ?>
... <doc xmlns="http://bar"
... xmlns:foo="http://foo">
... <foo:sub/>
... </doc>"""
>>>
>>> doc = etree.XML(stream)
>>> print etree.tostring(doc, encoding="UTF-8")
<?xml version="1.0" encoding="UTF-8" ?>
<ns0:doc xmlns:ns0="http://bar">
<ns1:sub xmlns:ns1="http://foo" />
</ns0:doc>
>>>

We can see that the declared namespaces are now given generated aliases (ns0, ns1) and every prefix has been replaced by one of those aliases. This is absolutely correct from an XML point of view, but it can get you into trouble when the application consuming the XML output of your elementtree-based Python program does not handle this properly on its side.
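
The reason the original prefixes disappear is that elementtree keeps only the namespace URI with each tag, not the prefix used in the source. A minimal sketch of this, assuming a current Python 3 where the library ships as xml.etree.ElementTree (the post itself uses the older cElementTree package):

```python
import xml.etree.ElementTree as ET

stream = """<doc xmlns="http://bar" xmlns:foo="http://foo"><foo:sub/></doc>"""
doc = ET.XML(stream)

# Tags are stored as "{namespace-uri}local-name"; the source prefix is not kept.
root_tag = doc.tag
child_tag = doc[0].tag

print(root_tag)   # {http://bar}doc
print(child_tag)  # {http://foo}sub
```

Since the prefix is not stored, the serializer has to pick one for each URI when writing the document out, and it falls back to ns0, ns1, ... when it knows nothing better.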

Here is a workaround I found; I don't know whether others exist:

>>> import cElementTree
>>> import elementtree.ElementTree
>>>
>>> my_namespaces = {'http://foo' : 'foo',
... 'http://bar' : 'bar'}
>>> elementtree.ElementTree._namespace_map.update(my_namespaces)
>>>
>>> stream = """<?xml version="1.0" encoding="UTF-8" ?>
... <doc xmlns="http://bar"
... xmlns:foo="http://foo">
... <foo:sub/>
... </doc>"""
>>>
>>> doc = cElementTree.XML(stream)
>>> print cElementTree.tostring(doc)
<bar:doc xmlns:bar="http://bar">
<foo:sub xmlns:foo="http://foo" />
</bar:doc>
Here, the document has been serialized with the prefixes we registered instead of generated aliases within qualified names.

The idea is to add our own well-known namespace prefixes to elementtree's default ones.

The default elementtree ones are defined in elementtree/ElementTree.py as follows:
_namespace_map = {
# "well-known" namespace prefixes
"http://www.w3.org/XML/1998/namespace": "xml",
"http://www.w3.org/1999/xhtml": "html",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
"http://schemas.xmlsoap.org/wsdl/": "wsdl",
}

This is not the best way I would have hoped to find. Please let me know if you know of any others.
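
One alternative, assuming a newer ElementTree than the one used above: ElementTree 1.3 and the xml.etree.ElementTree module shipped with Python 2.7 and 3.2 onwards expose a public register_namespace() function that fills the same prefix map without touching the private _namespace_map. The sketch below is written for Python 3 and reuses the foo/bar example; it is not the code used in the application described in this post.

```python
import xml.etree.ElementTree as ET

# Public counterpart of updating _namespace_map by hand.
# Note that the registration is global to the process.
ET.register_namespace("foo", "http://foo")
ET.register_namespace("bar", "http://bar")

stream = """<doc xmlns="http://bar" xmlns:foo="http://foo"><foo:sub/></doc>"""
doc = ET.XML(stream)

# encoding="unicode" returns a str instead of bytes (Python 3 only).
serialized = ET.tostring(doc, encoding="unicode")
print(serialized)
```

The root element should now be written as bar:doc and the child as foo:sub, with no ns0/ns1 aliases.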

The problem I ran into recently was with OpenOffice.org 1.1.x (I don't know about version 2, though).

At first I could parse and serialize OpenOffice.org content XML documents and open them in OpenOffice.org. But as soon as I modified a document from within OpenOffice.org, it ignored the namespace prefix aliases when inserting new elements. I used this trick and now OpenOffice.org is happy. I'm going to report this to Laurent to see whether the OpenOffice.org guys are aware of the issue.

I fixed the issue as shown below. I used nmspace.mod from the OOo DTD to find the relevant OOo namespaces.

OOo_NS = "http://openoffice.org/2000/"

OFFICE_NS = "%soffice" % OOo_NS
TABLE_NS = "%stable" % OOo_NS
STYLE_NS = "%sstyle" % OOo_NS
TEXT_NS = "%stext" % OOo_NS
META_NS = "%smeta" % OOo_NS
SCRIPT_NS = "%sscript" % OOo_NS
DRAWING_NS = "%sdrawing" % OOo_NS
CHART_NS = "%schart" % OOo_NS
NUMBER_NS = "%snumber" % OOo_NS
DATASTYLE_NS = "%sdatastyle" % OOo_NS
DR3D_NS = "%sdr3d" % OOo_NS
FORM_NS = "%sform" % OOo_NS
CONFIG_NS = "%sconfig" % OOo_NS

FO_NS = "http://www.w3.org/1999/XSL/Format"
XLINK_NS = "http://www.w3.org/1999/xlink"
SVG_NS = "http://www.w3.org/2000/svg"
MATH_NS = "http://www.w3.org/1998/Math/MathML"
# This will be used for the XML serialization and elementtree.
NAMESPACE_MAP = {
OFFICE_NS : 'office',
TABLE_NS : 'table',
STYLE_NS : 'style',
TEXT_NS : 'text',
META_NS : 'meta',
SCRIPT_NS : 'script',
DRAWING_NS : 'drawing',
CHART_NS : 'chart',
NUMBER_NS : 'number',
DATASTYLE_NS : 'datastyle',
DR3D_NS : 'dr3d',
FORM_NS : 'form',
CONFIG_NS : 'config',
MATH_NS : 'math',
SVG_NS : 'svg',
XLINK_NS : 'xlink',
FO_NS : 'fo',
}

import elementtree.ElementTree as etree
etree._namespace_map.update(NAMESPACE_MAP)
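
As a rough sketch of how such constants are used when building or searching elements from Python: the tag is written in "{uri}local-name" form and the registered prefix only matters at serialization time. This assumes Python 3 with xml.etree.ElementTree and its public register_namespace(); the cell, para and paragraphs names and the cell content are made up for illustration.

```python
import xml.etree.ElementTree as ET

OOo_NS = "http://openoffice.org/2000/"
TEXT_NS = "%stext" % OOo_NS
TABLE_NS = "%stable" % OOo_NS

# Register the preferred prefixes (public API, global to the process).
for uri, prefix in {TEXT_NS: "text", TABLE_NS: "table"}.items():
    ET.register_namespace(prefix, uri)

# Build a hypothetical table cell holding one paragraph.
cell = ET.Element("{%s}table-cell" % TABLE_NS)
para = ET.SubElement(cell, "{%s}p" % TEXT_NS)
para.text = "42"

# Lookups use the same "{uri}local-name" form, never the prefix.
paragraphs = cell.findall("{%s}p" % TEXT_NS)

serialized = ET.tostring(cell, encoding="unicode")
print(serialized)
```

The serialized elements should come out as table:table-cell and text:p rather than with generated aliases.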


Posted by Julien Anguenot @ 02/23/2006 03:56 PM. - Categories: coding, openoffice, python - 1 comment
Looking for fast and memory friendly Python XML processing ?

You don't know how to optimize your Python-based XML application anymore?
Are you tired of running out of RAM? Do you have memory leaks all around?


I was in this situation for a year, until last week.

I spent the last week rewriting a customer application, originally written
with cDomlette, using cElementTree (note that cElementTree didn't exist
when I first wrote it).

My cDomlette experience on this project has been a real pain over the last
year, for the reasons I described above.

Don't get me wrong about cDomlette. It is a really well documented library
and much better than the DOM libraries available in the standard Python
distribution, but you can't use it for applications such as the one I've
been working on. It simply isn't enough.

The application I'm talking about is for financial auditors. It is based
on CPS and makes heavy use of OpenOffice.org Calc documents: lots and lots
of OOo Calc documents to process and modify according to complex financial
rules on every transaction.

Two weeks ago I had problems when several users were hitting the
application heavily. It was too slow and the servers were overloaded
processing XML documents. For those who know what I'm talking about, note
that the bad performance wasn't only related to the XML processing. Not only...

I couldn't optimize the XML processing part any further, so I decided to
recode from scratch all the XML processing modules I had written with
cDomlette, using cElementTree this time. And oh dude, it rocks! It really,
really rocks! It works amazingly well now. The code is much more readable
and maintainable thanks to the elementtree API, it's fast, and it doesn't
consume lots of memory. The same transactions now complete ten times
faster than before!

Thank you Fredrik Lundh. You're the man !


I'm more than happy for several reasons here :

  • my application is working and my customer is happy.

  • having cElementTree work this well makes Python a first-choice
    language for XML processing. Java, for instance, doesn't seem to
    have such a module available.

  • elementtree will soon be available in the Python standard library.

What about lxml ?


I considered using lxml in production, but the library is too young
right now. These are the problems I ran into trying to use it in production:

  • The dependencies are far too demanding (bleeding-edge revisions of
    libxml2 and libxslt). This makes it hard to deploy everywhere, and it
    seems to be the reason why Zope3 can't make it a dependency, for instance.

  • It is missing elementtree's iterparse() method for SAX-like
    processing (this one is a real killer feature of elementtree).

lxml should be pretty close to cElementTree in performance, though.

Of course, lxml has some nice features that elementtree does not
provide yet, because lxml is based on libxml2 and libxslt. But for 90%
of the applications I might need to write, elementtree does the job. And it does it really well.
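
To illustrate why iterparse() matters for memory-friendly processing, here is a minimal sketch, assuming Python 3 with xml.etree.ElementTree. The count_rows helper, the "row" tag and the sample document are hypothetical and have nothing to do with the real application; the point is only the pattern of handling each element as soon as it is complete and then clearing it.

```python
import io
import xml.etree.ElementTree as ET


def count_rows(source, tag="row"):
    """Count <tag> elements while streaming through the document."""
    count = 0
    # "end" events fire once an element and all its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's children, text and attributes so they can
            # be freed; the emptied element itself stays attached to its parent.
            elem.clear()
    return count


sample = io.BytesIO(
    b"<table><row><cell>1</cell></row><row><cell>2</cell></row></table>"
)
print(count_rows(sample))  # 2
```

Unlike a full parse, this never needs the whole populated tree in memory at once, which is what makes it attractive for large documents.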


Posted by Julien Anguenot @ 02/23/2006 01:50 PM. - Categories: coding, cps, nuxeo, openoffice, python, zope3 - 5 comments
02/01/2006
My slides from Solution Linux 2006 about ECM and Zope3
You can grab my slides from Solution Linux 2006 about "ECM et Zope3" here. These slides are in French. The presentation was yesterday.

Posted by Julien Anguenot @ 02/01/2006 11:53 AM. - Categories: ZODB, cps, five, nuxeo, python, rich_client, slides, zope, zope3 -  0 comments
Last modified: 12/21/2005 01:11 AM

All content is copyrighted by their author.