Lucene-based cataloging solution for Zope-3 and CPS-3.4


I think it's time to drop a note to the outside world about what I've been working on for a little while at Nuxeo. I am pretty confident that this project is now at the end of its first iteration.

This post gives you a short overview of the problem we chose to tackle, namely the indexing and searching stack in a Zope and CPS architecture, and of the solution we built. I submitted an abstract to EuroPython this year. Hopefully, I'll have the chance to give you more technical details at the conference in July.

Motivations


CPS is based on Zope, and the standard cataloging solution of Zope today is the ZCatalog. The ZCatalog works really well up to a certain number of indexed documents: that's a fact. ZCatalog extensions such as TextIndexNG are also of great interest.

But, because there is a but, the main problem is that Zope is dealing with a task it shouldn't have to deal with. As a result, the overall performance of the Zope platform itself decreases. If you are not convinced, just try to inject 200k documents, each with 50 fields to be indexed, into a Zope instance (or a Plone one if you wish :)) and watch how your response time evolves when the instance is used both by people writing to the database and by others consulting it and thus searching all the time. At Nuxeo, we tried on large-scale projects. It simply doesn't work well or fast enough for serious deployments. Zope gets really slow...

Anyway, you should consider the ZCatalog as what it is: a hack on top of the ZODB, because the ZODB doesn't provide any native query language or full indexing support.

For those reasons, we needed an indexing and searching solution living outside of Zope for our customer projects.

This also follows our vision of Zope3 as an integration platform for ECM applications, where external services can be plugged in thanks to the flexibility of the Zope3 component architecture and the agility of the Python language.

What is Lucene?


Lucene is an open source project from the Apache Software Foundation, written in Java. It is a high-performance, full-featured text search engine library.

I suggest you check the Lucene website, which contains a lot of useful information and documentation. I also really recommend the Lucene in Action book to anyone interested in working with Lucene, in understanding more deeply how it works, or in using it properly. Some projects, such as Nutch, are described there as case studies, which is more than interesting for anyone who wants to build a system on top of Lucene, since the best practices are described within those case studies.

At Nuxeo, we first integrated Lucene for a customer within the scope of the Apogee project (Apogee is a framework based on Eclipse RCP for ECM rich client applications). It was a real success, so we decided to go further and see how we could leverage Lucene on the server side.

What is PyLucene?


The first time we seriously considered using PyLucene was at last year's EuroPython conference, after Andi Vajda's really great presentation of PyLucene. Andi is the current main PyLucene developer. PyLucene is maintained by the Open Source Applications Foundation.

PyLucene is a GCJ-compiled version of Java Lucene integrated with Python. Its goal is to allow the use of Lucene's text indexing and searching capabilities from Python. It is designed to be API compatible with the latest version of Java Lucene.

PyLucene is freaking fast! It is even faster than the Java Lucene version, according to the authors of the Lucene in Action book. Furthermore, it can easily be kept in sync with the latest Java Lucene releases, since it is not a from-scratch port but a GCJ-compiled version of Java Lucene itself.

NXLucene: standalone Lucene indexing server


NXLucene is a standalone, multi-threaded remote server handling Lucene stores. It takes advantage of the freaking fast PyLucene Python bindings and uses Twisted for its server implementation. It also uses some parts of the Zope3 component architecture. NXLucene currently supports the XML-RPC protocol. (Its roadmap includes an ICE connector for the 1.x branch.) NXLucene can also be seen as a good example of what can be achieved using the best parts of different worlds (Java Lucene, PyLucene, Zope3, Twisted...). Bear in mind that NXLucene does not run on top of the Zope AS. It is standalone.

NXLucene exposes an XML query language for indexing and searching operations. Note that the native Lucene query syntax is of course still supported. Check the NXLucene interfaces for details.

Installing NXLucene also installs its core libraries, which can be used by third-party Python programs. For instance, the query library can help you format your NXLucene XML queries, and the testing library can be really helpful for writing tests for Python components that need to communicate with an NXLucene server.

It is important to note that you can query NXLucene from any language. You only need an XML-RPC client library to do so.

NXLucene is an open source project released under the LGPL, part of the CPS platform project.
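To make the "any language" point concrete, here is a minimal sketch of what an XML-RPC call looks like on the wire, using only the Python standard library. The method name `search` and the query string are assumptions made up for the illustration; they are not taken from the actual NXLucene interfaces.

```python
import xmlrpc.client

# Hypothetical method name and argument, for illustration only.
# The real NXLucene remote API may use different names and parameters.
METHOD = "search"
QUERY = "title:lucene"

# Serialize the call the way an XML-RPC client does before POSTing it.
payload = xmlrpc.client.dumps((QUERY,), methodname=METHOD)
print(payload)

# The payload is plain XML, so any XML-RPC library in any language
# can produce or decode it. Here we decode it back with Python.
params, method = xmlrpc.client.loads(payload)
print(method, params)
```

Against a live server you would hand the same call to `xmlrpc.client.ServerProxy`, pointed at whatever URL your NXLucene instance listens on.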

For more information about NXLucene and its installation you may check the NXLucene website.

nuxeo.lucene: Zope 3 cataloging component


nuxeo.lucene is a cataloging component written on top of the Zope3 application server. It currently offers an XML-RPC proxy to a remote NXLucene server. It also offers an abstraction for cataloging strategies, letting you specify how Python objects should be indexed in and retrieved from a Lucene store through NXLucene. (It is worth noting that any remote server exposing an XML-RPC interface to a Lucene store could theoretically be used.)

Currently, this component is used from CPS through Five. Its integration on top of the Zope3 AS is not finished, since we haven't needed nuxeo.lucene outside of CPS yet. Feel free to participate in its development if you are interested in having nuxeo.lucene fully integrated on top of a stock Zope3 AS.

nuxeo.lucene is an open source project available under the ZPL, part of the CPS platform project.
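The idea of a cataloging strategy can be sketched in a few lines of plain Python. This is not the nuxeo.lucene API: the `Document` class, the `STRATEGY` mapping and the `to_index_fields` helper are all hypothetical names, and the real component presumably relies on the Zope3 component architecture rather than a bare dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for any Python object we want to catalog (hypothetical)."""
    uid: str
    title: str
    tags: list = field(default_factory=list)


# Hypothetical strategy: index field name -> function that extracts
# the value to index from the object.
STRATEGY = {
    "uid": lambda doc: doc.uid,
    "title": lambda doc: doc.title.lower(),
    "tags": lambda doc: " ".join(doc.tags),
}


def to_index_fields(obj, strategy):
    """Apply a strategy to an object and return a flat dict of field values."""
    return {name: extract(obj) for name, extract in strategy.items()}


doc = Document(uid="doc-1", title="Hello Lucene", tags=["zope", "cps"])
print(to_index_fields(doc, STRATEGY))
```

The point of the design is that the object itself knows nothing about the index: changing what gets indexed only means changing the strategy.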

CPSLuceneCatalog: CMF Catalog replacement for CPS-3.4


CPSLuceneCatalog is a product specific to CPS-3.4.x that adds the CPS-specific business rules to nuxeo.lucene. For example, it takes care of the way different versions of CPS documents should be indexed. CPSLuceneCatalog is a complete substitute for the ZCatalog, which shows its limits when dealing with millions of objects. CPSLuceneCatalog will ship with the next major release of CPS, version 4, along with the JackRabbit JCR repository.

CPSLuceneCatalog is almost fully backward compatible with the ZCatalog query syntax, so your code should not break if you migrate. I don't currently support 100% compatibility, but I do support at least the subset of ZCatalog query syntax we have been using in CPS internals.

An upgrade step is already available for CPS 3.4.x instances.

CPSLuceneCatalog is an open source project available under the GPL, part of the CPS platform project.
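To give an idea of what such a compatibility layer has to do, here is a rough sketch that turns ZCatalog-style keyword queries into a native Lucene query string. It is an illustration only, not the CPSLuceneCatalog implementation: the `zcatalog_to_lucene` and `quote` helpers are hypothetical, and only plain values, lists of values and the `{'query': ..., 'operator': ...}` form are handled.

```python
def quote(term):
    """Wrap a term in double quotes, escaping backslashes and quotes."""
    text = str(term).replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % text


def zcatalog_to_lucene(**query):
    """Translate ZCatalog-style keyword arguments into a Lucene query string.

    Every field becomes a required (+) clause; a list of values is
    combined with OR unless an 'operator' is given in the dict form.
    """
    clauses = []
    for name in sorted(query):
        value = query[name]
        operator = "OR"
        if isinstance(value, dict):
            operator = value.get("operator", "or").upper()
            value = value["query"]
        if isinstance(value, (list, tuple)):
            terms = (" %s " % operator).join(quote(v) for v in value)
            clauses.append("+%s:(%s)" % (name, terms))
        else:
            clauses.append("+%s:%s" % (name, quote(value)))
    return " ".join(clauses)


print(zcatalog_to_lucene(portal_type=["Document", "File"],
                         SearchableText="lucene"))
# prints: +SearchableText:"lucene" +portal_type:("Document" OR "File")
```

Range queries, sorting and the other ZCatalog options are left out on purpose to keep the sketch short.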

Already significant results!


The result is a big win on large-scale deployments:

  • Indexing and searching are much faster and more scalable than with the ZCatalog.
  • Indexing and searching are much more powerful than with the ZCatalog (analysis, ranking, etc.).
  • Zope's overall performance improves because Zope no longer deals with the indexing and searching business.

Looking for support?

If you are looking for any technical information or help regarding these products please subscribe to the CPS devel mailing list.

If you are looking for commercial support, Nuxeo provides professional services whatever your needs are.

Nuxeo currently maintains NXLucene, nuxeo.lucene and CPSLuceneCatalog, and we always welcome third-party contributors. As a developer, if you are interested in contributing to these projects, we will grant you access to our svn repositories and provide you with all the information you need to get started. Just subscribe to the CPS devel mailing list.

Thanks


A big thanks to our customers at Nuxeo for trusting us, for being patient, and for always bringing bleeding-edge use cases along with their projects.

And don't forget, at Nuxeo we love challenge and innovation!

Hope you'll enjoy those components as much as I enjoyed writing them for our customers. Looking forward to hearing from you.

    J.

Posted by Julien Anguenot @ 06/04/2006 07:34 PM. Categories: ZODB, cps, ecm, java, nuxeo, python, zope, zope3