I think it's time to drop a note to the outside world about what I've been
working on for a little while at
Nuxeo. I am pretty confident that this
project is now at the end of its first iteration.
This post will give you a short overview of the problem we chose to tackle,
namely the indexing and searching stack in a
Zope and
CPS architecture. I submitted an
abstract to
EuroPython this
year. Hopefully, I'll have the chance to give you more technical details at
the conference in July.
Motivations
CPS is based on
Zope and the standard cataloging solution of
Zope, nowadays, is the
ZCatalog.
The
ZCatalog
works really well up to a certain number of indexed documents: that's a
fact. Likewise,
ZCatalog
extensions, such as
TextIndexNG,
are of great interest.
But, because there is a
but, the main problem is that
Zope is dealing with a task it shouldn't have
to deal with. As a result, it decreases the overall performance of the
Zope platform itself. If you are not
convinced, just try injecting 200k documents into a
Zope instance (or a
Plone one if you wish :)), each document
having 50 fields to be indexed, and watch how your response time evolves
when the instance is used as much by people writing to the
database as by others consulting it and thus searching all the time. At
Nuxeo, we tried this on large-scale projects. It
simply doesn't work well or fast enough for serious deployments.
Zope gets really slow...
Anyway, you should see the
ZCatalog
for what it is: a hack on top of the
ZODB, because the
ZODB doesn't provide a native query
language or full indexing support.
For those reasons, we needed an external indexing and searching solution for our
customer projects.
This also follows our vision of
Zope3 being an integration platform
for
ECM applications where
external services could be plugged in thanks to the
Zope3 component architecture flexibility and the agility of the
Python language.
What is Lucene ?
Lucene is an open source
project from the
Apache Software
Foundation written in
Java. This is
a high-performance, full-featured text search engine library.
I suggest that you check the
website, which contains a lot
of useful information and documentation. I would also really recommend
this book to anyone interested in
working with
Lucene and/or
in understanding more deeply how it works and how to use it properly.
The book also describes some projects, such as
nutch, as case
studies, which is very useful for anyone who wants to build a
system on top of
Lucene,
since those case studies describe the best practices.
In
Nuxeo, we first integrated
Lucene for a customer within
the
Apogee project scope. (
Apogee is a framework based on
Eclipse RCP for
ECM rich client applications). It
was a real success, so we decided to go further and see how we
could leverage
Lucene on the server side.
What is PyLucene ?
The first time we seriously considered using
PyLucene was at last year's
EuroPython conference after
Andi Vajda's really
great presentation of
PyLucene.
Andi is the current
main
PyLucene developer.
PyLucene is maintained by
the
Open Source Applications
Foundation.
PyLucene is a
GCJ-compiled version of
Java Lucene integrated with
Python. Its goal is to allow the use of
Lucene's text indexing and
searching capabilities from
Python. It
is designed to be API compatible with the latest version of
Java Lucene.
PyLucene is freaking fast
! Even faster than the
Java Lucene version according to
the authors of the
Lucene In
Action book. Furthermore, it
can easily be kept in sync with the latest
Java Lucene releases since this
is not a
from-scratch port but a
GCJ-compiled version of
Java Lucene itself.
NXLucene : standalone Lucene indexing server
NXLucene
is a standalone multi-threaded remote server handling
Lucene stores. It takes
advantage of the freaking fast
PyLucene Python bindings and uses
Twisted for its server
implementation. It uses some parts of the
Zope3 component architecture as well.
NXLucene
currently supports the
XML-RPC
protocol. (Its roadmap includes an
ICE connector for the 1.x branch.)
NXLucene
can also be seen as a good example of what can be achieved by combining the best
parts of different worlds (Java
Lucene,
PyLucene,
Zope3,
Twisted, ...). Bear in mind that
NXLucene
does not run on top of the
Zope AS. It
is
standalone.
NXLucene
exposes an XML query language for indexing and searching operations. Note
that the
Lucene native search
query is of course still supported. Check the
NXLucene
interfaces.
When you install
NXLucene,
you also install the core libs, which can be used by third-party
Python programs. For instance, the query
lib can help you format your
NXLucene
XML queries, and the testing library can be really helpful for writing
tests for your
Python components that
need to communicate with an
NXLucene
server.
It is important to note that you can query
NXLucene
from any language. You only need an
XML-RPC client library to do so.
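To give an idea of what "only an XML-RPC client library" means in practice, here is a minimal sketch using the Python standard library. Be careful: the `index` and `search` method names, their arguments and the `FakeIndexServer` class are assumptions made up for this illustration. They are not the NXLucene API (check the NXLucene interfaces for that). The fake server only exists so that the example runs on its own.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


class FakeIndexServer:
    """In-memory stand-in for a remote index server (NOT NXLucene's real API)."""

    def __init__(self):
        self._docs = {}

    def index(self, uid, fields):
        # Store the field dictionary under the document's unique id.
        self._docs[uid] = fields
        return True

    def search(self, field, term):
        # Naive case-insensitive substring match; a real server would use Lucene.
        term = term.lower()
        return sorted(
            uid for uid, fields in self._docs.items()
            if term in str(fields.get(field, "")).lower()
        )


def start_server():
    """Start the stand-in server on a free local port and return (server, url)."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False, allow_none=True)
    server.register_instance(FakeIndexServer())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, "http://%s:%d/" % (host, port)


def demo():
    """Index two documents through XML-RPC, then search them by title."""
    server, url = start_server()
    try:
        # This proxy is all a client needs: no Lucene or Java on the client side.
        proxy = xmlrpc.client.ServerProxy(url, allow_none=True)
        proxy.index("doc-1", {"Title": "Lucene in Action"})
        proxy.index("doc-2", {"Title": "Zope 3 components"})
        return proxy.search("Title", "lucene")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the list of matching ids. The same calls could be made from any language that ships an XML-RPC client.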
NXLucene
is an open source project under the
LGPL, part of the
CPS platform project.
For more information about
NXLucene
and its installation you may check the
NXLucene
website.
nuxeo.lucene : Zope 3 cataloging component
nuxeo.lucene
is a cataloging component written on top of the
Zope3 application server, currently
offering an
XML-RPC proxy to an
NXLucene
remote server. It also offers an abstraction for the
cataloging strategy of Python objects,
providing the ability to specify how
Python objects should be indexed and
retrieved from a
Lucene
store through
NXLucene.
(It is important to note that any remote server providing an
XML-RPC remote interface on a
Lucene server could
theoretically be used.)
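The idea of a cataloging strategy can be sketched roughly as follows: a small configuration says which attributes of a Python object become which index fields, and a helper turns an object into a list of fields ready to be sent to the remote server. The names `FieldConf`, `build_fields`, `Document` and the field layout are all hypothetical and only illustrate the principle. This is not the nuxeo.lucene API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldConf:
    """How one attribute is indexed (hypothetical, for illustration only)."""
    attribute: str       # attribute or accessor read from the Python object
    name: str            # field name used in the index
    kind: str = "text"   # "text" (analyzed) or "keyword" (stored as-is)


def build_fields(obj, confs):
    """Turn an object into a flat list of field dicts following the configuration."""
    fields = []
    for conf in confs:
        value = getattr(obj, conf.attribute, None)
        if value is None:
            continue  # skip attributes the object does not provide
        if callable(value):
            value = value()  # support accessor methods such as Title()
        fields.append({"name": conf.name, "type": conf.kind, "value": str(value)})
    return fields


class Document:
    """Toy content object with one plain attribute and one accessor method."""

    def __init__(self, title, portal_type):
        self.title = title
        self.portal_type = portal_type

    def Title(self):
        return self.title


CONFS = [
    FieldConf("Title", "Title", "text"),
    FieldConf("portal_type", "portal_type", "keyword"),
    FieldConf("missing", "missing"),  # silently ignored: Document has no such attribute
]

fields = build_fields(Document("Hello", "File"), CONFS)
```

Keeping the strategy in data rather than in code means the same object can be indexed differently from one deployment to another without touching the object itself.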
Currently, this component is used through
Five from
CPS. Its integration on top of the
Zope3 AS is not finished since we
haven't needed
nuxeo.lucene
outside of
CPS yet. Feel free to
participate in its development if you
are interested in having
nuxeo.lucene
fully integrated on top of a stock
Zope3
AS.
nuxeo.lucene
is an open source project available under the
ZPL, part of the CPS platform project.
CPSLuceneCatalog : CMF Catalog replacement for CPS-3.4
CPSLuceneCatalog is a
CPS-3.4.x specific
product adding the
CPS specific
business rules to
nuxeo.lucene.
For example, it takes care of the way different versions of
CPS documents should be indexed.
CPSLuceneCatalog is a complete substitute for the
ZCatalog,
which shows its limits when dealing with millions of objects.
CPSLuceneCatalog will be shipped along with the next major release of
CPS, version 4, along with the
JackRabbit JCR repository.
CPSLuceneCatalog is almost fully backward compatible with the
ZCatalog
query syntax, so your code should not break if you want to migrate. I don't
currently support 100% compatibility, but I do support at least the subset of
ZCatalog
query syntax we have been using in
CPS internals.
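To illustrate what backward compatibility with the ZCatalog query syntax involves, here is a rough sketch of translating a ZCatalog-style query dictionary into a Lucene query string. It only covers plain values, lists of values, the `operator` option and `min:max` ranges. The `to_lucene`, `clause` and `escape` helpers are hypothetical names written for this post. This is not the actual CPSLuceneCatalog implementation, and range bounds are assumed to be already formatted as sortable strings.

```python
def escape(term):
    """Quote a term so spaces and special characters are treated literally."""
    return '"%s"' % str(term).replace("\\", "\\\\").replace('"', '\\"')


def clause(field, spec):
    """Translate one ZCatalog-style index query into a Lucene clause."""
    if isinstance(spec, dict):
        values = spec["query"]
        if spec.get("range") == "min:max":
            low, high = values
            # Lucene inclusive range syntax: field:[low TO high]
            return "%s:[%s TO %s]" % (field, low, high)
        operator = " %s " % spec.get("operator", "or").upper()
    else:
        # A bare list of values means "any of these" in ZCatalog.
        values, operator = spec, " OR "
    if isinstance(values, str):
        values = [values]
    terms = ["%s:%s" % (field, escape(v)) for v in values]
    return terms[0] if len(terms) == 1 else "(%s)" % operator.join(terms)


def to_lucene(query):
    """Join per-index clauses with AND, since ZCatalog intersects index results."""
    return " AND ".join(clause(f, s) for f, s in sorted(query.items()))


zcatalog_query = {
    "portal_type": ["Document", "File"],
    "SearchableText": "lucene",
    "modified": {"query": ["20060101", "20061231"], "range": "min:max"},
}
lucene_query = to_lucene(zcatalog_query)
```

With the query above, `lucene_query` is `SearchableText:"lucene" AND modified:[20060101 TO 20061231] AND (portal_type:"Document" OR portal_type:"File")`.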
An upgrade step is already available on
CPS 3.4.x instances.
CPSLuceneCatalog is an open source project available under the
GPL, part of the CPS platform project.
Already significant results !
The result is a big win on large scale deployments :
- Indexing and searching are much faster and more scalable than with ZCatalog.
- Indexing and searching are much more powerful than with ZCatalog
(analysis, ranking, etc.)
- Zope's overall performance
improves because Zope no longer deals
with the indexing and searching business.
Looking for support ?
If you are looking for any technical information or help regarding these
products please subscribe to the
CPS devel mailing
list.
If you are looking for commercial support,
Nuxeo provides professional services
whatever your needs are.
Nuxeo is currently maintaining
NXLucene,
nuxeo.lucene
and
CPSLuceneCatalog and we are always welcoming third-party contributors.
As a developer, if you are interested in contributing to these projects,
we will grant you access to our
svn
repositories and provide you with all the information you need in order to
get started. Just subscribe to the
CPS devel mailing
list.
Thanks
A big thanks to our
customers at Nuxeo for
trusting us, being patient, and always bringing bleeding-edge use cases along with their
projects.
And don't forget, at
Nuxeo we love
challenges and innovation !
Hope you'll enjoy those components as much as I enjoyed writing them for
our
customers. Looking
forward to hearing from you.
J.