Thanks to lots of progress in Apache Chemistry, to which Nuxeo is contributing, and through updated Nuxeo Chemistry bindings, the support for CMIS in Nuxeo is getting quite good.
Below are most of the new features available since the last release.
Better search
Fulltext search with CONTAINS() has been implemented so that you can do queries like:
SELECT cmis:name FROM cmis:document WHERE CONTAINS('foobar')
(The full scope of the fulltext search syntax, with ORing of words and negation, is not there yet.)
You can now also use the IN_TREE() and IN_FOLDER() predicates.
The SQL keywords are now case-insensitive as the spec requires, and complex boolean functions have been fixed.
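As a rough illustration, here is how a client might assemble queries that combine these predicates. The helper name build_query and the folder id are made up for this sketch, and quoting is deliberately naive: values containing quotes are refused rather than escaped, since escaping rules are not covered here.

```python
def build_query(fulltext=None, in_tree=None, in_folder=None):
    """Assemble a CMIS query string from optional predicates (illustrative helper)."""

    def literal(value):
        # This sketch does not implement escaping, so it refuses risky input.
        if "'" in value or "\\" in value:
            raise ValueError("quotes and backslashes are not handled in this sketch")
        return "'" + value + "'"

    clauses = []
    if fulltext is not None:
        clauses.append("CONTAINS(" + literal(fulltext) + ")")
    if in_tree is not None:
        # IN_TREE matches documents anywhere below the given folder id.
        clauses.append("IN_TREE(" + literal(in_tree) + ")")
    if in_folder is not None:
        # IN_FOLDER matches only direct children of the given folder id.
        clauses.append("IN_FOLDER(" + literal(in_folder) + ")")

    query = "SELECT cmis:name FROM cmis:document"
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    return query


print(build_query(fulltext="foobar"))
print(build_query(fulltext="foobar", in_tree="some-folder-id"))
```

The first call reproduces the query shown above; the second restricts it to a subtree identified by a placeholder id.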
More CRUD
A number of fundamental features from the CMIS domain model are now complete: object move, folder tree, folder descendants, delete, delete descendants.
Miscellaneous
Other fixes have been made: the types are now served according to the latest 1.0CD06 draft, and a number of fixes have been included to make more CMIS clients happy.
CMIS Shell
Finally, keep in mind that there is now an easy way to test a CMIS repository using a command line client. See the CMIS Shell blog post from Stéfane for more.
I recently had the opportunity to play with CMIS AtomPub bindings, in collaboration with our partners in Canada, as they were building a VBScript API to access a Nuxeo repository via the CMIS standard from a Microsoft-based environment. I had no concrete idea of what CMIS really was, but I had read a few articles about this "greatest common factor of document management concepts between all vendors," as Florent tweeted once ;-).
I don't intend, in this blog post, to give a complete description of CMIS, or its AtomPub bindings. There are good posts and tutorials for this. I want to explain a few points that were not initially straightforward to me, hoping that this will accelerate your first experience with CMIS. You should not have to deal with raw XML the way we do in this post, as long as you use mainstream languages: Java, C#, Python, etc. Indeed, there are or will be libraries to let you handle plain objects and methods. But the following approach will be very useful if you want to interface your document repository with a legacy application written in a not-so-common language that doesn't benefit from projects like Chemistry, or if you just want to understand some basics of CMIS.
Let's start with a few concepts:
in CMIS, entities are called objects and are described in XML using the Atom syntax. You'll find in the Atom entry a full description of an object (properties, linked objects, URLs, etc.)
each object and main collection of objects is tied to an HTTP URL. You process information according to the semantics of the HTTP method used to communicate with the server:
GET is used to retrieve information on the object
PUT is used to update an object
POST to create
DELETE to delete
The CMIS server gives you, in its Atom answers, URLs to get or process all the information you need. You just have to parse the answer, looking for "rel" links that give you URLs to download attached files, children, renditions, etc. You don't have to obtain the information from anywhere else but the server, which allows you to develop highly interoperable code.
For the operations that come next, you need to deploy the Chemistry libraries and the Nuxeo-Chemistry implementation on your Nuxeo instance (or make sure that you have another CMIS-compliant server, and adapt the URLs accordingly). An even easier solution is to use our Nuxeo CMIS demo server, which is reset every day and frequently updated to take your feedback into account.
To send an HTTP request to the server, I will use "curl." You can also use "wget", or any other HTTP requester. In some of the requests, I need to send XML file content (most often an Atom entry). You will find up-to-date samples of those XML files in the chemistry source code; they are very instructive. I also attached to this blog post the Atom entries I used in the following samples.
Let's start with the concrete experiment :)
The initial URL, to get information on the repository will be, from a shell:
Doing so, you will receive a feed of atom entries which is actually the collection of children of the root. Here is the atom entry associated to "default domain" in Nuxeo:
What is this Atom entry? It is the representation of the state of the object whose ID is 4593d0e5-fa7f-4ae1-9472-3030c270bb1e, transferred by the CMIS server (on top of our Nuxeo repository). You will find mainly four parts in it:
the Atom header, with elements id, title, updated, summary.
the content element (when the object is a document, not a folder) or an "alternate" link, which gives a URI to remote content
the links associated to the document that give the URLs to request to perform operations on the object, such as service, self (the URL of the object), edit (to modify the object), alternate, described, up, down (we will see later the use of those links).
the cmisra object description, and among them you will find the custom type property values.
For some operations (like create or update metadata), and depending on how rigorous the implementation is, you need to be sure that:
the id element is not null
the title element is not null
the author or source is not null
either alternate link or content is not null
If you miss one of those rules, your request will be rejected. Now let's go with basic operations.
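To avoid a rejected request, these rules can be checked on the client before sending anything. The following is a minimal sketch using only the Python standard library; the function name missing_parts and the sample entry are illustrative, and a given server may well enforce more (or fewer) constraints than these four.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"


def missing_parts(entry_xml):
    """Return the list of rules (from the list above) that the entry breaks."""
    entry = ET.fromstring(entry_xml)

    def find(tag):
        return entry.find("{%s}%s" % (ATOM, tag))

    def has_text(tag):
        element = find(tag)
        return element is not None and bool((element.text or "").strip())

    problems = []
    if not has_text("id"):
        problems.append("id")
    if not has_text("title"):
        problems.append("title")
    if find("author") is None and find("source") is None:
        problems.append("author or source")
    has_alternate = any(
        link.get("rel") == "alternate"
        for link in entry.findall("{%s}link" % ATOM)
    )
    if not has_alternate and find("content") is None:
        problems.append("alternate link or content")
    return problems


# Illustrative entry: it has no content element and no alternate link.
SAMPLE = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>urn:uuid:placeholder</id>
  <title>myfolder</title>
  <author><name>someone</name></author>
</entry>"""

print(missing_parts(SAMPLE))
```

An empty list means the entry passes all four checks.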
To browse the repository, use the up and down links:
In the result of the previous command, where I asked for the children of the "Default domain" node, I looked for the string "workspaces" to get the id of the workspaces object. I can then send a request to get the children of the workspaces node, etc.
To create a folder: suppose you want to create a folder "myfolder" under the object of id "A". You do a POST to the URL of the object that represents the collection of the children of "A". You get this URL by looking for the link with rel="down" whose content type is application/atom+xml;type=feed. Take care not to pick the "tree" link, which is of type application/cmistree+xml and returns the hierarchy of descendants.
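Picking the right "down" link is easy to get wrong, so here is a small sketch of the selection logic. The sample entry and its example.org URLs are invented for the illustration; only the rel and type values come from the description above.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
FEED_TYPE = "application/atom+xml;type=feed"


def children_feed_url(entry_xml):
    """Return the href of the rel="down" link that points to the children feed."""
    entry = ET.fromstring(entry_xml)
    for link in entry.findall(ATOM + "link"):
        # Drop spaces so "application/atom+xml; type=feed" also matches.
        link_type = (link.get("type") or "").replace(" ", "")
        if link.get("rel") == "down" and link_type == FEED_TYPE:
            return link.get("href")
    return None


# Hypothetical entry for object "A" with both kinds of "down" link.
ENTRY_A = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>urn:uuid:A</id>
  <title>A</title>
  <link rel="down" type="application/cmistree+xml"
        href="http://example.org/cmis/tree/A"/>
  <link rel="down" type="application/atom+xml;type=feed"
        href="http://example.org/cmis/children/A"/>
</entry>"""

print(children_feed_url(ENTRY_A))
```

The returned URL is the one to POST the new folder's Atom entry to; the cmistree link is skipped even though it also has rel="down".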
Note the header variables. If you forget to mention them, your request will not be processed successfully. The result of the previous request is an Atom entry corresponding to the newly created object. Note also that I specified in the Atom entry the Nuxeo custom type I wanted to use: "Workspace":
I can check, by browsing Nuxeo DM, that my folder has been created.
Document creation:
To create a document with content in this folder, there are basically two methods:
either you create an empty document (without the binary file, just the metadata) and then you update the content using the edit-media link (two requests).
or you create the document with the content in the Atom entry (content element) encoded in base64 (one request only).
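For the second method, the binary has to be base64-encoded before it goes into the content element. Here is a minimal sketch of that step with the Python standard library. The function name and placeholder values are illustrative, and the cmisra:object part carrying cmis:objectTypeId is left out to keep the example short, so the result is not a complete creation request.

```python
import base64
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("atom", ATOM_NS)


def entry_with_inline_content(title, data, mime_type):
    """Build an Atom entry whose content element holds base64-encoded bytes."""

    def tag(name):
        return "{%s}%s" % (ATOM_NS, name)

    entry = ET.Element(tag("entry"))
    # The id is a placeholder: it must be present, but its value is arbitrary here.
    ET.SubElement(entry, tag("id")).text = "urn:uuid:placeholder"
    ET.SubElement(entry, tag("title")).text = title
    author = ET.SubElement(entry, tag("author"))
    ET.SubElement(author, tag("name")).text = "someone"
    content = ET.SubElement(entry, tag("content"), {"type": mime_type})
    content.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(entry, encoding="unicode")


xml_text = entry_with_inline_content("hello.bin", b"hello", "application/octet-stream")
print(xml_text)
```

The trade-off between the two methods is the usual one: a single request is simpler, but base64 inflates the payload by roughly a third, which matters for large files.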
Let's go with the first one. You need to prepare your Atom entry (basically take the one used for folder creation, and set the cmis:objectTypeId element to the value "File"). Note that if you choose cmis:folder or cmis:document, the Nuxeo implementation will map them to the Folder and File types. Also, the summary element is mapped to dc:description, and title to the document name (and dc:title). The ID value you put in that case is not very important (it is not taken into account), but it should not be null.
Note here the Slug header parameter, to give the file name, as well as the Content-Type header parameter.
Metadata update:
To update your document's metadata, you need to PUT an Atom entry on the edit URL of the object, adding the values of the metadata you want to update under the element cmis:properties of the Atom entry describing the object. Let's say you want to update the title and to enter the "language" metadata of the previously created document. Metadata "language" belongs to the Dublin Core schema in Nuxeo.
Here is the most interesting part of the updatedocument.atomentry.xml Atom entry:
<cmisra:object>
<cmis:properties>
<cmis:propertyId propertyDefinitionId="cmis:objectTypeId">
<cmis:value>File</cmis:value>
</cmis:propertyId>
<cmis:propertyString propertyDefinitionId="dc:title">
<cmis:value>Change the title of the document using CMIS
</cmis:value>
</cmis:propertyString>
<cmis:propertyString propertyDefinitionId="dc:language">
<cmis:value>EN</cmis:value>
</cmis:propertyString>
</cmis:properties>
</cmisra:object>
Note that under cmis:properties you find only the desired metadata plus the cmis:objectTypeId property, which is mandatory to avoid an error (editor's note: this should be verified).
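If you would rather script the update than type a curl command, the PUT can be prepared with Python's standard library. This is only a sketch: the edit URL, the credentials and the payload are placeholders, the Content-Type value is the usual one for Atom entries, and the request is built but deliberately not sent.

```python
import base64
import urllib.request


def build_update_request(edit_url, entry_bytes, user, password):
    """Prepare (without sending) a PUT of an Atom entry to an object's edit URL."""
    credentials = base64.b64encode(
        ("%s:%s" % (user, password)).encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        edit_url,
        data=entry_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/atom+xml;type=entry",
            "Authorization": "Basic " + credentials,
        },
    )


# Placeholder values: take the real URL from the "edit" link of the entry,
# and the real payload from your updatedocument.atomentry.xml file.
request = build_update_request(
    "http://example.org/cmis/hypothetical-edit-url",
    b"<entry>placeholder payload</entry>",
    "user",
    "secret",
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would actually perform the update.
```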
To delete a non-empty folder, you need to perform the delete on the tree object, so you need to use the down link that returns an application/cmistree+xml object:
I recently added a new project in nuxeo-webengine: nuxeo-webengine-gwt
This project provides the capability to develop GWT applications in Eclipse
and launch an embedded Nuxeo Server as part of the GWT dev. mode server
to be able to debug your Nuxeo GWT applications inside Eclipse (without
the need to deploy it on a real Nuxeo server).
Also it provides a mechanism to transparently deploy the compiled GWT
applications in Nuxeo WebEngine (without the need to create separate WARs).
Nuxeo GWT applications are packaged as regular Nuxeo bundles. Also you can make use of the GWT RPC mechanism in your application without worrying about RPC servlets deployment in Nuxeo.
The GWT applications can be exposed to clients either through WebEngine objects, JAX-RS objects or custom servlets.
Note that this bundle is not yet part of any Nuxeo distribution, so you need to install it by hand in your Nuxeo server.
The upcoming CMIS standard is approaching its final 1.0 version, and I thought I would take the time to present some of its most advanced features.
Basics
I will not detail here the basics of the CMIS domain model, but I will mention quickly for completeness:
CMIS stores folders, documents and relationships (collectively called objects),
each object has a unique id,
objects have "object types" detailing the properties they're allowed to have,
properties have the usual basic "property types" (strings, numbers, dates, lists, etc.),
you can create, retrieve, update and delete objects (CRUD),
documents may have an associated content stream (an attachment),
you can search documents using a SQL-based language,
clients talk to CMIS servers using AtomPub or SOAP.
Below I will detail the more advanced features of CMIS.
Unfiling, Multi-filing
While most people are used to storing documents inside a navigation tree, where the intermediate tree nodes are folders, there are other ways to deal with content, which CMIS exposes through the concepts of "unfiling" and "multi-filing" (the term "filing" expresses the idea that a document is stored in a place, much like in the real world).
The first alternative way of storing a document is to not file it anywhere: the document is not held in a folder, it just exists: it is then said to be unfiled. The document is not lost however, because given a document id you can retrieve the properties and content stream of the document, and if you don't know its id you can do a search based on relevant criteria to find your document.
This model of unfiled documents is quite common in the world of records management, where what is important is the "record" (the content and metadata), and not a folder in which it may live. The record itself carries all the metadata you need to find it (dates, keywords, tags, etc.), and instead of listing "what's in a given folder", you can list records according to simple or complex search criteria.
The second alternative way provided by CMIS to store a document is to allow it to live in several folders at the same time: this is called multi-filing. It's another way of organizing content, and can be quite powerful.
Multi-filing is often used to organize documents in folders along several axes, where a folder represents a criterion and the presence of a document in a folder reflects the fact that the criterion applies to the document. Multi-filing can also be used to express "publishing" concepts (publishing a document in several categories means just filing it in different folders, each folder representing a category).
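To make the two concepts concrete, here is a toy in-memory model (not a CMIS client, and not how any repository stores things): each document simply keeps a set of parent folder ids, which is empty when the document is unfiled and has several members when it is multi-filed. All names are invented for the sketch.

```python
class FilingRepository:
    """Toy model: a document's filing state is just its set of parent folders."""

    def __init__(self):
        self.folders_of = {}  # document id -> set of folder ids

    def create_document(self, doc_id, folder_id=None):
        # Creating without a folder yields an unfiled document.
        self.folders_of[doc_id] = set() if folder_id is None else {folder_id}

    def add_to_folder(self, doc_id, folder_id):
        self.folders_of[doc_id].add(folder_id)

    def remove_from_folder(self, doc_id, folder_id):
        # The document keeps existing even when its last folder is removed.
        self.folders_of[doc_id].discard(folder_id)

    def is_unfiled(self, doc_id):
        return not self.folders_of[doc_id]

    def children(self, folder_id):
        return sorted(
            doc_id
            for doc_id, folders in self.folders_of.items()
            if folder_id in folders
        )


repo = FilingRepository()
repo.create_document("report")              # unfiled: reachable only by id or search
repo.create_document("invoice", "finance")
repo.add_to_folder("invoice", "2010")       # multi-filed: visible in both folders
print(repo.is_unfiled("report"))
print(repo.children("finance"), repo.children("2010"))
```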
Both of these features are optional in a CMIS repository.
Renditions
In content management systems, it's quite common for a document to have different renditions. A rendition is an alternate way of viewing or representing a master document. For instance from an OpenDocument file you may derive a PDF rendition, a 100x140 pixels image rendition of the cover page, a Microsoft Word rendition, a rendition as a series of high-resolution images for each page, an HTML rendition, a pure text rendition, an MP3 rendition of the content as spoken text, etc. From a video document you may get a H.264 rendition, a Flash rendition, a 64x64 pixels image rendition, a rendition as a series of 320x200 pixels images every 10 seconds of the video, an MP3 rendition of the audio stream, a pure text rendition of the speech extracted from that audio stream, a text rendition of the extracted subtitles, etc.
CMIS doesn't expose any way to create or control these renditions (it's too complex, and up to the content management system to decide what they are), but it exposes a way to discover and retrieve them. Documents and folders can both have renditions, each rendition being seen as an alternate content stream.
Renditions have rudimentary metadata, among which a MIME type, a width and height (recognizing that renditions are often visually oriented), a title, and a "kind" which is used to categorize the renditions. CMIS only defines one standard kind, the thumbnail, but more could be added in future versions of the specification. The fact that it's useful for a folder to have a thumbnail or an icon is the reason why folders are allowed to have renditions while they can't have a normal content stream.
Rendition support is optional (and in any case it's the repository that decides what renditions to expose for each object).
Versioning
In CMIS a document (if its type supports it) can be versioned, which means that "old" versions are retained by the system. A version can be "major" or not, but CMIS doesn't impose any semantics on this, it's just a useful abstraction. To create new versions, a model of checkin/checkout is used: after checkout from a version, a private working copy (PWC) is created, which can be modified and then checked back in, creating a new version.
Here the model gets complex because in the real world there are many ways in which versioning can be done.
In the most complete scenario, the repository allows read and write access to all versions, including the PWC, and allows all versions and the PWC to be searched. The versions can also be filed independently in the same (or different) folders, several versions being then accessible at the same time.
This model can be restricted by the CMIS repository in various ways. The repository can specify that:
only the latest version may be accessible or searchable, not the older versions nor the PWC,
a PWC may be checked out from only the latest version,
a PWC may not be updatable at all, only checked back in with some modifications in a single operation,
a checkout may not be allowed at all, in which case new versions may be created only by applying an update to an existing version; this leaves the existing version unchanged but creates a new version holding the updated data (this is called auto-versioning),
all the versions of a given document are held in the same folder (this is called version-independent filing, the opposite is called version-specific filing),
only a single version of a document (the latest version or latest major version) may be filed in a folder, the other versions being "hidden" (not filed); when new versions or new major versions are created they automatically replace the previous one filed in the folder (this is another aspect of version-independent filing).
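The difference between the checkout/checkin cycle and auto-versioning is easier to see in code. The following is a deliberately simplified in-memory model with invented names; it ignores major/minor versions, filing and search, and is not meant to mirror any particular repository.

```python
class VersionedDocument:
    """Toy model of a version series with an optional private working copy."""

    def __init__(self, content):
        self.versions = [content]  # oldest first, latest last
        self.pwc = None            # private working copy, if checked out

    def checkout(self):
        if self.pwc is not None:
            raise RuntimeError("already checked out")
        # This model only allows checkout from the latest version.
        self.pwc = self.versions[-1]

    def update_pwc(self, content):
        if self.pwc is None:
            raise RuntimeError("not checked out")
        self.pwc = content

    def checkin(self):
        if self.pwc is None:
            raise RuntimeError("not checked out")
        self.versions.append(self.pwc)
        self.pwc = None
        return len(self.versions)  # number of versions after checkin

    def auto_version_update(self, content):
        # Auto-versioning: no checkout; the update itself creates a new version
        # and the previous version is left unchanged.
        self.versions.append(content)


doc = VersionedDocument("draft")
doc.checkout()
doc.update_pwc("reviewed")
doc.checkin()
doc.auto_version_update("published")
print(doc.versions)
```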
Given this wide variation of capabilities, having a generic client that understands all the versioning models will certainly be a challenge, but this is the cost of having interoperability with many systems that have different ideas of what versioning should look like.
Security through ACLs
Being able to access documents is the basis of content management, but in existing systems this access is often restricted by various permissions that depend on the user doing the action. The permission systems implemented by content repositories are extremely varied (even more than for versioning), and even though CMIS cannot hope to model them in an interoperable manner it's been recognized that some minimal operations can be agreed upon.
In order to work with permissions, a basic (and optional) set of permission management operations has been defined, based on access control lists (ACLs). The ACL on a document is a list of basic assignments of permissions to users, defining what they can do on this document.
CMIS defines three basic permissions: Read, Write, and All. It's up to each repository to define exactly the semantics of these permissions, but they are common enough that a client should be able to work with them easily even if the details are unknown to it: a client can easily tell a user whether they will have the right to modify a document or not.
If a client really needs it, however, the CMIS repository exposes exactly what individual CMIS operations are allowed for each of these permissions. A repository can also define additional non-standard permissions, and using the same mechanism tell a client what operations will be allowed for each. In this manner, a client can discover in advance the restrictions placed on a document.
Optionally, a repository may allow a client to not only check but also change the ACL on a document, so that for instance other users are given rights to modify it, or instead disallowed from even seeing it.
ACLs are often more complex than just a list of permissions given to users on a document. For example, many systems have inheritance of ACLs, which means that an ACL applied to a folder has an effect on the documents filed in that folder, and also on other documents further down the tree. Other systems have more complex rules. A CMIS repository can tell a client which of these three models (object-only, with inheritance, or completely repository-specific) it uses. When retrieving the ACL effective on a document, a repository can also tell a client if the ACL has really been set directly on that document, or if it has somehow been derived from inherited ACLs or through more complex policies.
Change Log
It's important for external search services, caching systems or synchronization engines to be able to know what has happened in a repository since their "last visit". To that end, CMIS has an (optional) change log service that can be queried to discover the past operations that have been done in the repository after a specified date.
The change log service returns a list of basic operations that have happened in the repository: object creation, modification or deletion, as well as security changes on an object. For modification operations, the repository may also include the new values of properties set on that object.
The change log can be queried by starting from a given point in time materialized by an opaque "change log token", which a client should request from the repository whenever it checkpoints its state. The repository will later be capable of returning all the changes made since that time.
If the repository cannot record all its history since it was created, the change log may be "incomplete"; in that case it may not be possible to get a change log starting from very old change log tokens. However when a repository returns changes from a supported change log token, all the changes up to the current moment must be returned: no intermediate changes can be lost.
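The token-based protocol can be pictured with a toy in-memory change log. Everything here is invented for the illustration: the token happens to be a position in a list, but the client never looks inside it, and a real repository is free to choose a different token format and slightly different boundary semantics.

```python
class ChangeLog:
    """Toy change log: the client checkpoints with a token and asks for what came after."""

    def __init__(self):
        self._events = []

    def record(self, kind, object_id):
        # kind would be something like "created", "updated", "deleted" or "security".
        self._events.append((kind, object_id))

    def latest_token(self):
        # Opaque to the client; in this sketch it encodes the log position.
        return str(len(self._events))

    def changes_since(self, token):
        # Every event after the checkpoint is returned, with no gaps.
        return list(self._events[int(token):])


log = ChangeLog()
log.record("created", "doc-1")
token = log.latest_token()          # the client checkpoints its state here
log.record("updated", "doc-1")
log.record("deleted", "doc-2")
print(log.changes_since(token))
```

A synchronization engine would store the token next to its own index and repeat the last two steps on each visit.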
Conclusion
I hope that this overview of its advanced features has convinced you that CMIS is a worthwhile standard, that many powerful things can be done with it, and that many vendors will soon be using it for interoperability. Nuxeo is committed to CMIS, and we'll be releasing a new version of our CMIS connector, supporting the latest 1.0cd04 draft, in a few days.
A final approval of CMIS 1.0 is expected in early-to-mid 2010. In the meantime, the Public Review of CMIS is still under way, please read the spec, implement it, and give feedback!
Last week I had the pleasure to attend the second
workshop organized by the IKS
project in Rome. The goal of this four-year project is to
develop a software stack and a set of design guidelines to help CMS
developers leverage the promises of knowledge oriented software and Linked Data.
In the following I will give a brief overview of some of the
discussions that happened during those four days and a summary
of the Scribo project I presented during the demo sessions on the
last day. A more complete coverage of the event can be found on the event page
of the IKS wiki.
Materialized semantic indexes
Rupert Westenthaler from the Salzburg Research team is working
on a very interesting prototype to make CMS applications able to perform
fast complex graph queries on a knowledge base by materializing named
graph queries into flat Lucene indexes and tracking the knowledge base
changes to detect when the indexes need incremental updates.
To me this sounds a lot like the permanent MapReduce views used to
query the CouchDB document database. I really look forward to the release
of the first prototype along with some benchmarks to compare this
approach with general purpose un-materialized SPARQL engines such as
Jena SDB / TDB, Sesame and Virtuoso.
Bridging CMIS and RDF/OWL
During the workshop Gokce
Laleci introduced a prototype mapper from JCR to RDF, from content
structure to explicit semantic knowledge. The goal is to express the
underlying structure (document types and properties) specific to a given
CMS content store as a standard based and interoperable knowledge view
that can be directly aggregated by Linked Data crawlers.
She and her team will now work on a similar mapper for the CMIS protocol
using Nuxeo DM
and Apache Chemistry
as primary integration platform. Another interesting lead would be to
translate SPARQL queries in CMISQL when the mapping makes sense and
hence allow any CMIS content repository to behave as a SPARQL endpoint.
In the long term, I am not sure whether we want to keep the content
and knowledge in separate stores as we do currently in Nuxeo (Nuxeo Core and Jena). It might be
simpler and more efficient to combine them both in the Core and use such
configurable knowledge mappers along with materialized graph queries to
implement the semantic features of Nuxeo.
Ontology-free semantic indexing
Stephane Gamard (@sgamard)
introduced the services offered by the SalsaDev platform. Their startup
focuses on leveraging an algorithm able to semantically index any
text document such as blog posts, web page snippets, wikipedia
articles and look up semantically related documents in all
indexed content without relying on explicit ontologies or topic
classification. Their approach offers the same advantages as Latent
Semantic Analysis but is also scalable to very large document
collections while LSA suffers from quadratic lookup times that
make it unusable in practice.
This approach is very similar to a semantic
hashing prototype I have been working on during my idle weekends for quite
some time now. The short term goal is to implement an image search
by similarity feature for the future Nuxeo Digital Asset Management
product. In the longer term the same algorithm should be adapted to also
work for text document similarity search.
Those purely data-driven approaches are interesting for at least two
reasons:
they allow for a natural implementation of the unstructured
"query by example" paradigm,
they can be combined with more structured semantic extractions to
perform disambiguation in a named entities recognition component
for instance.
Using UIMA for economic intelligence
Tommaso Teofili (@tommasoteofili)
from the Apache UIMA team
demoed a real application of semantic knowledge extraction to monitor the
temporal evolution of real estate market prices in the Rome area. The asset
price data categorized by surface and number of rooms is automatically
extracted from the raw unstructured content of public ads web pages
and aggregated in a relational database that feeds a charting and reporting
user interface.
The data extraction magic is performed by a UIMA chain
that wraps the online semantic engines provided by the AlchemyAPI web
service. Such semantic lifting services are typically what Nuxeo aims
to provide as part of the platform without relying on third party service
providers.
Incidentally Peter
Mika from Yahoo! Research is working on a similar prototype to find
his next flat in Barcelona.
Nuxeo and automated Semantic Knowledge extraction
As part of the demo session, I chose to present some of the ongoing
work done by Nuxeo and its partners as part of the Scribo project. One of the
goals of this project is to extract the occurrences of entities (such as
persons, organizations and places) and the semantic assertions between those
entities ("Person A" is the CEO of "Company B" or "Person B" has declared
that "he will reform the Health care system"). To that end we chose
to package annotators as chained UIMA Analysis Engines and store the
extracted semantic annotations as RDF assertions using the classes of
the DBPedia ontology. Here are the slides introducing the context of
the demo:
The demo itself is two-fold. The first part features the Scribo
Workbench mainly developed by XWiki to configure and test a chain of
UIMA annotators to extract semantic knowledge from the text content
of documents coming from heterogeneous content repositories such as
a filesystem folder, a CMIS repository (Nuxeo DM) or an XWiki server
accessed through its RESTful API.
The user can then combine such a document source with one or several
registered annotators into a UIMA chain (a.k.a Collection Processing
Engine), run the process and view the results as annotated text document
directly in the Eclipse UI. The user can also validate or invalidate the
extracted annotations and hence incrementally build a validated knowledge
base of semantic statements out of his unstructured content. The following
screencast shows the details of this scenario using the Stanford Named
Entity Recognition annotator on 2 wikinews articles:
The second part of the demo showcases the deployment of the previous
UIMA chain directly inside a Nuxeo DM 5.3 instance. PDF documents are
directly semantically annotated at import time thanks to an asynchronous
event listener that calls a new UIMARunnerService packaged as
an OSGi component deployed by the Nuxeo Runtime.
The extracted named entities are stored in the default Nuxeo Jena
store. Some work is still needed to make the annotations show up correctly
in the "preview tab" and make it possible to validate / invalidate extractions
from the "knowledge base" tab.
Remember this is just the beginning and we plan to
support all languages significantly represented in Wikipedia
along with finer grained entity classes. You can also get an
overview on the global semantic R&D effort at Nuxeo on our
Jira.
Last but not least, the showcased demo is deployable on your own Nuxeo
DM 5.3 instance by deploying a simple plugin as explained on the UIMA page
of the Nuxeo wiki. Beware that this is really alpha alpha
work and should not be deployed on a production setup.
Standards for Social Business: Support for OpenSocial allows the creation of gadgets to build web mashups within enterprise applications. Nuxeo DM 5.3 can serve as both publisher of gadgets as well as be an OpenSocial container, allowing the Nuxeo DM repository to host and participate in enterprise mashups.
Windows SharePoint Services support: Access Nuxeo DM 5.3 via Microsoft SharePoint for basic library services. Native integration with Windows Explorer ensures information workers can use familiar browsing habits to access Nuxeo DM content and perform common file operations. Microsoft Office integration allows files to be opened or saved directly to/from Nuxeo DM and lets users see information about their content directly from the Document Panel in Microsoft Office.
Federated publishing: Centrally control and publish content to remote Nuxeo applications, file systems, HTTP servers, web portals, and more thanks to a pluggable and unified Publishing Service. It is suited to deployments with a distributed information architecture, balancing ease of publication with requirements for content control.
Tagging Services: Enhanced metadata capabilities ensure information workers can categorize their content in ways that make sense for average users. Tagging recommendations and tag cloud support enrich an organization's ability to describe and retrieve their information assets.
New Performance Record for Large Deployments: Faster search and retrieval benchmarks for Nuxeo DM repositories holding several million items for hundreds of concurrent users using inexpensive commodity hardware specifications.
Email Capture Enhancements for MS Exchange and Gmail: Improved support for email folders. Fetch email from Microsoft Exchange or Gmail to/from the Nuxeo DM 5.3 email archive folders to ensure contextual management of business communication.
CMIS Support: To ensure true interoperability across multiple ECM, ERP and search systems, Nuxeo DM 5.3 offers a CMIS Server via an add-on component, based on the CMIS draft 0.62.
Search/Indexing Improvements: Granular metadata search and indexing allows users to find and retrieve content based on one or more fields.
Improved Developer Productivity: Several enhancements have been made to make the experience of developing on top of the Nuxeo platform more productive and enjoyable.
From 5.2.1 to 5.3
Our initial plan was to release a 5.2.1 at the beginning of the summer.
But, for better or worse, we were very busy working on customer projects and missed the window of opportunity to make a release at the time.
So when we finally could focus on making a new public release of Nuxeo DM, we realized that we had done too much work to just call it a simple maintenance release.
That's why this present release is numbered 5.3.
In parallel, critical fixes have been back-ported in 5.2 as a service pack (Nuxeo 5.2 SP1).
Full backward compatibility
This release introduces new services and APIs, but we did not break any existing API.
This means that developments done against 5.2 should run without major problems against a 5.3.
Unlike the 5.1->5.2 migration that required compatibility packages, the 5.2->5.3 migration should be completely painless (see below).
Additional features in Nuxeo DM
As always, we tried to include in Nuxeo DM as many useful features as possible.
Nevertheless, not all the new features provided by Nuxeo EP are directly visible in Nuxeo DM 5.3.
There are several reasons for this:
the Nuxeo DM distribution is already pretty big
some of the technical infrastructure improvements cannot be simply demonstrated
This means that new features in Nuxeo EP:
have been included by default in DM when it makes sense
are available as add-ons in the other cases
For the features released as add-ons, all necessary new APIs in Nuxeo DM are already part of 5.3.
What's new in 5.3?
Web features
During the last months we've worked with our partners and customers on several intranet and portal projects.
This means a lot of small improvements have been made to WebEngine and the Theme engine.
Tag Service
You can now add tags to Nuxeo Documents and:
search documents based on their tags
navigate the document database via a tag cloud
The tag service is only available when using VCS storage.
In the medium term, it could become a feature of Nuxeo Core.
Here is a demo:
Blogs
We finally released a simple Blog implementation using the WebEngine framework.
This blog is a simple example of how Documents can be managed in the back-office (in the default JSF WebApp) and accessed via a public Web interface.
WebWidgets
The Theme engine of 5.3 comes with support for WebWidgets.
It lets you include UWA JavaScript widget containers in Themes.
WebWidgets are very close to OpenSocial gadgets (see below). In the future we will make these two JavaScript portlet models converge.
Because OpenSocial and WebWidgets are providing very similar features, WebWidgets are not part of the default packaging.
OpenSocial
Contributed by the community
If you were at last year's Nuxeo DevDay conference, you probably know that Damien Metzler [video proof] and his team from Leroy Merlin have been working on Nuxeo WebEngine and OpenSocial for some months now.
You can find more information about their work on [Damien's blog].
They contributed a lot of their work on the integration of OpenSocial in Nuxeo:
Apache Shindig (OpenSocial server) as a Nuxeo Service
GWT based OpenSocial gadget container
WebEngine based OpenSocial portal
Integration of Shindig with Nuxeo's authentication and user management
Gadget persistence API based on Nuxeo's DocumentManager
Some slides about Nuxeo and OpenSocial:
Nuxeo 5.3 integration
Based on the work contributed by Damien's team, we rebuilt the Dashboard using the GWT gadget container and Nuxeo's REST API.
Basically, almost all of the previously available portlets are still here, but now:
users can customize their dashboard
add/remove new portlets (gadgets)
change layout
it's very easy to add specific gadgets
(HTML + JavaScript, instead of JSF + Nuxeo Themes)
Here is a video showing their work in action:
Currently, OpenSocial integration into DM is limited to Gadgets. It does not expose all the infrastructure work already done, nor does it expose all the advantages we can gain from the OpenSocial standard.
In the coming months, we also expect to take advantage of OpenSocial's social features.
Multi-instance management
Remote Publishing
The publishing service has been replaced by a completely new Publisher Service.
The new API lets you publish a Nuxeo Document to an abstract tree.
This tree could be:
a local Nuxeo Sections tree (as before)
a Section tree hosted on a remote Nuxeo server
a filesystem tree
a custom tree pointing to an external application
This service was designed to support decentralized publishing in Nuxeo, but can also be used to publish Nuxeo documents to an existing portal or web site.
Nuxeo Replication
This new add-on provides an application-level replication service.
It is used to replicate changes (Documents, Audit, Directories...) from one Nuxeo instance to another.
Because this replication is managed at the application level, you can decide which part of your data you want to replicate. This replication scope is typically defined by an NXQL query.
This service has already been used to:
provide an offline client based on the Jetty DM bundle
(replicate only the documents accessible by the user).
provide staging between several Nuxeo instances
(push a whole tree to staging).
Windows integration
Browser and Office helpers
Since we are not .NET and MS Office experts, the LiveEdit and MSIE plugins have always been a pain to package.
The good news is that we found someone to help us with this. As a first step, we did a big code cleanup, fixed some bugs, and integrated all the .NET builds into the CI chain.
As a consequence, it will be easier to maintain and improve these plugins.
Any feedback on these plugins will be helpful: we don't have many MSIE and MSOffice users on site.
WSS Extensions
Windows SharePoint Services (WSS) is a set of protocols published by Microsoft that describe how SharePoint communicates with the rest of the Microsoft world.
WSS has a broad scope and contains different technologies (FrontPage extensions, WebDAV, CAML, WebServices...).
To implement WSS extensions for Nuxeo, we focused on the interfaces exposed by SharePoint to client applications like MS-Office and Windows Explorer.
The goal is to let MS-Office and Explorer talk to a Nuxeo server as if it were a SharePoint server.
Since part of the work is boring protocol implementation (like implementing FrontPage extensions), the Nuxeo WSS extensions are implemented as two separate modules:
a generic handler that does not rely on Nuxeo framework code and provides a SPI (Service Provider Interface)
the Nuxeo WSS backend which implements the SPI on top of the Nuxeo EP services (Repository, Relations, Workflow, UserManager)
In order to use WSS you will need MS-Office 2003 or 2007.
See the screencast below for these features in action:
VCS improvements
Nuxeo 5.2 was the first version to ship with VCS, or "Visible Content Store", our SQL-based backend for Nuxeo.
Nuxeo 5.3 comes with a lot of improvements to VCS.
Performance
We did a lot of performance testing on VCS, using FunkLoad (our open source functional and load testing toolkit) and the importer (see below).
Based on our results and on feedback from support, we made some performance improvements.
Security checks: low-level security checks have been optimized so that even filtering several millions of documents can be very quick.
Path-based queries: VCS now manages a new "ancestors" table that allows quick queries on the path.
Proxy optimizations: proxy searches now avoid costly joins.
Thanks to these optimizations, browsing and searching on a repository with several millions of documents is not an issue, even with a cheap server.
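To illustrate the idea behind an "ancestors" table, here is a minimal sketch using SQLite from Python. The table and column names (hierarchy, ancestors, id, parentid, ancestor) are assumptions made for this example and are not the actual VCS schema. The point is only that, once each document's ancestors are stored explicitly, "everything under this folder" becomes a single indexed lookup rather than a recursive walk of the tree.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE hierarchy (id TEXT PRIMARY KEY, parentid TEXT, name TEXT);
        CREATE TABLE ancestors (id TEXT, ancestor TEXT);
        CREATE INDEX ancestors_ancestor_idx ON ancestors (ancestor);
    """)
    return conn


def add_document(conn, doc_id, parent_id, name):
    """Insert a document and record all of its ancestors at write time."""
    conn.execute("INSERT INTO hierarchy VALUES (?, ?, ?)",
                 (doc_id, parent_id, name))
    if parent_id is None:
        return  # the root has no ancestors
    rows = conn.execute("SELECT ancestor FROM ancestors WHERE id = ?",
                        (parent_id,)).fetchall()
    # Ancestors of the new document: its parent plus the parent's ancestors.
    inherited = [parent_id] + [row[0] for row in rows]
    conn.executemany("INSERT INTO ancestors VALUES (?, ?)",
                     [(doc_id, ancestor) for ancestor in inherited])


def descendants(conn, folder_id):
    """Names of every document below folder_id, found with one join."""
    rows = conn.execute(
        "SELECT h.name FROM hierarchy h "
        "JOIN ancestors a ON a.id = h.id "
        "WHERE a.ancestor = ? ORDER BY h.name",
        (folder_id,))
    return [row[0] for row in rows]


conn = build_demo_db()
add_document(conn, "root", None, "root")
add_document(conn, "ws", "root", "workspaces")
add_document(conn, "f1", "ws", "project")
add_document(conn, "d1", "f1", "spec.odt")
add_document(conn, "sec", "root", "sections")

under_workspaces = descendants(conn, "ws")
```

The trade-off in this sketch is the usual one: writes (and moves, which are not shown) have to maintain the extra rows, in exchange for cheap path-scoped reads.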
Indexing
In addition to the CMIS JOIN support (see below), VCS now supports multiple fulltext indexes.
Database support
VCS has now been tested "in real life" with several database vendors.
We fixed a lot of small issues related to specific vendors.
Nuxeo DM is now completely CI tested (unit tests and functional testing) against target DBs.
New APIs
Some new APIs have been introduced to manage import and replication.
Nuxeo DocumentManager now supports CMIS queries (including JOINs).
CMIS
We are actively working on Chemistry and CMIS; the public review of CMIS 1.0 has just started.
We will publish a demo server based on Chemistry + Nuxeo in a few weeks.
This tool provides a way to migrate JCR-based repositories to VCS.
This add-on contains two parts:
one exporter (JCR on 5.1.6 and 5.2)
one importer (VCS on 5.3)
Documentation about data migration is available here.
Improving the developer experience
Jetty / Tomcat support for DM
The Nuxeo distribution now supports deploying Nuxeo DM on Jetty and Tomcat.
These distributions do not include any Java EE support (transactions, EJB3, JCA...), but provide the same ECM features.
These packages mainly target development environments, since startup is significantly faster than with a complete JEE container (less than 40s for a full startup).
In the near future (about one month), we will add a Transaction Manager to the Tomcat package, which will make it completely ready for production.
JBoss speed-up
The JBoss deployer has been optimized to start faster. A full Nuxeo DM startup on a laptop with JBoss now takes 1m45 instead of 2m30.
NB: this speed improvement should be very significant on Windows boxes with a virus scanner intercepting all file system accesses.
Seam Hot reload support
Seam components hot reload is now supported.
This can significantly improve development speed when using the JSF framework.
GWT integration in Nuxeo has been improved so that you can now run a GWT application in hosted mode with a bundled Nuxeo server deployed inside Eclipse as a Jetty server.
It is now easier to build a GWT app that uses the Nuxeo Platform:
in development mode, everything is integrated into Eclipse
we provide a single API jar for all Nuxeo services
On the second day of the CMIS face-to-face meeting we again spent some quality time reading the spec nearly line by line, making sure everything is coherent, and discussing a few important points that people felt were important for their use cases.
Below I'll outline some important changes made to the spec on the first and second day of this meeting. There's more, of course; you may want to follow everything in the CMIS JIRA.
The XML and XHTML property types are gone. No vendor was in support of them, and it was actually quite hard to standardize on exactly what kind of XML would be stored in such a property (well-formed? fragment? etc.). We kept the HTML property type, as many repositories still want to distinguish between "basic text" and "rich text", especially for presentation purposes. If a repository has XML or XHTML properties, it can easily expose them as Strings.
The ability to use paths to get to folders was extended to documents as well (getFolderByPath turns into getObjectByPath). For folders (where paths are well-defined), paths are retrieved through an explicit property "cmis:path", but for documents (which may be multi-filed) we have to be more careful. Whenever a document is retrieved in the context of a folder (getChildren, getDescendants, getObjectParents), its last path segment inside that folder will be available, so that clients can determine a full path for the document. This segment is not a real property of the document, however, as it may change depending on context. Finally, the "cmis:name" property will be only a hint for the repository to choose a path segment for new objects; the only way to be sure of an object's path is through the folder's cmis:path and the aforementioned document path segment.
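As a small illustration of the rule above, here is a sketch of how a client might compute a document's full path from the containing folder's cmis:path and the path segment returned in that folder's context. The function name document_path is hypothetical, and the sketch assumes that folder paths are absolute and use "/" as separator.

```python
def document_path(folder_path, path_segment):
    """Join a folder's cmis:path with a document's path segment in that folder.

    Assumptions of this sketch: folder paths are absolute and "/"-separated,
    and a path segment never contains "/".
    """
    if not folder_path.startswith("/"):
        raise ValueError("expected an absolute folder path")
    if "/" in path_segment:
        raise ValueError("a path segment must not contain '/'")
    # rstrip handles the root folder, whose path is just "/".
    return folder_path.rstrip("/") + "/" + path_segment


# A multi-filed document can have a different path (and even a different
# segment) in each folder where it is filed.
paths = [
    document_path("/workspaces/project", "spec.odt"),
    document_path("/sections/public", "spec-1.odt"),
]
```

Because the segment depends on the folder, a client should recompute the path each time it retrieves the document in a new context rather than caching it as if it were a property.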
ACLs have been available since 0.62, but the exact set of basic permissions that they can expose is hard to pin down. We had cmis:read, cmis:write, cmis:delete and cmis:all, however some vendors have a hard time mapping their native permissions (or pseudo-roles) to such a basic model, and especially to cmis:delete which in itself is ambiguous considering that in a given repository deleting an object may require some permission on the parent and some other permission on the child. To further simplify the model, it's been decided that cmis:delete would go. But fear not, the ACL model is such that each vendor has the possibility of exposing its native permissions, and exposing which of them are required for each of the CMIS operations, so clients will still be able to make good use of ACLs even if not everything about them is standardized.
With ACLs come principals, and some special principals are sufficiently common that it's worthwhile for a client to know their ids. Therefore we added a way for a repository to tell a client what's the principal id for "anonymous", what's the one for "everybody", and we added a way to specify "me" when setting ACLs.
The need for Policies has been discussed as well, as there are no actual uses of them in the spec; they're an abstract placeholder for vendor extensions. Should they go? We now have ACLs after all... But there are already vendors making use of them to expose features of their repositories, so keeping them is good for them, and costs little to others (they're optional after all).
We now have a way to do copies! For the longest time, this wasn't the case. There was strong opposition to adding a copy method, as copy semantics vary widely among repositories (do you copy document relations? ACLs? versions? renditions? streams? folder children? what about multi-filing? etc.). Nevertheless, others and I persistently asked for a way to do copies. The deciding argument this time was that even though in most cases clients can do the copy themselves, by just creating a new object with the same properties as the one to copy, there is a problem with content streams, as they may be multi-gigabyte objects; at a minimum we need a way to copy content streams. After lots of discussion, we decided to introduce a createDocumentFromSource method, which works just like a normal creation except that a source document is also provided. The repository will then use whatever it feels is best from this source document to fill in the created document. Note that we don't specify a way to do folder copies, as these vary too much between implementations.
In AtomPub, if you want to create a document with a content stream, you have several standard ways available. However the only way to do creation in just one call is to embed the content stream in the message, and AtomPub has strong constraints on how you can do that: for XML- or text-related content types, AtomPub mandates that the stream be inlined in clear text (presumably for the benefit of AtomPub readers). But this is problematic as soon as you want to transfer content that is slightly invalid (but nevertheless stored in your repository!), or whose text content encoding is unknown, or is XML where you want to keep exact formatting, comments, prefixes, namespaces and all. Therefore, we added an extension to AtomPub (cmisra:content) that allows base64 transfer of content in all cases.
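Here is a minimal sketch of what embedding a content stream as base64 could look like, using only the Python standard library. The cmisra namespace URI and the child element names (mediatype, base64) are taken from the CMIS 1.0 AtomPub binding as later published, so treat them as assumptions and check them against the draft you target. The helper names build_entry and read_content are hypothetical, and a real entry would also carry the object's properties, which are omitted here.

```python
import base64
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Assumption: namespace URI as published in CMIS 1.0; drafts may differ.
CMISRA_NS = "http://docs.oasis-open.org/ns/cmis/restatom/200908/"


def build_entry(title, media_type, data):
    """Build an Atom entry whose content stream is carried as base64."""
    entry = ET.Element("{%s}entry" % ATOM_NS)
    ET.SubElement(entry, "{%s}title" % ATOM_NS).text = title
    content = ET.SubElement(entry, "{%s}content" % CMISRA_NS)
    ET.SubElement(content, "{%s}mediatype" % CMISRA_NS).text = media_type
    encoded = base64.b64encode(data).decode("ascii")
    ET.SubElement(content, "{%s}base64" % CMISRA_NS).text = encoded
    return entry


def read_content(entry):
    """Return (media type, raw bytes) from an entry built as above."""
    content = entry.find("{%s}content" % CMISRA_NS)
    media_type = content.find("{%s}mediatype" % CMISRA_NS).text
    data = base64.b64decode(content.find("{%s}base64" % CMISRA_NS).text)
    return media_type, data


# XML whose formatting and comments must survive, followed by a byte that
# is not valid UTF-8: inlining this as clear text would not be possible.
raw = b"<?xml version='1.0'?>\n<!-- keep me -->\n<a  x='1'/>\xff"
wire = ET.tostring(build_entry("spec", "text/xml", raw))
media_type, round_tripped = read_content(ET.fromstring(wire))
```

Since the payload is opaque base64 on the wire, the bytes come back exactly as stored, whatever their encoding or well-formedness.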
URI templates had been added to the AtomPub bindings in order to have a non-REST but very fast way for a client to access a document by ID, by path, or to make a query without a POST. URI templates, however, are still a draft, and it's problematic to include them in a standard. Furthermore the URI templates draft specifies many different ways along which variable replacement can be done, including tests, defaults, list delimiters, escaping, etc. We thus decided that a simplified subset would be used: just simple {variable} replacement, with percent-escaping. This solves most of the problems, and is still better than nothing.
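The simplified subset described above is small enough to sketch in a few lines of Python. The function name expand, the example template and the decision to replace unknown variables with an empty string are assumptions of this sketch, not something quoted from the spec.

```python
import re
from urllib.parse import quote

_VARIABLE = re.compile(r"\{([A-Za-z0-9_]+)\}")


def expand(template, variables):
    """Replace each {name} with the percent-escaped value of that variable.

    Assumption of this sketch: a variable missing from `variables`
    expands to the empty string.
    """
    def substitute(match):
        value = variables.get(match.group(1), "")
        # safe="" escapes every reserved character, including "/" and ":".
        return quote(str(value), safe="")

    return _VARIABLE.sub(substitute, template)


# Hypothetical template, as a repository might advertise it.
template = "http://example.org/cmis/path?path={path}&filter={filter}"
url = expand(template, {"path": "/ws/my doc.txt", "filter": "cmis:name"})
# url == "http://example.org/cmis/path?path=%2Fws%2Fmy%20doc.txt&filter=cmis%3Aname"
```

With no tests, defaults or list delimiters to handle, a client in almost any language can implement this with a regular expression and its standard URL-escaping function.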
Today is the third and last day of the meeting, mostly filled with interoperability tests and still more discussions about the spec. Stay tuned for more!
Yesterday was the first day of the CMIS Technical Committee face-to-face meeting. This time we're grateful to Oracle for hosting us in their offices in Boulder, Colorado.
Here are a few highlights of what transpired during this first day.
First, CMIS is really taking hold inside the big companies in this TC. Most of them plan to make available, privately to other TC members, some test versions of the CMIS servers they are working on, to ensure interoperability as early as possible. Of course these face-to-face meetings are also designed as "plugfests", where we set up test servers and let others' clients connect to them, but it's important to have it continue beyond these three days of meetings. It's unfortunate that these servers can't be public, but it's a fact of life inside big companies that you can't publicly speak about or show what you're working on.
Of course in the open source world we have much more latitude, and Nuxeo will be putting up soon a page with instructions for downloading and using Apache Chemistry and the Nuxeo CMIS bindings, as well as a public server that people can use for testing. And with all the code available! :)
Another thing that became clear today is that everybody is pretty happy with the spec as it is, and that we're nearly ready to start the OASIS review process that will first make it go through formal public review, and then open the OASIS vote for CMIS to become a standard. This process takes time (a minimum of four months), so we should start it as early as possible. But this means that, barring problems, CMIS should be a 1.0 standard by the end of the year, which is great news!
Much of the afternoon of this first day was taken up by a paragraph-by-paragraph review of the spec, where we criticized, clarified, reworded, or otherwise discussed every aspect of the spec. This process is long but invaluable, and we all agree that it makes the spec better. It will continue tomorrow and the day after, both for the first part that describes the domain model, and for the AtomPub and SOAP bindings.
By now, most of you should have heard about CMIS, the upcoming specification that promises interoperability between many systems for common content management tasks. The CMIS specification is being driven by an OASIS Technical Committee and is currently still a draft; it is expected to be finalized late 2009 or early 2010.
I won't detail here all that CMIS will bring; this has been covered extensively already and will be even more in the future... No, the purpose of this article is to present Chemistry.
Chemistry
Chemistry is a new Apache project for CMIS that started incubating recently ("incubation" is the term used in the Apache Software Foundation for young projects that still have to prove themselves). Chemistry's goal is to provide general-purpose libraries for interaction using CMIS between a server and a client. These libraries are mainly written in Java, but some JavaScript code has been added as well, and we're open to more.
Chemistry provides a high level API so that a developer can manipulate objects like documents or folders and can call simple methods on them without having to deal with details of a specific low-level communication transport. In addition to that, Chemistry also provides a SPI (Service Provider Interface) for backend developers, making it quite easy to use Chemistry to store documents in a project-specific manner.
Underlying this, Chemistry has implementations for the CMIS transports. CMIS specifies two mandatory transport protocol bindings (one extending AtomPub, for a lightweight RESTful HTTP interface, and another using SOAP for a WebService-based interface), and Chemistry will support both — and probably more in the future.
The current Chemistry code base has an initial version of the API/SPI together with some actual implementations around the AtomPub protocol. Already Chemistry can talk to itself (AtomPub client talking to AtomPub server) and store data in-memory (which is very handy for unit tests). Outside of the Apache code base, Nuxeo has also coded a backend to provide access to Nuxeo 5.2 repositories using Chemistry. Generic CMIS AtomPub clients like CMIS Explorer are able to see a Nuxeo repository through Chemistry for instance.
Chemistry Modules
The following modules will be available in Chemistry:
The APIs: a low-level SPI between a client and a server that mirrors the CMIS specification closely (it is expected that the SPI will be used when either the client or the server implements one of the HTTP protocols defined in CMIS), and a high-level API that wraps the SPI to provide more object-oriented notions of connections, folders and documents, and that hides the nitty-gritty details of the protocols.
A set of common Java utilities around CMIS, for instance a parser to turn CMIS SQL into an AST (Abstract Syntax Tree) that can be reused by different backends, or a generic in-memory implementation of the SPI and API for unit testing.
Four implementations of the SPI for the protocols defined by CMIS: an AtomPub server and client, and a SOAP server and client.
A generic implementation of the API-to-SPI wrapping, so that a third-party implementation of just the SPI can be plugged into the rest of the Chemistry framework. (Some of the four basic protocol implementations may also provide the full API when this is more efficient than using the generic wrapping.)
An implementation of the APIs as a JCR backend.
A set of generic tests for CMIS servers and clients, providing an unofficial TCK for CMIS.
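To make the API/SPI split more concrete, here is a minimal sketch of the layering, written in Python for brevity even though Chemistry itself is mainly Java. Every class and method name below (InMemorySPI, Folder, Document, new_folder and so on) is hypothetical and does not reflect the actual Chemistry interfaces. It only shows the shape of the idea: a flat, id-based SPI that a backend implements, and a generic object-oriented wrapper on top of it.

```python
class InMemorySPI:
    """Low-level layer: flat, id-based calls returning plain data."""

    def __init__(self):
        self._props = {"root": {"cmis:name": "", "kind": "folder"}}
        self._children = {"root": []}
        self._counter = 0

    def create(self, parent_id, name, kind):
        self._counter += 1
        obj_id = "obj-%d" % self._counter
        self._props[obj_id] = {"cmis:name": name, "kind": kind}
        self._children[parent_id].append(obj_id)
        if kind == "folder":
            self._children[obj_id] = []
        return obj_id

    def get_properties(self, obj_id):
        return dict(self._props[obj_id])

    def get_children(self, folder_id):
        return list(self._children[folder_id])


class CMISObject:
    """High-level layer: objects that delegate every call to an SPI."""

    def __init__(self, spi, obj_id):
        self._spi = spi
        self.id = obj_id

    @property
    def name(self):
        return self._spi.get_properties(self.id)["cmis:name"]


class Document(CMISObject):
    pass


class Folder(CMISObject):
    def new_folder(self, name):
        return Folder(self._spi, self._spi.create(self.id, name, "folder"))

    def new_document(self, name):
        return Document(self._spi, self._spi.create(self.id, name, "document"))

    def children(self):
        result = []
        for child_id in self._spi.get_children(self.id):
            kind = self._spi.get_properties(child_id)["kind"]
            cls = Folder if kind == "folder" else Document
            result.append(cls(self._spi, child_id))
        return result


# The application only sees folders and documents, never raw ids or dicts.
root = Folder(InMemorySPI(), "root")
workspace = root.new_folder("workspaces")
workspace.new_document("spec.odt")
names = [child.name for child in workspace.children()]
```

Any other object exposing the same three SPI methods (for instance one that talks to a remote server) could be passed to Folder in place of InMemorySPI, which is the benefit of a generic API-to-SPI wrapper.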
In the future, it is expected that more implementations of the APIs will be available, for example we envision new transports:
A WebDAV-based transport.
An HTTP-based transport less RESTish and more friendly to browsers and JavaScript.
And new backends:
A backend storing documents on the filesystem, with or without metadata.
A backend storing documents in the Google AppEngine Datastore.
A backend storing documents using Microsoft Windows SharePoint Services.
The Pieces of the Puzzle
As you can see, these modules will allow for wide interoperability between systems. Here's a graphical representation of the building blocks:
The User Application speaks the API:
The API can be implemented in many ways. First, it could be a direct backend:
Or, more commonly, the API will be implemented as a client binding for a specific protocol, SOAP or AtomPub:
Each protocol speaks in its own way on the wire:
And this is connected to a server that speaks the protocol as well:
Finally, behind the server, a backend has to store the actual information somewhere:
Anyone is welcome to create new pieces, for instance new protocol bindings:
Or new storage backends:
Now let's see how the main pieces can be plugged together.
The simplest connection is between an application and a direct backend:
If the backend only wants to deal with the SPI, its implementation can reuse the API-to-SPI to provide a full API experience:
When talking through a wire protocol, we plug together a client and a server:
The end result is an application talking to a backend through a wire protocol:
Of course we can get creative and plug many more together:
Development
All of this is still a work in progress (even the spec!), but you should expect rapid changes in the available features in the coming months as the spec settles down, more code is written, more test cases are written, and more testing against third-party implementations is done.
A few weeks ago I gave an interview to Irina Guseva of CMSWire. We touched on the strategic value of CMIS, the history of the Apache Chemistry project, partnerships, open source, future plans around CMIS, and more.
Chemistry has extremely ambitious plans. We believe that it can become the de facto bridge between most of the Java-based content-oriented products, allowing a very wide variety of back-ends and applications to be connected together. And actually Java is not the sole language that this project is targeting, as David Nuescheler is also working on a JavaScript library for CMIS. In the coming months you should see an exponential increase in the functionality that Chemistry provides...