Last week I had the pleasure of attending the second workshop organized by the IKS project in Rome. The goal of this four-year project is to develop a software stack and a set of design guidelines to help CMS developers leverage the promises of knowledge-oriented software and Linked Data.
In the following I will give a brief overview of some of the discussions that happened during those four days and a summary of the Scribo project I presented during the demo sessions on the last day. More complete coverage of the event can be found on the event page of the IKS wiki.
Materialized semantic indexes
Rupert Westenthaler from the Salzburg Research team is working on a very interesting prototype that lets CMS applications run fast, complex graph queries on a knowledge base. It materializes named graph queries into flat Lucene indexes and tracks changes to the knowledge base to detect when the indexes need incremental updates.
To me this sounds a lot like the permanent MapReduce views used to query the CouchDB document database. I really look forward to the release of the first prototype along with some benchmarks to compare this approach with general purpose un-materialized SPARQL engines such as Jena SDB / TDB, Sesame and Virtuoso.
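To make the idea concrete, here is a minimal sketch of a materialized graph query kept up to date incrementally. It is not the Salzburg Research prototype: a plain Python dict stands in for the Lucene index, the `TripleStore` and `MaterializedPathIndex` classes and the `author` / `worksFor` predicates are hypothetical names, and only additions (not deletions) are tracked.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory knowledge base that notifies listeners of new triples."""

    def __init__(self):
        self.triples = set()
        self.listeners = []

    def add(self, s, p, o):
        if (s, p, o) not in self.triples:
            self.triples.add((s, p, o))
            for listener in self.listeners:
                listener(s, p, o)

    def objects(self, s, p):
        return {o for (s2, p2, o) in self.triples if s2 == s and p2 == p}

    def subjects(self, p, o):
        return {s for (s, p2, o2) in self.triples if p2 == p and o2 == o}


class MaterializedPathIndex:
    """Flat index for the two-hop query: document -> author -> employer."""

    def __init__(self, store, first="author", second="worksFor"):
        self.store = store
        self.first, self.second = first, second
        self.index = defaultdict(set)  # employer -> set of documents
        store.listeners.append(self.on_add)
        # Initial build from the triples already in the store.
        for (s, p, o) in list(store.triples):
            self.on_add(s, p, o)

    def on_add(self, s, p, o):
        if p == self.first:
            # New document -> author edge: join with the author's employers.
            for employer in self.store.objects(o, self.second):
                self.index[employer].add(s)
        elif p == self.second:
            # New author -> employer edge: join with the author's documents.
            for doc in self.store.subjects(self.first, s):
                self.index[o].add(doc)

    def documents_for(self, employer):
        # Flat lookup: no graph traversal at query time.
        return set(self.index.get(employer, ()))
```

Once the index exists, `documents_for("acme")` is a single dictionary lookup, while the cost of the join is paid at write time, one changed triple at a time.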
Bridging CMIS and RDF/OWL
During the workshop Gokce Laleci introduced a prototype mapper from JCR to RDF, that is, from content structure to explicit semantic knowledge. The goal is to express the underlying structure (document types and properties) specific to a given CMS content store as a standards-based and interoperable knowledge view that can be directly aggregated by Linked Data crawlers.
She and her team will now work on a similar mapper for the CMIS protocol using Nuxeo DM and Apache Chemistry as the primary integration platform. Another interesting lead would be to translate SPARQL queries into CMISQL when the mapping makes sense and hence allow any CMIS content repository to behave as a SPARQL endpoint.
In the long term, I am not sure whether we want to keep the content and knowledge in separate stores as we do currently in Nuxeo (Nuxeo Core and Jena). It might be simpler and more efficient to combine both in the Core and use such configurable knowledge mappers along with materialized graph queries to implement the semantic features of Nuxeo.
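As a rough illustration of what such a knowledge mapper does, the sketch below turns a content item (a type plus a bag of properties) into N-Triples lines. It is not the prototype shown at the workshop: the `http://example.org/cms/` namespace, the dict layout of `doc` and the function names are assumptions made for the example, and identifiers are assumed to be IRI-safe.

```python
BASE = "http://example.org/cms/"  # hypothetical namespace for this sketch
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def document_to_ntriples(doc):
    """Map one content item to a list of N-Triples lines.

    `doc` is assumed to look like:
    {"id": "42", "type": "Note", "properties": {"title": "Hello"}}
    """
    subject = f"<{BASE}doc/{doc['id']}>"
    # The document type becomes an rdf:type assertion.
    lines = [f"{subject} <{RDF_TYPE}> <{BASE}type/{doc['type']}> ."]
    # Each property becomes one triple with a plain literal as object.
    for name, value in sorted(doc["properties"].items()):
        lines.append(f'{subject} <{BASE}property/{name}> "{escape_literal(value)}" .')
    return lines
```

A real mapper would also have to decide which properties map to existing vocabularies and which values are links to other resources rather than literals.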
Ontology-free semantic indexing
Stephane Gamard (@sgamard) introduced the services offered by the SalsaDev platform. Their startup focuses on leveraging an algorithm able to semantically index any text document (blog posts, web page snippets, Wikipedia articles) and look up semantically related documents in all indexed content without relying on explicit ontologies or topic classification. Their approach offers the same advantages as Latent Semantic Analysis but also scales to very large document collections, whereas LSA suffers from quadratic lookup times that make it unusable in practice.
This approach is very similar to a semantic hashing prototype I have been working on during my idle weekends for quite some time now. The short-term goal is to implement an image search by similarity feature for the future Nuxeo Digital Asset Management product. In the longer term the same algorithm should be adapted to also work for text document similarity search.
Those purely data-driven approaches are interesting for at least two reasons:
- they allow for a natural implementation of the unstructured "query by example" paradigm,
- they can be combined with more structured semantic extractions to perform disambiguation in a named entity recognition component, for instance.
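To give a feel for "query by example" over hashed documents, here is a small sketch. It uses SimHash over word counts, which is a different and much simpler technique than either the SalsaDev algorithm or the semantic hashing prototype mentioned above: it only captures word overlap, not meaning. All function names are made up for the example.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # length of the binary signature


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text):
    """Compute a 64-bit SimHash signature from the word counts of `text`."""
    weights = [0] * BITS
    for token, count in Counter(tokens(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(BITS):
            # Each token votes for or against every bit of the signature.
            weights[bit] += count if (h >> bit) & 1 else -count
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def query_by_example(example, corpus, limit=3):
    """Rank documents of `corpus` (id -> text) by signature distance to `example`."""
    target = simhash(example)
    ranked = sorted(
        (hamming(target, simhash(text)), doc_id) for doc_id, text in corpus.items()
    )
    return [(doc_id, distance) for distance, doc_id in ranked[:limit]]
```

The query is just another document: it is hashed the same way as the indexed content and compared bit by bit, with no ontology or query language involved. A real system would store the signatures and use an index over them instead of rehashing and scanning the whole corpus.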
Using UIMA for economic intelligence
Tommaso Teofili (@tommasoteofili) from the Apache UIMA team demoed a real application of semantic knowledge extraction to monitor the temporal evolution of real estate market prices in the Rome area. The asset price data, categorized by surface and number of rooms, is automatically extracted from the raw unstructured content of public ad web pages and aggregated in a relational database that feeds a charting and reporting user interface.
The data extraction magic is performed by a UIMA chain that wraps the online semantic engines provided by the AlchemyAPI web service. Such semantic lifting services are typically what Nuxeo aims to provide as part of the platform without relying on third party service providers.
Incidentally Peter Mika from Yahoo! Research is working on a similar prototype to find his next flat in Barcelona.
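The extract-then-aggregate pipeline of the real estate demo can be sketched in a few lines. This is only an illustration under strong assumptions: a regular expression stands in for the UIMA chain and the AlchemyAPI engines, the ad wording ("rooms", "sqm", "EUR") is invented, and an in-memory SQLite table plays the role of the relational database.

```python
import re
import sqlite3

# Hypothetical ad format: "<n> rooms ... <m> sqm ... <price> EUR"
AD_PATTERN = re.compile(
    r"(?P<rooms>\d+)\s+rooms?.*?(?P<surface>\d+)\s*sqm.*?(?P<price>\d[\d,]*)\s*EUR",
    re.IGNORECASE | re.DOTALL,
)


def extract_ad(text):
    """Return (rooms, surface, price) from raw ad text, or None if not found."""
    match = AD_PATTERN.search(text)
    if match is None:
        return None
    return (
        int(match.group("rooms")),
        int(match.group("surface")),
        int(match.group("price").replace(",", "")),
    )


def average_price_per_sqm(ads):
    """Aggregate extracted ads in SQL: rooms -> average price per square meter."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ad (rooms INTEGER, surface INTEGER, price INTEGER)")
    rows = [row for row in map(extract_ad, ads) if row is not None]
    db.executemany("INSERT INTO ad VALUES (?, ?, ?)", rows)
    query = (
        "SELECT rooms, AVG(1.0 * price / surface) "
        "FROM ad GROUP BY rooms ORDER BY rooms"
    )
    result = dict(db.execute(query).fetchall())
    db.close()
    return result
```

Adding a date column to the table would be enough to chart the evolution of prices over time, which is what the demo application reports.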
Nuxeo and automated Semantic Knowledge extraction
As part of the demo session, I chose to present some of the ongoing work done by Nuxeo and its partners as part of the Scribo project. One of the goals of this project is to extract the occurrences of entities (such as persons, organizations and places) and the semantic assertions between those entities ("Person A" is the CEO of "Company B" or "Person B" has declared that "he will reform the health care system"). To that end we chose to package annotators as chained UIMA Analysis Engines and store the extracted semantic annotations as RDF assertions using the classes of the DBPedia ontology. Here are the slides introducing the context of the demo:
The demo itself is two-fold. The first part features the Scribo Workbench, mainly developed by XWiki, which is used to configure and test a chain of UIMA annotators that extract semantic knowledge from the text content of documents coming from heterogeneous content repositories such as a filesystem folder, a CMIS repository (Nuxeo DM) or an XWiki server accessed through its RESTful API.
The user can then combine such a document source with one or several registered annotators into a UIMA chain (a.k.a. Collection Processing Engine), run the process and view the results as an annotated text document directly in the Eclipse UI. The user can also validate or invalidate the extracted annotations and hence incrementally build a validated knowledge base of semantic statements out of their unstructured content. The following screencast shows the details of this scenario using the Stanford Named Entity Recognition annotator on two Wikinews articles:
The second part of the demo showcases the deployment of the previous UIMA chain directly inside a Nuxeo DM 5.3 instance. PDF documents are directly semantically annotated at import time thanks to an asynchronous event listener that calls a new UIMARunnerService packaged as an OSGi component deployed by the Nuxeo Runtime.
The extracted named entities are stored in the default Nuxeo Jena store. Some work is still needed to make the annotations show up correctly in the "preview tab" and make it possible to validate / invalidate extractions from the "knowledge base" tab.
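The overall flow of the demo (a chain of annotators over a text, then storage of the extracted entities as RDF assertions typed with DBPedia ontology classes) can be approximated as follows. This is a toy sketch and not the Scribo code: it does not use UIMA, a simple name lookup stands in for the Stanford Named Entity Recognition annotator, and the `http://example.org/entity/` namespace and all Python names are assumptions.

```python
from dataclasses import dataclass

DBPEDIA = "http://dbpedia.org/ontology/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ENTITY_BASE = "http://example.org/entity/"  # hypothetical namespace


@dataclass(frozen=True)
class Annotation:
    begin: int
    end: int
    text: str
    entity_class: str  # local name of a DBPedia ontology class


def gazetteer_annotator(names, entity_class):
    """Build an annotator that marks every occurrence of the given names."""
    def annotate(text, annotations):
        for name in names:
            start = text.find(name)
            while start != -1:
                annotations.append(
                    Annotation(start, start + len(name), name, entity_class)
                )
                start = text.find(name, start + len(name))
        return annotations
    return annotate


def run_chain(text, annotators):
    """Run the annotators in order, each one enriching the shared annotations."""
    annotations = []
    for annotate in annotators:
        annotations = annotate(text, annotations)
    return sorted(annotations, key=lambda a: a.begin)


def to_ntriples(annotations):
    """Express each distinct entity as an rdf:type assertion in N-Triples."""
    triples = set()
    for a in annotations:
        subject = ENTITY_BASE + a.text.replace(" ", "_")
        triples.add(f"<{subject}> <{RDF_TYPE}> <{DBPEDIA}{a.entity_class}> .")
    return sorted(triples)
```

For example, chaining `gazetteer_annotator(["Nuxeo"], "Organisation")` and `gazetteer_annotator(["Rome"], "Place")` over a sentence yields one annotation per mention, which `to_ntriples` then reduces to one typed assertion per entity. Validation by the user would amount to filtering the annotations before this last step.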
Remember this is just the beginning and we plan to support all languages significantly represented in Wikipedia along with finer grained entity classes. You can also get an overview of the global semantic R&D effort at Nuxeo on our Jira.
Last but not least, the showcased demo can be installed on your own Nuxeo DM 5.3 instance by deploying a simple plugin as explained on the UIMA page of the Nuxeo wiki. Beware that this is really alpha alpha work and should not be deployed on a production setup.