[OOo] New Thesaurus file format for OOo 2.0

The thesaurus file format will change from OOo version 1.x to 2.x.

The engine, MyThes, was developed by Kevin Hendricks (OOo lingucomponent project lead). A standalone version is available at
http://lingucomponent.openoffice.org/thesaurus.html
 
The new format is based on WordNet from Princeton University
http://www.cogsci.princeton.edu/~wn/

The main changes introduced are:
  • the data is now stored as plain text, no longer as binary
  • each entry can have multiple meanings and can be morphologically tagged
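
To make the two changes above concrete, here is a minimal sketch of how such a plain-text entry could be read. The layout it assumes (an encoding line, then a "word|count" line followed by that many "(tag)|synonym|synonym..." lines) is my reading of the MyThes data files and is not spelled out in this post; the sample words and the parse_thesaurus helper are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical sample data. The layout is an assumption: one encoding line,
# then "word|count" followed by `count` meaning lines.
SAMPLE = """ISO8859-1
simple|2
(adj)|elementary|uncomplicated
(noun)|herb|simpleton
"""

Meaning = Tuple[str, List[str]]  # (morphological tag, synonyms)


def parse_thesaurus(text: str) -> Dict[str, List[Meaning]]:
    """Parse the assumed plain-text layout into {word: [(tag, synonyms), ...]}."""
    lines = text.splitlines()
    entries: Dict[str, List[Meaning]] = {}
    i = 1  # line 0 is the encoding name
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between entries
            continue
        word, count = lines[i].rsplit("|", 1)
        n = int(count)
        meanings: List[Meaning] = []
        for line in lines[i + 1 : i + 1 + n]:
            tag, *synonyms = line.split("|")
            meanings.append((tag.strip("()"), synonyms))
        entries[word] = meanings
        i += 1 + n  # skip the header line and its meaning lines
    return entries


if __name__ == "__main__":
    for tag, synonyms in parse_thesaurus(SAMPLE)["simple"]:
        print(tag, synonyms)
```

Each meaning keeps its own tag, so a word can be listed once as an adjective and once as a noun with different synonyms.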

This new format is incompatible with the old one, so existing thesauruses will not work in OOo 2.0.

I'm working on a small program that translates the old thesauruses to the new format. It is an OOo macro that uses the thesaurus API (mainly the com.sun.star.linguistic2.Thesaurus service available in OOo 1.1.x) together with the old .idx file, which is plain text.
Once the data has been transformed (that is, the .dat file has been created), the new .idx index file is generated using a Perl script Kevin wrote.
It is almost finished and will be released under a free licence so that other native-lang OOo projects can convert their own thesauruses if needed.
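
The index-generation step can be sketched as follows. This is not Kevin's Perl script, only a Python illustration under assumptions: that the .dat file uses the "word|count" layout followed by count meaning lines, and that the .idx file consists of the encoding line, the number of entries, and then sorted "word|byte offset" lines. The build_index name is hypothetical.

```python
def build_index(dat: bytes) -> bytes:
    """Build an index for a .dat file, assuming the layout described above.

    Works on bytes so that offsets are byte positions in the .dat file,
    whatever its encoding.
    """
    lines = dat.splitlines(keepends=True)
    encoding = lines[0].rstrip(b"\r\n")
    offset = len(lines[0])  # first entry starts right after the encoding line
    entries = []
    i = 1
    while i < len(lines):
        head = lines[i].rstrip(b"\r\n")
        if not head:  # tolerate blank lines
            offset += len(lines[i])
            i += 1
            continue
        word, count = head.rsplit(b"|", 1)
        entries.append((word, offset))
        n = int(count)
        # Advance past the header line and its n meaning lines.
        for line in lines[i : i + 1 + n]:
            offset += len(line)
        i += 1 + n
    entries.sort()  # plain byte-wise sort (an assumption about the ordering)
    out = [encoding, str(len(entries)).encode("ascii")]
    out += [word + b"|" + str(off).encode("ascii") for word, off in entries]
    return b"\n".join(out) + b"\n"


if __name__ == "__main__":
    sample = (
        b"ISO8859-1\n"
        b"simple|1\n(adj)|elementary\n"
        b"plain|1\n(adj)|unpatterned\n"
    )
    print(build_index(sample).decode("ascii"))
```

Storing byte offsets means a reader can seek straight to an entry in the .dat file instead of scanning it from the top.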

Concerning morphological information (verb, noun, adjective ...), which is currently missing for all entries, Myriam's work (see her blog) will be of great help in generating it.

Posted by Laurent Godard @ 03/03/2005 12:34 PM. - Categories: openoffice
