Summary of the French grammar checker project

I have recently started working on a free French grammar checker that could be integrated into OpenOffice.org. Myriam Lechelt initiated this project two years ago by adapting An Gramadóir, a Gaelic grammar checker developed by Kevin Scannell. However, this tool turned out not to be well suited to French grammar. Myriam had also analyzed other grammar checkers, among them LanguageTool, a rule-based style and grammar checker initially developed for English by Daniel Naber and then extended to languages such as German, Polish and Hungarian. Myriam rejected it at the time, but it has progressed a lot since then, so we have decided to build our French grammar checker on the new version.

In her work, Myriam gave some leads for creating a new grammar checker for French. For example, she advises segmenting sentences into chunks, an intermediate level between the sentence and the word. She also suggests using unification grammar and feature structures to find grammar mistakes.

How do grammar checkers work?

First of all, a tokenizer segments the text into sentences and words. Then a tagger assigns one or more tags to each token, containing morphosyntactic information such as gender, number, tense and person. Many words have several tags, so disambiguation is often necessary to eliminate the tags that are inappropriate in a given context and keep only the correct one. Disambiguation can be either statistical or rule-based. Statistical methods need a tagged training corpus, and the grammar checking then depends heavily on this corpus. The rule-based method requires a large number of hand-written rules, each describing the context in which a word must have a certain tag. This second method is easier to control.

Finally, the text is matched against rules to check the grammar. These can be either grammar rules or error rules. Grammar rules describe what is correct, and everything that does not match them is considered an error.
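
The difference between grammar rules and error rules can be illustrated with a small sketch. It is not taken from any existing checker: the tagset (DET, N, ADJ, V), the rule sets and the function names are all assumptions made for the example, and sentences are reduced to their tag sequences.

```python
# Hypothetical toy tagset: DET = determiner, N = noun, ADJ = adjective, V = verb.

# Grammar rules: whole tag sequences that are declared correct.
GRAMMAR_RULES = {
    ("DET", "N", "V"),
    ("DET", "ADJ", "N", "V"),
}

# Error rules: tag sequences that are declared wrong wherever they occur.
ERROR_RULES = {
    ("DET", "DET"),
}


def check_with_grammar_rules(tags):
    """Return True (error) unless the whole sequence is a known-good one."""
    return tuple(tags) not in GRAMMAR_RULES


def check_with_error_rules(tags):
    """Return True (error) only if a known-bad sequence occurs somewhere."""
    tags = tuple(tags)
    return any(
        tags[i:i + len(rule)] == rule
        for rule in ERROR_RULES
        for i in range(len(tags) - len(rule) + 1)
    )


if __name__ == "__main__":
    correct_but_unlisted = ["DET", "N", "ADJ", "V"]   # e.g. "le chat noir dort"
    wrong_and_listed = ["DET", "DET", "N", "V"]       # e.g. "le le chat dort"
    for tags in (correct_but_unlisted, wrong_and_listed):
        print(tags, check_with_grammar_rules(tags), check_with_error_rules(tags))
```

In this sketch the error-rule checker stays silent on anything it has no rule for, whereas the grammar-rule checker flags the correct sequence DET N ADJ V simply because no grammar rule covers it.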
This approach can be very annoying, since it wrongly flags many correct sentences whenever the rules are not exhaustive. Error rules, on the contrary, describe what is wrong, and everything that matches them is considered an error. But even with a very large number of rules, it is impossible to anticipate all mistakes, so some errors will always go undetected. This is still preferable to false detections.

About LanguageTool

LanguageTool is a style and grammar checker developed by Daniel Naber. It is composed of several Java components that successively perform the tokenization into sentences and words, the tagging, and the detection of grammar mistakes.

There is no disambiguation after the tagging, so many words can carry more than one tag. However, a disambiguator interface has been implemented for the languages in which the lack of disambiguation is a problem. Error detection is based on error rules written in XML. Each rule has an identifier (id), a name, a pattern describing the context of the mistake, a message explaining the mistake, and examples showing a correct and an incorrect sentence for that mistake.

Tests with French and problems encountered

We tested the few rules ported from An Gramadóir and written by Myriam Lechelt. We immediately noticed that the absence of disambiguation would be a major problem for checking French: the detection of mistakes almost always failed because of ambiguous words.

We tried to get around this problem by modifying the rules to take ambiguity into account, but we realized that it would be very tedious to build every rule that way. The best solution is to implement a disambiguator after the tagging, and that is what we will try to do. We also became aware of the problem raised by the structure of the rules, and more precisely by the pattern in the rules and the method of rigid pattern matching.
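
To show why ambiguous tags get in the way of such rules, here is a minimal Python sketch. It mirrors the rule fields listed above (id, name, pattern, message, examples) but it is only a stand-in: it does not reproduce LanguageTool's XML format or its actual matching logic, and the lexicon, the "category:gender" tags, the rule and the two matching modes are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ErrorRule:
    id: str
    name: str
    pattern: tuple            # one expected tag per consecutive token
    message: str
    correct_example: str
    incorrect_example: str


# Hypothetical lexicon: each word maps to ALL its possible tags (no disambiguation).
LEXICON = {
    "la": {"DET:f", "PRO:f"},       # article or pronoun
    "le": {"DET:m", "PRO:m"},
    "chat": {"N:m"},
    "livre": {"N:m", "N:f"},        # "le livre" (book) / "la livre" (pound)
}

RULE = ErrorRule(
    id="DET_NOUN_GENDER",
    name="Feminine determiner before masculine noun",
    pattern=("DET:f", "N:m"),
    message="The determiner and the noun do not agree in gender.",
    correct_example="le chat",
    incorrect_example="la chat",
)


def tag(sentence):
    """Tokenize on whitespace and return the set of possible tags per token."""
    return [LEXICON[word] for word in sentence.lower().split()]


def matches(rule, sentence, strict):
    """Slide the rule pattern over adjacent tokens.

    strict=True : a token matches only if its single reading is the expected tag.
    strict=False: a token matches if any of its readings is the expected tag.
    """
    readings = tag(sentence)
    size = len(rule.pattern)
    for start in range(len(readings) - size + 1):
        window = readings[start:start + size]
        if strict:
            hit = all(r == {t} for r, t in zip(window, rule.pattern))
        else:
            hit = all(t in r for r, t in zip(window, rule.pattern))
        if hit:
            return True
    return False


if __name__ == "__main__":
    print(matches(RULE, "la chat", strict=True))    # real error, missed
    print(matches(RULE, "la livre", strict=False))  # correct phrase, flagged
```

With strict matching the real error "la chat" is missed because "la" is ambiguous, and with lenient matching the correct phrase "la livre" is flagged because "livre" is ambiguous. In both modes the rule only ever sees a fixed window of adjacent tokens, which is what rigid pattern matching means here.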
Rigid pattern matching requires describing every context in which a mistake can be found, that is to say every possible combination of words, with one rule for each. It is simply impossible to anticipate all of them. We could only write a very large number of rules, which would never be exhaustive and which would be costly to process.

An alternative with chunks and unification

According to Abney, "The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template" (S. P. Abney, 1991, Parsing by chunks).

The internal structure of a chunk is fixed, but the function words inside it all depend on the lexical head and agree with it. Within the sentence, chunks agree with each other, and they can easily be permuted, unlike the words inside a chunk. Feature structures describe each element of a sentence with a list of feature-value pairs. Unification consists in matching the feature structures of different elements. The matching fails if a feature does not have the same value in the feature structures of the elements being tested.

Using both chunks and unification is a very interesting alternative that can make grammar checking much easier. First, by unifying features only, and not grammatical categories, we considerably reduce the number of rules needed, since we no longer have to enumerate all possible combinations of words. Second, the relations between chunks will be very helpful for some checks, such as the agreement between the subject and the verbal chunk, or more generally any agreement between distant words. Indeed, distant relations cannot be checked by a system that only applies pattern matching to the immediate context.

Disambiguation

Myriam Lechelt has built many disambiguation rules for An Gramadóir.
We intended to port them to LanguageTool, so we analyzed the Java files to see how we could add them, and we thought about how to improve disambiguation.

It would be logical to rewrite the rules in XML, since that is the formalism LanguageTool uses for all its rules. Moreover, XML rules can be more easily understood and maintained by linguists, who are not necessarily computer scientists. In some cases, it could be preferable not to disambiguate a word completely, but only its grammatical category. If the features remain ambiguous, some mistakes may go undetected, but if the features are wrongly disambiguated, false errors can be reported, which is much more annoying for the user.

Although we have thought about how to improve disambiguation and how to port the rules to LanguageTool, we have finally decided not to implement it now, since we lack time and improving the grammar checking matters more to us. Instead, we will tag and disambiguate sentences from a corpus of mistakes, and we will use these sentences for the next step, that is to say the grammar checking.
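
To make the idea of chunks and unification discussed earlier more concrete, here is a minimal Python sketch. It assumes flat feature structures represented as dictionaries (no nested features), and the feature names, values and example chunks are hypothetical. It is an illustration of the principle, not the design we will necessarily adopt.

```python
def unify(a, b):
    """Unify two flat feature structures.

    Returns the merged structure, or None if a feature present in both
    has different values (the unification fails).
    """
    result = dict(a)
    for feature, value in b.items():
        if feature in result and result[feature] != value:
            return None
        result[feature] = value
    return result


def unify_all(structures):
    """Unify a sequence of feature structures, e.g. all the words of a chunk."""
    result = {}
    for structure in structures:
        result = unify(result, structure)
        if result is None:
            return None
    return result


if __name__ == "__main__":
    # Nominal chunk "les petits chats": "les" is left unspecified for gender.
    nominal_chunk = unify_all([
        {"number": "pl"},
        {"gender": "m", "number": "pl"},
        {"gender": "m", "number": "pl"},
    ])
    print(nominal_chunk)

    # Agreement between the subject chunk and the verbal chunk.
    print(unify(nominal_chunk, {"number": "pl", "person": "3"}))  # "dorment"
    print(unify(nominal_chunk, {"number": "sg", "person": "3"}))  # "dort"
```

The same two functions check agreement inside a chunk and between chunks, whatever the grammatical categories involved, so no rule has to list the possible word combinations. A failed unification (None) is what would be reported as an agreement error.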