Re: French group separators

From: John Burger (john@mitre.org)
Date: Mon Jul 07 2003 - 22:11:08 EDT

Next message: Allen Haaheim: "Re: French group separators"

Previous message: John Hudson: "Re: Yerushala(y)im - or Biblical Hebrew"
In reply to: Philippe Verdy: "Re: French group separators"
Next in thread: Martin JD Green: "Re: French group separators"
Reply: Martin JD Green: "Re: French group separators"

From: "Philippe Verdy" <verdy_p@wanadoo.fr>

> Unicode already defines with character properties those punctuations
> that terminate sentences. Why would you need to recognize sequences of
> two spaces as meaning an end of sentence???

Ambiguity remains. My colleague David Palmer did some testing of
various algorithms:

http://citeseer.nj.nec.com/palmer97adaptive.html

The simplest heuristic approach, slightly more sophisticated than the
Emacs regular expression someone mentioned, misclassified periods about
8% of the time on an annotated Wall Street Journal corpus. David's
SATZ program, which uses a neural net or a decision tree trained on a
similar corpus, got just above a 1% error rate. A Flex-based English
tokenizer I had built previously got down to 0.9%, using a list of 75
common abbreviations and about 100 rules (not all of which had to do
with sentence-boundary disambiguation). Some later work that David and
I did combined the latter two approaches. If I remember correctly, the
amalgam had a 0.5% error rate on the same evaluation corpus.
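
To make the ambiguity concrete, here is a minimal sketch of the kind of simple heuristic described above: treat a period as a sentence boundary when it is followed by whitespace and a capital letter, unless the preceding word is a known abbreviation. This is not SATZ and not the Flex tokenizer; the abbreviation list and the function name split_sentences are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical abbreviation list for illustration only; it is NOT the
# 75-entry list mentioned in the message.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "corp", "co", "vs"}


def split_sentences(text):
    """Split text at punctuation marks that look like sentence boundaries."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.?!]", text):
        end = match.end()
        rest = text[end:]
        # Candidate boundary: end of text, or whitespace followed by an
        # uppercase letter or an opening quote.
        if rest.strip() and not re.match(r"\s+[A-Z\"']", rest):
            continue
        if match.group() == ".":
            # Look at the word immediately before the period.
            prev = re.search(r"(\S+)$", text[start:match.start()])
            word = prev.group(1).lower() if prev else ""
            if word in ABBREVIATIONS:
                # Assume the period belongs to the abbreviation.
                continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    # Handled correctly: "Mr." and "Inc." are not treated as boundaries.
    print(split_sentences(
        "Mr. Smith works at Acme Inc. in Boston. He left at 5 p.m. yesterday."
    ))
    # Misclassified: "Inc." really does end the first sentence here, but
    # the abbreviation rule suppresses the split.
    print(split_sentences("He works for Acme Inc. Smith runs it."))
```

The second call shows why such rules leave a residual error: an abbreviation followed by a capitalized word can be either mid-sentence or sentence-final, and a fixed list cannot tell the two apart.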

SATZ's results on French and German data were better, hovering around
0.5% - there was less period-ambiguity in those corpora.

Like many natural language phenomena, this problem is harder than it
looks at first glance.

- John Burger
MITRE

This archive was generated by hypermail 2.1.5 : Mon Jul 07 2003 - 23:09:38 EDT