From: John Burger (john@mitre.org)
Date: Mon Jul 07 2003 - 22:11:08 EDT
From: "Philippe Verdy" <verdy_p@wanadoo.fr>
> Unicode already defines with character properties those punctuations
> that terminate sentences. Why would you need to recognize sequences of
> two spaces as meaning an end of sentence???
Ambiguity remains. My colleague David Palmer did some testing of
various algorithms:
http://citeseer.nj.nec.com/palmer97adaptive.html
The simplest heuristic approach, slightly more sophisticated than the
Emacs regular expression someone mentioned, misclassified periods about
8% of the time on an annotated Wall Street Journal corpus. David's
SATZ program, which uses a neural net or a decision tree trained on a
similar corpus, got just above a 1% error rate. A Flex-based English
tokenizer I had built previously got down to 0.9%, using a list of 75
common abbreviations and about 100 rules (not all of which had to do
with sentence-boundary disambiguation). Some later work that David and
I did combined the latter two approaches. If I remember correctly, the
amalgam had a 0.5% error rate on the same evaluation corpus.
SATZ's results on French and German data were better, hovering around
0.5%; there was less period ambiguity in those corpora.
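To make the ambiguity concrete, here is a minimal Python sketch of the kind of simple heuristic described above: treat ".", "!" or "?" followed by whitespace and an uppercase letter as a sentence boundary, unless the word before a period is a known abbreviation. This is not SATZ and not the Flex tokenizer; the function name, the regular expression, and the short abbreviation list are all assumptions made up for illustration.

```python
import re

# Hypothetical abbreviation list, for illustration only. The tokenizer
# mentioned above used about 75 entries, which are not reproduced here.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "corp", "co", "u.s", "vs"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and then an
# uppercase letter, a quote, or an opening parenthesis.
CANDIDATE = re.compile(r"([.!?])\s+(?=[A-Z\"'(])")


def split_sentences(text):
    """Split text into sentences with a naive abbreviation-aware heuristic."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        if match.group(1) == ".":
            # Look at the word immediately before the period.
            words = text[start:match.start(1)].split()
            last = words[-1].lstrip("(\"'") if words else ""
            if last.lower() in ABBREVIATIONS:
                # Assume the period belongs to the abbreviation only.
                continue
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = ("Mr. Smith bought shares in Acme Inc. He paid $3.5 million. "
              "Was it worth it? Analysts disagree.")
    for sentence in split_sentences(sample):
        print(sentence)
```

On the sample text, the sketch correctly keeps "Mr. Smith" together and ignores the decimal point in "$3.5", but it fails to split after "Acme Inc." because that period both ends the abbreviation and ends the sentence. Cases like this are the ones a plain abbreviation list cannot resolve.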
Like many natural language phenomena, this problem is harder than it
looks at first glance.
- John Burger
MITRE