From: John Burger (email@example.com)
Date: Mon Jul 07 2003 - 22:11:08 EDT
From: "Philippe Verdy" <firstname.lastname@example.org>
> Unicode already defines with character properties those punctuations
> that terminate sentences. Why would you need to recognize sequences of
> two spaces as meaning an end of sentence???
Ambiguity remains. My colleague David Palmer did some testing of
The simplest heuristic approach, slightly more sophisticated than the
Emacs regular expression someone mentioned, misclassified periods about
8% of the time on an annotated Wall Street Journal corpus. David's
SATZ program, which uses a neural net or a decision tree trained on a
similar corpus, got just above a 1% error rate. A Flex-based English
tokenizer I had built previously got down to 0.9%, using a list of 75
common abbreviations and about 100 rules (not all of which had to do
with sentence-boundary disambiguation). Some later work that David and
I did combined the latter two approaches. If I remember correctly, the
amalgam had a 0.5% error rate on the same evaluation corpus.
SATZ's results on French and German data were better, hovering around
0.5% - there was less period-ambiguity in those corpora.
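To make the period ambiguity concrete, here is a minimal Python sketch contrasting a naive punctuation-based splitter with one that consults an abbreviation list. This is not the Flex tokenizer or SATZ described above. The function names and the short abbreviation set are illustrative assumptions, and the real systems used many more rules or a trained classifier.

```python
import re

# Hypothetical abbreviation list, for illustration only. The tokenizer
# mentioned above used 75 abbreviations that are not reproduced here.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "corp", "co", "u.s", "jr", "vs"}

def naive_split(text):
    """Treat every '.', '!' or '?' followed by whitespace as a boundary."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def abbrev_aware_split(text):
    """Like naive_split, but never break after a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?":
            word = token.rstrip(".!?").lower()
            if token.endswith(".") and word in ABBREVIATIONS:
                continue  # assume the period belongs to the abbreviation
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "Mr. Smith joined Acme Inc. in 1999. He left in 2003."
print(naive_split(text))
print(abbrev_aware_split(text))
```

On this input the naive splitter produces four fragments, while the abbreviation-aware one produces the two intended sentences. The list-based version still fails when an abbreviation really does end a sentence: "He moved to the U.S. Then he left." comes back as a single sentence. Cases like that are where additional rules or a trained classifier are needed.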
Like many natural language phenomena, this problem is harder than it
looks at first glance.
- John Burger