Re: French group separators

From: John Burger
Date: Mon Jul 07 2003 - 22:11:08 EDT


    From: "Philippe Verdy"

    > Unicode already defines with character properties those punctuations
    > that terminate sentences. Why would you need to recognize sequences of
    > two spaces as meaning an end of sentence???

    Ambiguity remains. My colleague David Palmer did some testing of
    various algorithms:

    The simplest heuristic approach, slightly more sophisticated than the
    Emacs regular expression someone mentioned, misclassified periods about
    8% of the time on an annotated Wall Street Journal corpus. David's
    SATZ program, which uses a neural net or a decision tree trained on a
    similar corpus, got just above a 1% error rate. A Flex-based English
    tokenizer I had built previously got down to 0.9%, using a list of 75
    common abbreviations and about 100 rules (not all of which had to do
    with sentence-boundary disambiguation). Some later work that David and
    I did combined the latter two approaches. If I remember correctly, the
    amalgam had a 0.5% error rate on the same evaluation corpus.
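
    To make the distinction concrete, here is a minimal Python sketch of the
    two simpler ideas: a bare "punctuation, whitespace, capital letter"
    heuristic, and the same heuristic filtered through an abbreviation list.
    It is not the Emacs expression, SATZ, or the Flex tokenizer; the regular
    expression, the five-entry abbreviation set, and all function names are
    assumptions made up for illustration.

```python
import re

# Hypothetical, tiny abbreviation list for illustration only. The tokenizer
# described above used about 75 entries, which are not reproduced here.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "corp"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and a capital.
CANDIDATE = re.compile(r"[.!?](?=\s+[A-Z])")


def naive_boundaries(text):
    """Return the offset of every candidate mark, with no further checks."""
    return [m.start() for m in CANDIDATE.finditer(text)]


def abbrev_boundaries(text, abbreviations=ABBREVIATIONS):
    """Like naive_boundaries, but skip periods that end a known abbreviation."""
    result = []
    for pos in naive_boundaries(text):
        # Walk back to the start of the whitespace-delimited word.
        start = pos
        while start > 0 and not text[start - 1].isspace():
            start -= 1
        word = text[start:pos].lower()
        if text[pos] == "." and word in abbreviations:
            continue  # treated as an abbreviation, not a sentence end
        result.append(pos)
    return result


def error_rate(predicted, gold, candidates):
    """Fraction of candidate marks whose boundary label disagrees with gold."""
    predicted, gold = set(predicted), set(gold)
    wrong = sum((c in predicted) != (c in gold) for c in candidates)
    return wrong / len(candidates)


if __name__ == "__main__":
    sample = "Dr. Smith arrived. He met Mr. Jones. They left."
    gold = [17, 35]  # hand-labelled offsets of the true sentence ends
    cands = naive_boundaries(sample)
    print(error_rate(cands, gold, cands))
    print(error_rate(abbrev_boundaries(sample), gold, cands))
```

    Note that the abbreviation filter is itself wrong whenever an
    abbreviation really does end a sentence ("... at Acme Corp. He left."),
    which is one reason a list alone does not settle the question and extra
    rules or a trained classifier are needed.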

    SATZ's results on French and German data were better, hovering around
    0.5% - there was less period-ambiguity in those corpora.

    Like many natural language phenomena, this problem is harder than it
    looks at first glance.

    - John Burger
