Re: French group separators

From: Martin JD Green (mjdgreen@rainbow4.demon.co.uk)
Date: Tue Jul 08 2003 - 06:53:23 EDT


    From: "John Burger" <john@mitre.org>

    > From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    >
    > > Unicode already defines with character properties those punctuations
    > > that terminate sentences. Why would you need to recognize sequences of
    > > two spaces as meaning an end of sentence???
    >
    > Ambiguity remains. My colleague David Palmer did some testing of
    > various algorithms:
    >
    > http://citeseer.nj.nec.com/palmer97adaptive.html
    >
    > The simplest heuristic approach, slightly more sophisticated than the
    > Emacs regular expression someone mentioned, misclassified periods about
    > 8% of the time on an annotated Wall Street Journal corpus. David's
    > SATZ program, which uses a neural net or a decision tree trained on a
    > similar corpus, got just above a 1% error rate. A Flex-based English
    > tokenizer I had built previously got down to 0.9%, using a list of 75
    > common abbreviations and about 100 rules (not all of which had to do
    > with sentence-boundary disambiguation). Some later work that David and
    > I did combined the latter two approaches. If I remember correctly, the
    > amalgam had a 0.5% error rate on the same evaluation corpus.
    >
    > SATZ's results on French and German data were better, hovering around
    > 0.5% - there was less period-ambiguity in those corpora.
    >
    > Like many natural language phenomena, this problem is harder than some
    > think, at first glance.
    >
    > - John Burger
    > MITRE
    >
    I am not a "typographer" by any stretch of the imagination and know little
    of why adding two spaces after the end of a sentence came about. However,
    when I was first presented with the problem of automatically "setting text"
    for the early laser printers from Hewlett Packard back in 1978, I ran into
    this peculiarity.

    During the testing phase of the project we collected documents from around
    the company to see how well they were printed. About half had two spaces
    following the end of sentences. These documents were produced in general by
    our senior secretaries. Almost none of the "technical" documents produced by
    the engineers had the two spaces. It is worth noting that a high proportion
    of those that had two spaces after a dot to mark the end of a sentence also
    had two spaces after exclamation and question marks.

    I did do a little research into why two spaces were used. It appeared that
    the "best" secretarial schools in the London area taught this as the
    "correct" way to type up a document. This was still the case for the latest
    recruits from these schools. I also talked to my mother-in-law, who was a
    senior secretary for an international bank. She repeated that she had been
    taught to always use two spaces to mark the end of sentences (i.e. following
    the dot, exclamation and question mark). Her understanding was that it
    related back to a Victorian rhyme on how to "read" documents aloud. At the
    end of sentences one paused for a count of 2, while for a comma one paused
    for a count of 1. Unusually, this same rhyme stated that following a colon or
    semi-colon one paused for a count of 1 and a half!

    This had led to the typists' house rule that sentences needed twice as much
    space as normal. Then, if the document was to be spread to fill the margins,
    the extra space would be allotted to the ends of sentences first, after
    colons and semi-colons next, then after commas, and finally between words if
    all else failed. Spaces within abbreviations were never to be expanded.
    These house rules
    were finally coded into the product with some "clever" software to detect
    which spaces were which. I'm not sure of our failure rate but laser printers
    were crude at the time and I don't think anyone would have been able to
    tell!
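
    For anyone curious, the house rules can be sketched roughly as below. This
    is only an illustration in Python, not the code that shipped in the
    product, which I no longer have. The abbreviation list, the function names
    and the choice to hand out one extra space at a time in priority order are
    all assumptions made for the example.

```python
# Hypothetical abbreviation list; the real product's list is not known.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "e.g.", "i.e.", "etc."}

# Gap classes in the order they receive extra space (lowest number first).
SENTENCE, COLON, COMMA, WORD, FIXED = 0, 1, 2, 3, 4


def classify_gap(prev_word):
    """Classify the space that follows prev_word."""
    if prev_word in ABBREVIATIONS:
        return FIXED      # spaces after abbreviations are never expanded
    if prev_word.endswith((".", "!", "?")):
        return SENTENCE
    if prev_word.endswith((":", ";")):
        return COLON
    if prev_word.endswith(","):
        return COMMA
    return WORD


def justify(line, width):
    """Pad line out to width, widening gaps in house-rule priority order."""
    words = line.split()
    if len(words) < 2:
        return line
    classes = [classify_gap(w) for w in words[:-1]]
    # Sentence-ending gaps start at two spaces, all others at one.
    gaps = [2 if c == SENTENCE else 1 for c in classes]
    extra = width - (sum(len(w) for w in words) + sum(gaps))
    # Expandable gaps, highest priority first (sorted() is stable, so gaps
    # of the same class keep their left-to-right order).
    order = sorted((i for i, c in enumerate(classes) if c != FIXED),
                   key=lambda i: classes[i])
    k = 0
    while extra > 0 and order:
        # Assumption: one extra space per gap, cycling through the order.
        gaps[order[k % len(order)]] += 1
        extra -= 1
        k += 1
    out = words[0]
    for gap, word in zip(gaps, words[1:]):
        out += " " * gap + word
    return out
```

    Calling justify("Stop here. Dr. Smith said: yes, we go", 41) widens the
    gap after "here." first, then the one after "said:", then the one after
    "yes,", and leaves the space after "Dr." alone. A fixed list like this
    still gets confused by an abbreviation that happens to end a sentence,
    which is the kind of ambiguity John describes above.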

    Martin Green


