Re: French group separators

From: Martin JD Green (mjdgreen@rainbow4.demon.co.uk)
Date: Tue Jul 08 2003 - 06:53:23 EDT

Next message: Venugopala Rao Moram: "Re: UTF-8 to UTF-16LE"

Previous message: santhosh kumar: "UTF-8 to UTF-16LE"
In reply to: John Burger: "Re: French group separators"
Next in thread: John Burger: "Re: French group separators"

From: "John Burger" <john@mitre.org>

> From: "Philippe Verdy" <verdy_p@wanadoo.fr>
>
> > Unicode already defines with character properties those punctuations
> > that terminate sentences. Why would you need to recognize sequences of
> > two spaces as meaning an end of sentence???
>
> Ambiguity remains. My colleague David Palmer did some testing of
> various algorithms:
>
> http://citeseer.nj.nec.com/palmer97adaptive.html
>
> The simplest heuristic approach, slightly more sophisticated than the
> Emacs regular expression someone mentioned, misclassified periods about
> 8% of the time on an annotated Wall Street Journal corpus. David's
> SATZ program, which uses a neural net or a decision tree trained on a
> similar corpus, got just above a 1% error rate. A Flex-based English
> tokenizer I had built previously got down to 0.9%, using a list of 75
> common abbreviations and about 100 rules (not all of which had to do
> with sentence-boundary disambiguation). Some later work that David and
> I did combined the latter two approaches. If I remember correctly, the
> amalgam had a 0.5% error rate on the same evaluation corpus.
>
> SATZ's results on French and German data were better, hovering around
> 0.5% - there was less period-ambiguity in those corpora.
>
> Like many natural language phenomena, this problem is harder than some
> think, at first glance.
>
> - John Burger
> MITRE
>
I am not a "typographer" by any stretch of the imagination and know little
of why adding two spaces after the end of a sentence came about. However,
when I was first presented with the problem of automatically "setting text"
for the early laser printers from Hewlett Packard back in 1978, I ran into
this peculiarity.

During the testing phase of the project we collected documents from around
the company to see how well they were printed. About half had two spaces
following the end of sentences. These documents were produced in general by
our senior secretaries. Almost none of the "technical" documents produced by
the engineers had the two spaces. It is worth noting that a high proportion
of those that had two spaces after a dot to mark the end of a sentence also
had two spaces after exclamation and question marks.

I did do a little research into why two spaces were used. It appeared that
the "best" secretarial schools in the London area taught this as the
"correct" way to type up a document. This was still the case in the latest
recruits from these schools. I also talked to my mother-in-law, who was a
senior secretary for an international bank. She repeated that she had been
taught to always use two spaces to mark the end of sentences (i.e. following
the dot, exclamation and question mark). Her understanding was that it
related back to a Victorian rhyme on how to "read" documents aloud. At the
end of sentences one paused for a count of 2, while for a comma one paused
for a count of 1. Unusually, this same rhyme stated that following a colon or
semi-colon one paused for a count of 1 and a half!

This had led to the typists' house rule that sentences needed twice as much
space as normal. If the document was then to be spread to fill the margins,
the extra space would be allotted to the ends of sentences first, after colons
and semi-colons next, then after commas, and finally between words if all else
failed. Spaces within abbreviations were never to be expanded. These house
rules were finally coded into the product, with some "clever" software to
detect which spaces were which. I'm not sure of our failure rate, but laser
printers were crude at the time and I don't think anyone would have been able
to tell!
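
For anyone who wants to experiment, the house rule can be sketched roughly as
follows. This is only an illustration in Python, not the code that shipped in
the product: the names classify_gap and justify, the abbreviation list, and
the one-space-at-a-time hand-out in priority order are all my assumptions. It
also simplifies "spaces within abbreviations" to "the space following a known
abbreviation", and it measures width in character cells rather than printer
units.

```python
# Gap classes, highest priority first (assumed encoding of the house rule).
SENTENCE, COLON, COMMA, WORD, FIXED = 0, 1, 2, 3, 4

# Illustrative abbreviation list; an assumption, not taken from the product.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def classify_gap(prev_word):
    """Classify the space that follows prev_word."""
    if prev_word.lower() in ABBREVIATIONS:
        return FIXED  # never expanded
    if prev_word.endswith((".", "!", "?")):
        return SENTENCE
    if prev_word.endswith((":", ";")):
        return COLON
    if prev_word.endswith(","):
        return COMMA
    return WORD


def justify(line, width):
    """Pad line to width, giving extra spaces out in priority order."""
    words = line.split()
    if not words:
        return ""
    gaps = [classify_gap(w) for w in words[:-1]]
    # Start from the typists' convention: two spaces after a sentence end.
    sizes = [2 if g == SENTENCE else 1 for g in gaps]
    extra = width - (sum(len(w) for w in words) + sum(sizes))
    # Expandable gaps, sentence ends first, plain word gaps last.
    order = sorted((i for i, g in enumerate(gaps) if g != FIXED),
                   key=lambda i: gaps[i])
    k = 0
    while extra > 0 and order:
        sizes[order[k % len(order)]] += 1  # one space per turn, cycling
        extra -= 1
        k += 1
    parts = [w + " " * s for w, s in zip(words, sizes)]
    parts.append(words[-1])
    return "".join(parts)
```

With a width of 24, justify("Stop. Go on, now: run", 24) puts the first
spare space after "Stop." and the second after "now:", leaving the comma and
the plain word gap alone. A line that is already too long, or that has no
expandable gaps, is returned with its normal spacing.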

Martin Green


This archive was generated by hypermail 2.1.5 : Tue Jul 08 2003 - 07:38:56 EDT