Re: New version of TR29:

From: John Burger (john@mitre.org)
Date: Thu Aug 15 2002 - 13:51:57 EDT


My immediate reaction to this TR was that it was doomed, given how
difficult it is to tokenize text perfectly (I have written a number of
tokenizers for natural language processing, and they are never
complete). However, after reading the draft, I found myself agreeing
that it is reasonable to provide =some= guidance for the 80% solution.
So, I looked at the code for some of my tokenizers. Most of the special
cases covered there are not appropriate for the TR, but I do have the
following suggestion:

Consider adding U+0026 (ampersand) to the MidLetter class. I did a
quick scan through a few million words of New York Times data I have,
and found that most mid-word occurrences are in tokens that should
probably not be broken, e.g.,

   Q&A
   R&R
   AT&T
   P&G
   ...

Exceptions included:

   Ben&Jerry
   How&Why

Perhaps a more conservative rule would apply only when the ampersand
falls between uppercase letters ...
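
To make the suggestion concrete, here is a rough Python sketch of the three
behaviors. It is not the TR's algorithm: it only imitates the "do not break
between letters across a MidLetter" idea with regular expressions. The
apostrophe stands in for the existing MidLetter class purely for illustration,
the uppercase test is ASCII-only, and all names are hypothetical.

```python
import re

# Roughly "a letter": a word character that is not a digit or underscore.
LETTER = r"[^\W\d_]"

# Assumption: the apostrophe is an illustrative stand-in for MidLetter,
# not the class as the TR actually defines it.
WITHOUT_AMP = re.compile(rf"{LETTER}+(?:'{LETTER}+)*")

# The suggestion: treat U+0026 as MidLetter too.
WITH_AMP = re.compile(rf"{LETTER}+(?:['&]{LETTER}+)*")

# Conservative variant: '&' joins only when both neighbors are
# uppercase (ASCII A-Z only in this sketch).
UPPER_AMP = re.compile(
    rf"{LETTER}+(?:(?:'|(?<=[A-Z])&(?=[A-Z])){LETTER}+)*"
)


def words(pattern, text):
    """Return the word-like tokens that pattern finds in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    for sample in ("Q&A", "AT&T", "Ben&Jerry", "How&Why"):
        print(sample,
              words(WITHOUT_AMP, sample),
              words(WITH_AMP, sample),
              words(UPPER_AMP, sample))
```

Under these assumptions the plain MidLetter variant keeps Q&A and AT&T whole
but also keeps Ben&Jerry whole, while the uppercase-only variant splits
Ben&Jerry and How&Why and leaves the abbreviations intact.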

A caveat: I am unfamiliar with analogous cases in languages other than
English.

- John Burger
   MITRE


