RE: Question on script-name assignment

From: Tom Emerson (tree@basistech.com)
Date: Fri Nov 09 2001 - 16:40:35 EST


Marco Cimarosti writes:
> Tom Emerson wrote:
> > One gotcha, that I run into every six months or so, is forgetting that
> > the punctuation characters in the Basic Latin block are classified as
> > Latin script. This trips me up because most of my text processing work
> > involves CJK, so I'll write something to filter Latin characters with
> > (in Rosette notation):
>
> That must be a Rosette-specific behavior: in UTR#24 (and in its database
> <Scripts.txt>), the only ASCII-range code-points classified as "Latin" are
> the upper- and lower-case letters.

Indeed. It turns out that the Rosette script assignments (in the
version I'm using) predate UTR#24 by three or four years and are based
on the information in <Blocks.txt> with some hand editing by engineers
long past.
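
To make the difference concrete, here is a minimal sketch in Python
(not Rosette notation). The function names are hypothetical, and the
ASCII ranges are hard-coded for illustration rather than read from
Blocks.txt or Scripts.txt, so treat it as an approximation of the two
behaviors and not as either implementation.

```python
# Hypothetical illustration only: neither function is Rosette's API.

def is_latin_by_block(ch: str) -> bool:
    """Block-derived assignment: every code point in the Basic Latin
    block (U+0000..U+007F) counts as Latin, punctuation included."""
    return ord(ch) <= 0x7F


def is_latin_by_script(ch: str) -> bool:
    """Script-derived assignment, restricted to the ASCII range:
    only the upper- and lower-case letters count as Latin."""
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


def strip_latin(text: str, is_latin) -> str:
    """Drop every character that the given predicate calls Latin."""
    return "".join(ch for ch in text if not is_latin(ch))


sample = "漢字, ABC!"
print(strip_latin(sample, is_latin_by_block))   # comma, space and "!" are lost too
print(strip_latin(sample, is_latin_by_script))  # only the letters A, B, C are removed
```

With the block-derived predicate, a filter meant to remove Latin text
from CJK input also removes the ASCII punctuation and spaces, which is
the gotcha described above.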

The next major Rosette release, which includes Unicode 3.1 support,
will use the data from UTR#24, and my problem will mostly go away.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


