Re: (SC22WG20.3292) 14651 draft table updated

From: Mark Davis (mark@macchiato.com)
Date: Tue Jan 02 2001 - 12:30:40 EST


I believe the rationale is that currency signs in general (dollar, yen,
euro, etc.) should not be ignorable. If you take a look at the UCA tables
(the 14651 and UCA data are sync'ed) you will find more information about
these characters. UCA marks them explicitly, and provides for different
weightings to be set by API.

http://www.unicode.org/unicode/reports/tr10/: documentation
http://www.unicode.org/unicode/reports/tr10/basekeys.txt: one of the data
tables

Thus an API can make all of these ignorable or not. Also, because the
"alternate weighted" characters occur in a range, an API can set the top of
that range down (or up!) even without tailoring. That is, the current
boundary between alternate weighted characters and normal is between:

0BF2 ; [*0686.0020.0002.0BF2] # TAMIL NUMBER ONE THOUSAND
and
02D0 ; [.0687.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON
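
The two table lines above differ in the character after the opening
bracket: "*" marks an alternate weighted character and "." a normal one.
As a minimal sketch only, the following Python reads lines of exactly the
shape quoted in this message. The function name parse_line and the
returned field names are hypothetical, and the sketch assumes a single
four-field collation element per line, which real table lines may not
satisfy.

```python
import re

# Matches lines shaped like the ones quoted in this message, e.g.
#   0BF2 ; [*0686.0020.0002.0BF2] # TAMIL NUMBER ONE THOUSAND
# Assumption: exactly one collation element with four hex fields per line.
LINE = re.compile(
    r"^(?P<cp>[0-9A-F]{4,6}) ; "
    r"\[(?P<flag>[*.])(?P<w>[0-9A-F]{4}(?:\.[0-9A-F]{4}){3})\]"
    r" # (?P<name>.*)$"
)

def parse_line(line):
    """Split one table line into code point, flag, weights and name."""
    m = LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognized table line: {line!r}")
    return {
        "cp": int(m["cp"], 16),
        # "*" marks an alternate weighted character, "." a normal one.
        "alternate": m["flag"] == "*",
        "weights": tuple(int(w, 16) for w in m["w"].split(".")),
        "name": m["name"],
    }

below = parse_line("0BF2 ; [*0686.0020.0002.0BF2] # TAMIL NUMBER ONE THOUSAND")
above = parse_line("02D0 ; [.0687.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON")
```

Here below["alternate"] is true and above["alternate"] is false, which is
the boundary described above.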

An API can set that boundary down, for example, so that only spaces are
ignored, to between:

2007 ; [*0209.0020.001B.2007] # FIGURE SPACE; COMPAT
and
005F ; [*0209.0021.0002.005F] # LOW LINE; COMPATSEQ

It could also set the boundary up, for example, so that currency signs are
also ignored, to between:

20AC ; [.06A5.0020.0002.20AC] # EURO SIGN
and
2104 ; [.06A6.0020.0002.2104] # CENTRE LINE SYMBOL
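
The three boundary positions above can be illustrated with a short Python
sketch. This is not the ICU API or any other library's interface; the
names CE, TABLE and is_ignored are hypothetical. The weights are copied
from the table lines quoted in this message. Because FIGURE SPACE and
LOW LINE share the primary weight 0209 in that data, the sketch compares
the boundary on the whole (primary, secondary, tertiary) tuple, which is
an assumption of this example; real implementations may express the
boundary differently.

```python
from typing import NamedTuple

class CE(NamedTuple):
    """First three weights of a collation element (hypothetical type)."""
    primary: int
    secondary: int
    tertiary: int

# Weights copied from the table lines quoted in this message.
TABLE = {
    0x2007: CE(0x0209, 0x0020, 0x001B),  # FIGURE SPACE
    0x005F: CE(0x0209, 0x0021, 0x0002),  # LOW LINE
    0x0BF2: CE(0x0686, 0x0020, 0x0002),  # TAMIL NUMBER ONE THOUSAND
    0x02D0: CE(0x0687, 0x0020, 0x0002),  # MODIFIER LETTER TRIANGULAR COLON
    0x20AC: CE(0x06A5, 0x0020, 0x0002),  # EURO SIGN
    0x2104: CE(0x06A6, 0x0020, 0x0002),  # CENTRE LINE SYMBOL
}

def is_ignored(cp, top):
    """True if the element for cp sorts at or below the boundary `top`."""
    # NamedTuple instances compare lexicographically, like plain tuples.
    return TABLE[cp] <= top

DEFAULT_TOP = TABLE[0x0BF2]    # boundary as it currently stands
SPACES_ONLY = TABLE[0x2007]    # boundary set down: only spaces ignored
WITH_CURRENCY = TABLE[0x20AC]  # boundary set up: currency signs ignored too
```

With DEFAULT_TOP, LOW LINE is ignored and EURO SIGN is not; with
SPACES_ONLY, LOW LINE is no longer ignored; with WITH_CURRENCY, EURO SIGN
is ignored while CENTRE LINE SYMBOL is not.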

You might also find the collation design document useful; it describes the
changes going on in ICU for collation. (Version #11 doesn't yet describe the
syntax/API for setting the alternate weighting boundary.)

http://oss.software.ibm.com/icu/develop/ICU_collation_design.htm
http://oss.software.ibm.com/icu/develop/draft_processed_UCA.txt

Mark

----- Original Message -----
From: "Christophe Pierret" <christophe.pierret@businessobjects.com>
To: "'Kenneth Whistler'" <kenw@sybase.com>; <sc22wg20@dkuug.dk>
Cc: <mark@macchiato.com>
Sent: Tuesday, January 02, 2001 07:12
Subject: RE: (SC22WG20.3292) 14651 draft table updated

> Topic: Some users have trouble with collation of DOLLAR SIGN in the 14651
> table.
>
> I am an implementer of the Unicode Collation Algorithm and a 14651-based
> collation algorithm.
> We have customized the 14651 table to meet specific language needs.
> For example, English users see the original 14651 while French ones see
> the table tailored with the French-Canadian annex to 14651 (and other
> languages may also have tailored tables).
>
> I was able to answer all questions arising from technical support relating
> to collation in our new product (where ISO 14651-like collation is
> implemented) apart from one related to the DOLLAR SIGN. Note: we migrated
> from the Windows CompareString APIs to a 14651-based/Unicode Collation
> Algorithm implementation.
>
> The support engineer told me this (in French with spelling/grammar errors
> ;-)
> >J'ai a present deux cas concernants ce probleme. Et le client dit (j'ai
> >reproduit moi aussi) que les caractere "ignorable" sont (non seulement
> >espace mais) tous les characteres non-chiffre-lettre sauf "$".
>
> Which roughly translates into:
> I now have two support cases involving this issue. One of the customers
> says (I reproduced it also) that "ignorable" [4th level collated ones]
> characters are (not only space but) any character that is not a digit or
> letter apart from $.
>
> The customer, working under French Win95, seems to refer to printable
> Unicode characters that are present in the Windows Western European
> code page (1252).
>
> I answered that the choice was perhaps based on the fact that "$" is used
> as the predominant currency sign and is, in some sense, more meaningful
> than a letter to a lot of (American and financial) users.
>
> Could somebody tell me the rationale behind the choice of DOLLAR SIGN
> having a non-"IGNORE" weight at the first level, as opposed to other ASCII
> characters such as # NUMBER SIGN, for example?
>
> Annex:
>
> From: http://www.iso.ch/ittf/ISO14651_2000_TABLE1.htm
>
> <U0024> <S0024>;<BASE>;<MIN>;<U0024> % DOLLAR SIGN
> <U0023> IGNORE;IGNORE;IGNORE;<U0023> % NUMBER SIGN
>
>
> Christophe Pierret
> Technical Lead, Analytical Reporting Web Products
>
> Business Objects S. A.
> 157/159, rue Anatole France
> 92309, Levallois-Perret, France
> Telephone: (33) 1 41 25 32 52
> Fax: (33) 1 41 25 31 00
> E-mail: cpierret@businessobjects.com
>
> > -----Original Message-----
> > From: Kenneth Whistler [mailto:kenw@sybase.com]
> > Sent: Tuesday, December 12, 2000 2:05 AM
> > To: sc22wg20@dkuug.dk
> > Cc: kenw@sybase.com; mark@macchiato.com
> > Subject: (SC22WG20.3292) 14651 draft table updated
> >
> >
> > WG20'ers:
> >
> > I have posted up an updated draft of the Common Template Table
> > for Amendment 1 to ISO 14651.
> >
> > ftp://sc22wg20@anubis.dkuug.dk/datafiles/symdump-3.0.1d4.txt
> >
> > (As noted before, you may have to go in with a command line ftp
> > program to get there. A browser can't handle that URL, since
> > logging in to anubis.dkuug.dk as sc22wg20 seems to engage in
> > some fancy redirection into some private area.)
> >
> > This draft is a minor update, where I am trying to stay current
> > with some of the input I have received to date.
> >
> > 1. Thaana fixes. The dotted Arabic Thaana letters are reordered now to
> > accord with what appears to be accepted practice. And with
> > John's input and today's information from Michael Everson, it
> > appears that the Thaana vowels also get primary weights. I
> > have restructured the table accordingly.
> >
> > 2. The Yi radicals were not being weighted comparably to the
> > CJK radicals -- something that was causing their primary weights
> > to be too high for elements that are going to be ignorable
> > symbols. That was corrected, so they now (at least preliminarily)
> > are treated as parallel with the CJK radicals. That fixes a
> > problem with implementations of the Unicode Collation Algorithm
> > on these characters as well.
> >
> > 3. The two Sindhi symbols (06FD and 06FE) were missing their
> > collation equivalents to hamza + variant and meem + variant,
> > respectively. Those have been added, and their weights are now
> > reasonable. (Look under the respective Arabic letters.)
> >
> > I have a question outstanding among Inuktitut experts regarding the
> > ordering of some elements of UCAS for Nunavut and Nunavik. More
> > on that later.
> >
> > Also, I have received considerable input on Khmer ordering issues
> > from Maurice Bauhahn, but have some outstanding questions that need
> > to be resolved before attempting to roll the results into the table.
> > The resolution of Khmer sorting should also shed some light on
> > what to do with Myanmar, which shares a number of structural
> > similarities with Khmer.
> >
> > Some structural issues brought up by Mark Davis and by Kent
> > Karlsson are postponed for further study, as their resolution will
> > likely require more delicate surgery on the sifter. Some of that
> > is scheduled for the last week of the year. After that I'll also
> > make available the related Unicode Collation Algorithm tables,
> > as well as the symdump table for 14651.
> >
> > Make sure you review the sections of the table you have an interest
> > in and get your feedback to me. The sooner you can provide feedback,
> > the more likely the results will get into the draft table we use
> > as the basis for developing a PDAM draft at the spring WG20
> > meeting.
> >
> > --Ken
> >


