Re: Computing default UCA collation tables

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 20 2003 - 20:58:38 EDT

Next message: Roozbeh Pournader: "Re: Persian or Farsi? (was RE: Decimal separator with more than one character?)"

Previous message: Philippe Verdy: "Re: Computing default UCA collation tables"
Maybe in reply to: Philippe Verdy: "Computing default UCA collation tables"
Next in thread: Mark Davis: "Re: Computing default UCA collation tables"

Philippe Verdy asked:

> > The QQKN entries are compatibility decompositions involving expansions.
> > The QQKM entries are compatibility decompositions with combining
> > sequences involved (just a few Arabic positional forms with harakat).
>
> Although comments are "informative", it would make sense to explain
> and verify that they can be considered as such, and so we can trust
> them: that they involve such expansions and combining sequences.

That is certainly feasible, although low priority. Another
possibility is that if these comments are causing confusion, they
can just be suppressed in the next version of allkeys.txt.

> (however this may just be a temporary solution that may change
> later without notice if other requirements are needed to satisfy
> ISO14651 needs).

It has nothing to do with ISO/IEC 14651 needs. No such comments
are printed in the 14651 table; the CTT table has a completely
separate set of comments, not on a per-character basis.

> > In the input source file I have been citing.
>
> You just quote a few entries from your file. Do you mean I
> still need to discover them from the careful analysis of
> "allkeys.txt" ?
> Or your input file is published somewhere (even though
> it is not normative as it contains "private use" characters).

In the past, versions have been made available to the encoding
committees reviewing the final tables.

John Cowan also inquired about this. I suppose if enough people
feel it is useful, the input file could be published as part
of the UCA (as a link from the text), with the appropriate
caveats. There isn't much that people could do with it,
however, except look at it to figure out how the normative
table (allkeys.txt) got some of the weight values it did.

> Your input file just appears to reuse the same format as UCD,

Yes, the entries for the input file are extracted from the UCD,
omitting irrelevant fields. They then need to be further
processed, to establish the primary ordering and to deal
with all the special cases.

> So my question is: do your "ad hoc" decompositions use
> **ALL** NFKD decompositions (in addition to "ad hoc" collation
> decompositions), or are there some possible exceptions
> that need to be decomposed differently in your file?

There are exceptions. You saw one example in the PROSGEGRAMMENI
entry I cited earlier. As another, all the compatibility
decompositions that equate spacing diacritics with SPACE plus
a nonspacing mark are suppressed:

#00B4;ACUTE ACCENT;Sk;<compat> 0020 0301;;;;;
00B4;ACUTE ACCENT;Sk;;;;;;

This is for consistency with the comparable ASCII characters
(U+0060 GRAVE ACCENT, U+005E CIRCUMFLEX ACCENT, and U+005F LOW LINE)
and to better match legacy collation treatment of these characters.
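
As a rough illustration, the two entry lines quoted above can be read
with a small parser. This is only a sketch: the field layout (code
point; name; General_Category; decomposition; then unused fields) is
inferred from the lines quoted in this message, and Entry and
parse_entry are hypothetical names, not part of any published tool.

```python
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    code_point: int
    name: str
    general_category: str
    tag: Optional[str]        # e.g. "compat"; None when no tag is present
    mapping: Tuple[int, ...]  # decomposition code points; empty if none


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one line in the reduced format quoted in this message.

    Lines starting with '#' are treated as commented out and skipped.
    The field positions are an assumption based on the quoted examples.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(";")
    code_point = int(fields[0], 16)
    name, general_category, decomposition = fields[1], fields[2], fields[3]
    parts = decomposition.split()
    tag = None
    if parts and parts[0].startswith("<") and parts[0].endswith(">"):
        tag = parts[0][1:-1]
        parts = parts[1:]
    mapping = tuple(int(p, 16) for p in parts)
    return Entry(code_point, name, general_category, tag, mapping)


# The commented-out line is ignored; the live line carries no decomposition.
suppressed = parse_entry("#00B4;ACUTE ACCENT;Sk;<compat> 0020 0301;;;;;")
live = parse_entry("00B4;ACUTE ACCENT;Sk;;;;;;")
```

Read this way, U+00B4 ends up with an empty mapping, so it would be
weighted as a character in its own right rather than as SPACE plus
U+0301.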

Other characters, which *should* have compatibility decompositions
but do not, have "<sort>" decompositions added in the input
file:

2E88;CJK RADICAL KNIFE ONE;So;<sort> 5200 F8F0;;;;;
2E89;CJK RADICAL KNIFE TWO;So;<sort> 5202;;;;;
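
A minimal sketch of how such additions could sit on top of the UCD
data, using Python's unicodedata module. The SORT_ADDITIONS table
holds only the two entries quoted above; the helper names are
hypothetical, and the lookup order (additions first, UCD second) is an
assumption about how one might model this, not a description of the
actual tooling.

```python
import unicodedata

# The two "<sort>" entries quoted above; nothing else is assumed.
SORT_ADDITIONS = {
    0x2E88: (0x5200, 0xF8F0),
    0x2E89: (0x5202,),
}


def ucd_mapping(cp):
    """Decomposition mapping from the UCD as code points, dropping any <tag>."""
    parts = unicodedata.decomposition(chr(cp)).split()
    return tuple(int(p, 16) for p in parts if not p.startswith("<"))


def input_file_mapping(cp):
    """Prefer an added "<sort>" mapping; otherwise fall back to the UCD."""
    return SORT_ADDITIONS.get(cp) or ucd_mapping(cp)


def uses_private_use(mapping):
    """True if any code point in the mapping has General_Category Co."""
    return any(unicodedata.category(chr(cp)) == "Co" for cp in mapping)


# U+2E88 has no decomposition in the UCD, so only the addition supplies one.
assert ucd_mapping(0x2E88) == ()
assert input_file_mapping(0x2E88) == (0x5200, 0xF8F0)
assert uses_private_use(input_file_mapping(0x2E88))
```

The private-use check flags mappings such as the one for U+2E88, which
is one reason the input file cannot be treated as normative data.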

> It's an important question because UCA currently only clearly
> says that it only supports canonical (NFD) decompositions and
> says nothing about compatibility decompositions (this is logical
> for localizable collations, but probably not for DUCET).

By the way, it is less confusing, in this context, to talk about
canonical decompositions and compatibility decompositions,
rather than introducing the terms "NFD" and "NFKD", which refer
to normalization forms.

What the UCA states, and the DUCET reflects, is that collation
weights for canonically equivalent sequences should also be
equal.
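
A small check of that invariant, sketched with Python's unicodedata
module. The key function here is a placeholder that is NOT the UCA and
uses no DUCET weights; it only shows that any key computed from the
canonically decomposed form is automatically equal for canonically
equivalent strings. Both function names are hypothetical.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def toy_sort_key(s):
    """Placeholder key built from the NFD form (code points, not UCA weights)."""
    return tuple(ord(ch) for ch in unicodedata.normalize("NFD", s))


# Precomposed versus decomposed e-acute.
assert canonically_equivalent("\u00E9", "e\u0301")
assert toy_sort_key("\u00E9") == toy_sort_key("e\u0301")

# Marks with different combining classes, typed in either order.
assert canonically_equivalent("a\u0323\u0307", "a\u0307\u0323")

# U+00B4 is only compatibility-equivalent to SPACE + U+0301, so no such
# requirement applies to it.
assert not canonically_equivalent("\u00B4", "\u0020\u0301")
```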

A *reasonable* approach for characters which have compatibility
decompositions is to weight them by their compatibility
decompositions, but this is not a *required* principle, as it
is for the canonical decompositions. And not all of the
compatibility decompositions make sense for collation. There is no
particular reason why they should, since they were assigned
in the first place according to different principles, and
definitely not on the premise that using them would produce
an optimal sorting order. Because of this, the definition of
the default table for UCA (and for 14651) makes numerous
adjustments** to compatibility decompositions (as shown in
the examples above), to make them more useful for producing
the desired results for collation weighting, which, after
all, is the *point* of the table in the first place.

--Ken

** Lest people start saying, "My gosh! I thought all
decompositions in UnicodeData.txt were normative and could
not be 'fixed'", let me clarify. The decomposition mapping
field in UnicodeData.txt *is* normative (and immutable),
and the UCA doesn't modify that. The UCA, unlike
normalization, does not claim to be making direct use of
the normative compatibility mappings, either. Instead,
the UCA input file creates its own, ad hoc, equivalences,
to solve its own problem: namely, collation weighting.
Those equivalences, expressed as decompositions, are based
*mostly* on compatibility decomposition mappings from
UnicodeData.txt, but as I have illustrated, are somewhat
arbitrarily extended, omitted, and/or tinkered with, to
produce better collation results.

People who worry about Unicode normalization forms not
quite meeting their needs might want to consider the
collation algorithm as a precedent. Nobody should take the
compatibility decompositions at face value as meeting
all equivalencing needs. Doing so is bound to produce one
or another kind of unexpected result, depending on what
you are doing with them. It is OKAY(tm) to create your
own equivalences between Unicode characters to produce
the desired results for particular processing. What is
NOT OKAY(tm) is to claim that such custom equivalences
are a replacement for the compatibility decomposition
mappings in UnicodeData.txt for the purposes of formal
Unicode normalization (by UAX #15).
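
To make the distinction concrete, here is a minimal sketch of a custom
equivalence kept separate from normalization. KEEP_ATOMIC and
collation_fold are hypothetical names, the exception set contains only
the U+00B4 case discussed in this message, and folding one character
at a time is a simplification chosen for brevity.

```python
import unicodedata

# Characters whose compatibility decomposition is deliberately ignored
# by this custom folding (illustrative: only the U+00B4 example).
KEEP_ATOMIC = {"\u00B4"}


def collation_fold(text):
    """Apply compatibility decompositions except for KEEP_ATOMIC characters.

    This is a private equivalence for one processing task. It does not
    replace, and must not be described as, Unicode normalization.
    """
    pieces = []
    for ch in text:
        if ch in KEEP_ATOMIC:
            pieces.append(ch)
        else:
            pieces.append(unicodedata.normalize("NFKD", ch))
    # A final NFD pass restores canonical ordering of combining marks;
    # it leaves U+00B4 alone because that mapping is compatibility-only.
    return unicodedata.normalize("NFD", "".join(pieces))


# Formal normalization is unchanged and still decomposes U+00B4 ...
assert unicodedata.normalize("NFKD", "\u00B4") == "\u0020\u0301"
# ... while the custom folding keeps it as a single character.
assert collation_fold("\u00B4") == "\u00B4"
# Other compatibility characters fold as usual.
assert collation_fold("\uFB01") == "fi"
```

The two functions coexist: unicodedata.normalize keeps producing the
normative result, and the custom folding is used only where its
different behavior is wanted.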


This archive was generated by hypermail 2.1.5 : Tue May 20 2003 - 21:58:54 EDT