Re: Computing default UCA collation tables

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 20 2003 - 20:58:38 EDT

    Philippe Verdy asked:

    > > The QQKN entries are compatibility decompositions involving expansions.
    > > The QQKM entries are compatibility decompositions with combining
    > > sequences involved (just a few Arabic positional forms with harakat).
    >
    > Although comments are "informative", it would make sense to explain
    > and verify that they can be considered as such, so that we can trust
    > that they involve such expansions and combining sequences

    That is certainly feasible, although low priority. Another
    possibility is that if these comments are causing confusion, they
    can just be suppressed in the next version of allkeys.txt.
     
    > (however this may just be a temporary solution that may change
    > later without notice if other requirements need to be satisfied
    > for ISO 14651).

    It has nothing to do with ISO/IEC 14651 needs. No such comments
    are printed in the 14651 table; the CTT table has a completely
    separate set of comments, not on a per-character basis.

    > > In the input source file I have been citing.
    >
    > You just quote a few entries from your file. Do you mean I
    > still need to discover them through careful analysis of
    > "allkeys.txt"?
    > Or is your input file published somewhere (even though
    > it is not normative, as it contains "private use" characters)?

    In the past, versions have been made available to the encoding
    committees reviewing the final tables.

    John Cowan also inquired about this. I suppose if enough people
    feel it is useful, the input file could be published as part
    of the UCA (as a link from the text), with the appropriate
    caveats. There isn't much that people could do with it,
    however, except look at it to figure out how the normative
    table (allkeys.txt) got some of the weight values it did.

    > Your input file just appears to reuse the same format as UCD,

    Yes, the entries for the input file are extracted from the UCD,
    omitting irrelevant fields. They then need to be further
    processed, to establish the primary ordering and to deal
    with all the special cases.
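
    Just to illustrate the idea (this is not the actual tooling, and
    the file names are invented), the extraction step amounts to
    nothing more than keeping a few UnicodeData.txt fields and padding
    out the rest to the reduced 9-field format shown below:

    # Sketch only: keep code point, name, general category, and the
    # decomposition field from UnicodeData.txt. File names invented.
    def extract(ucd_path="UnicodeData.txt", out_path="collation-input.txt"):
        with open(ucd_path, encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                fields = line.rstrip("\n").split(";")
                if len(fields) < 6:
                    continue          # skip malformed or empty lines
                code, name, gc, decomp = fields[0], fields[1], fields[2], fields[5]
                dst.write(";".join([code, name, gc, decomp] + [""] * 5) + "\n")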

    > So my question is: do your "ad hoc" decompositions use
    > **ALL** NFKD decompositions (in addition to "ad hoc" collation
    > decompositions), or are there some possible exceptions
    > that need to be decomposed differently in your file?

    There are exceptions. You saw an example in the PROSGEGRAMMENI
    I cited earlier, but, for example, all the compatibility
    decompositions involving the equating of spacing diacritics
    to SPACE plus nonspacing marks are suppressed:

    #00B4;ACUTE ACCENT;Sk;<compat> 0020 0301;;;;;
    00B4;ACUTE ACCENT;Sk;;;;;;

    This is for consistency with the comparable ASCII characters
    (U+0060 GRAVE ACCENT, U+005E CIRCUMFLEX ACCENT, and U+005F LOW LINE)
    and to better match legacy collation treatment of these characters.
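
    In terms of the input file, that suppression is nothing more than
    blanking out the decomposition field for those entries; roughly
    like this (the real set of suppressions is hand-curated, so the
    condition here is only indicative):

    def suppress_spacing_diacritic(line):
        # Drop "<compat> 0020 ..." decompositions on spacing diacritics
        # (Sk), so they are weighted as symbols in their own right
        # rather than as SPACE plus a nonspacing mark.
        fields = line.rstrip("\n").split(";")
        if fields[2] == "Sk" and fields[3].startswith("<compat> 0020 "):
            fields[3] = ""
        return ";".join(fields)

    # suppress_spacing_diacritic("00B4;ACUTE ACCENT;Sk;<compat> 0020 0301;;;;;")
    # -> "00B4;ACUTE ACCENT;Sk;;;;;;"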

    Other characters, which *should* have compatibility decompositions
    but do not, have "<sort>" decompositions added in the input
    file:

    2E88;CJK RADICAL KNIFE ONE;So;<sort> 5200 F8F0;;;;;
    2E89;CJK RADICAL KNIFE TWO;So;<sort> 5202;;;;;

    > It's an important question, because the UCA currently only
    > clearly says that it supports canonical (NFD) decompositions and
    > says nothing about compatibility decompositions (this is logical
    > for localizable collations, but probably not for DUCET).

    By the way, it is less confusing, in this context, to talk about
    canonical decompositions and compatibility decompositions,
    without introducing the terms "NFD" and "NFKD", which refer
    to normalization forms.

    What the UCA states, and the DUCET reflects, is that collation
    weights for canonically equivalent sequences should also be
    equal.
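
    Since the algorithm decomposes its input canonically before
    weighting, that property falls out automatically. A toy,
    stdlib-only illustration (the weight values are invented for the
    example):

    import unicodedata

    # Invented weights, keyed by code point, just to show the principle.
    WEIGHTS = {"E": (0x1CAA, 0x0020, 0x0008), "\u0301": (0x0000, 0x0024, 0x0002)}

    def weight_sequence(s):
        # Decompose canonically first; canonically equivalent strings
        # then yield identical sequences of collation elements.
        return [WEIGHTS[c] for c in unicodedata.normalize("NFD", s)]

    assert weight_sequence("\u00C9") == weight_sequence("E\u0301")  # both are E-acute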

    A *reasonable* approach for characters which have compatibility
    decompositions is to weight them by their compatibility
    decompositions, but this is not a *required* principle, as
    for the canonical decompositions. And not all of the
    compatibility decompositions make sense for collation. There is no
    particular reason why they should, since they were assigned
    in the first place according to different principles, and
    definitely not on the premise that using them would produce
    an optimal sorting order. Because of this, the definition of
    the default table for UCA (and for 14651) makes numerous
    adjustments** to compatibility decompositions (as shown in
    the examples above), to make them more useful for producing
    the desired results for collation weighting, which, after
    all, is the *point* of the table in the first place.
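
    To make concrete what "weighting by the compatibility
    decomposition" means in practice: the character is given the
    collation elements of its decomposition, usually with a tertiary
    difference so that it still sorts near, but distinct from, the
    plain form. Illustrative values only:

    # Invented weights: U+00B2 SUPERSCRIPT TWO gets the primary and
    # secondary weights of '2', but a variant tertiary weight, so it
    # groups with '2' while remaining distinguishable at the tertiary level.
    DIGIT_TWO = (0x1C3C, 0x0020, 0x0002)
    SUPERSCRIPT_TERTIARY = 0x0014

    def weight_by_decomposition(base_elements, tertiary):
        # Expand to the elements of the decomposition, replacing the
        # tertiary weight on each element.
        return [(p, s, tertiary) for (p, s, _t) in base_elements]

    # U+00B2 SUPERSCRIPT TWO, decomposition <super> 0032:
    superscript_two = weight_by_decomposition([DIGIT_TWO], SUPERSCRIPT_TERTIARY)
    # -> [(0x1C3C, 0x0020, 0x0014)]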

    --Ken

    ** Lest people start saying, "My gosh! I thought all
    decompositions in UnicodeData.txt were normative and could
    not be 'fixed'", let me clarify. The decomposition mapping
    field in UnicodeData.txt *is* normative (and immutable),
    and the UCA doesn't modify that. The UCA, unlike
    normalization, does not claim to be making direct use of
    the normative compatibility mappings, either. Instead,
    the UCA input file creates its own, ad hoc, equivalences,
    to solve its own problem: namely, collation weighting.
    Those equivalences, expressed as decompositions, are based
    *mostly* on compatibility decomposition mappings from
    UnicodeData.txt, but as I have illustrated, are somewhat
    arbitrarily extended, omitted, and/or tinkered with, to
    produce better collation results.

    People who worry about Unicode normalization forms not
    quite meeting their needs might want to consider the
    collation algorithm as a precedent. Nobody should take the
    compatibility decompositions at face value as meeting
    all equivalencing needs. Doing so is bound to produce one
    or another kind of unexpected result, depending on what
    you are doing with them. It is OKAY(tm) to create your
    own equivalences between Unicode characters to produce
    the desired results for particular processing. What is
    NOT OKAY(tm) is to claim that such custom equivalences
    are a replacement for the compatibility decomposition
    mappings in UnicodeData.txt for the purposes of formal
    Unicode normalization (by UAX #15).


