Re: Computing default UCA collation tables

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 20 2003 - 18:01:17 EDT


    Philippe Verdy asked:

    > I still wonder why some entries are commented out in the file
    > and substituted by another definition on the next line

    The reason is that the principal reviewers (in the UTC for
    the Unicode Collation Algorithm, and in SC22/WG20 for ISO/IEC 14651)
    require certain orderings in the default table for particular
    instances. The requirements for the final ordering drive back
    into requirements that certain characters be "marked up" in
    the input file, so that the automatic weight generation by
    sifter will produce the results expected. Let me give you
    another example:

    <quote>
    0433;CYRILLIC SMALL LETTER GHE;Ll;;;;0413;;0413
    0413;CYRILLIC CAPITAL LETTER GHE;Lu;;;;;0433;

    # Add a user-defined diacritic, to force treatment as a variant of ghe
    0491;CYRILLIC SMALL LETTER GHE WITH UPTURN;Ll;<sort> 0433 F8F1;;;0490;;0490
    0490;CYRILLIC CAPITAL LETTER GHE WITH UPTURN;Lu;<sort> 0413 F8F1;;;;0491;
    </quote>

    The requirement that U+0491 CYRILLIC SMALL LETTER GHE WITH UPTURN
    sort as a secondary variant of U+0433 CYRILLIC SMALL LETTER GHE
    was established by Russian sorting conventions and was then formally
    conveyed to SC22/WG20 as a requirement for the Common Template
    Table of ISO/IEC 14651. To accommodate that formal requirement,
    the input for the sifter has to be marked up with an ad hoc
    "decomposition" for U+0491, treating the "upturn" as if it
    were a diacritic, even though no such diacritic is encoded
    in the Unicode Standard as a separate combining mark, and even
    though U+0491 has no decomposition in UnicodeData.txt.
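
    To make "secondary variant" concrete, here is a minimal Python
    sketch of the effect such a decomposition has on ordering. The
    weight values and the names WEIGHTS and sort_key are invented for
    illustration; they are not taken from allkeys.txt or from the
    sifter, and the key construction is a simplified reading of the
    multi-level comparison described in the UCA.

```python
# Hypothetical (primary, secondary, tertiary) collation elements.
# These numbers are made up for illustration, NOT real DUCET weights.
WEIGHTS = {
    "\u0430": [(0x0F, 0x20, 0x02)],                      # CYRILLIC SMALL LETTER A
    "\u0433": [(0x10, 0x20, 0x02)],                      # CYRILLIC SMALL LETTER GHE
    # GHE WITH UPTURN: same primary as GHE, followed by an element that
    # has no primary weight and a distinct secondary weight (the
    # "phantom" diacritic).
    "\u0491": [(0x10, 0x20, 0x02), (0x00, 0x30, 0x02)],
    "\u0434": [(0x11, 0x20, 0x02)],                      # CYRILLIC SMALL LETTER DE
}


def sort_key(text):
    """Build a simplified multi-level key: all primaries, then all
    secondaries, then all tertiaries, skipping zero weights."""
    elements = [ce for ch in text for ce in WEIGHTS[ch]]
    return tuple(
        tuple(ce[level] for ce in elements if ce[level] != 0)
        for level in range(3)
    )


words = ["\u0491\u0430", "\u0434\u0430", "\u0433\u0430"]
ordered = sorted(words, key=sort_key)
```

    With these assumed weights, the word starting with U+0491 sorts
    directly after the one starting with U+0433 and before the one
    starting with U+0434, and any primary difference later in a string
    still outranks the ghe/ghe-with-upturn distinction.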

    The "<sort>" tag for these decompositions is an ad hoc addition,
    understood only by the sifter program, to accomplish the weighting
    as desired. The use of a user-defined character, U+F8F1, to
    represent the "phantom" diacritic, is also an internal
    convention for the sifter.
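
    For readers who want to experiment with lines in the quoted
    format, the following Python sketch splits one entry into fields
    and pulls out the tagged decomposition. It is not the sifter and
    implies nothing about how the sifter is written; the names Entry,
    parse_entry and is_private_use are hypothetical, and the field
    positions are assumed from the four quoted lines only.

```python
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    code_point: int
    name: str
    category: str
    tag: Optional[str]               # e.g. "sort", or None if untagged
    decomposition: Tuple[int, ...]   # code points listed after the tag


def parse_entry(line: str) -> Entry:
    """Parse one semicolon-separated line in the quoted layout.

    Assumption: field 0 is the code point, field 1 the name, field 2
    the general category, and field 3 an optional "<tag> XXXX XXXX"
    decomposition, as in the fragment quoted above.
    """
    fields = line.strip().split(";")
    parts = fields[3].split()
    tag = None
    if parts and parts[0].startswith("<") and parts[0].endswith(">"):
        tag = parts[0][1:-1]
        parts = parts[1:]
    return Entry(
        code_point=int(fields[0], 16),
        name=fields[1],
        category=fields[2],
        tag=tag,
        decomposition=tuple(int(p, 16) for p in parts),
    )


def is_private_use(code_point: int) -> bool:
    """True for the BMP Private Use Area, U+E000..U+F8FF."""
    return 0xE000 <= code_point <= 0xF8FF


entry = parse_entry(
    "0491;CYRILLIC SMALL LETTER GHE WITH UPTURN;Ll;<sort> 0433 F8F1;;;0490;;0490"
)
plain = parse_entry("0433;CYRILLIC SMALL LETTER GHE;Ll;;;;0413;;0413")
```

    Here entry.tag is "sort" and entry.decomposition holds U+0433
    followed by the private-use U+F8F1, while the unmarked GHE line
    yields no tag and an empty decomposition.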

    > (does "sifter" have known limitations that are manually edited
    > after generation from the undescribed source file containing
    > those simple hidden decompositions?).

    No.

    >
    > The quoted format you describe by a small fragment closely
    > resembles the one in the UCD. Maybe you don't want to disclose
    > it because it uses some internal decomposition to private characters,

    Correct, as shown above. It also has a number of other private
    conventions, such as markup to force weighting of certain
    combinations as contractions, and so on.

    > but why then did I not find any reference to the other
    > ISO/IEC 14651 standard you refer to in the description of DUCET?
    > Maybe Unicode does not define this default UCA collation order
    > itself, and does not have the authorization from ISO/IEC to
    > reproduce their ongoing work. This may explain why Unicode.org
    > only publishes a derived file...

    Dream on. The table for ISO/IEC 14651 is produced by the *same*
    program, the sifter, and is provided by me directly *to* the editor of
    ISO/IEC 14651, for incorporation in that standard.

    Both the UTC and SC22/WG20 provide requirements for and feedback
    on the content of the default tables for the two, coordinated
    standards. Those two committees both have their say on details
    of the default ordering, and then, through the input file,
    its markup, and the sifter, I generate the two tables to their
    specification. That process, by the way, is acknowledged
    and agreed to by both committees, as a way to guarantee that
    the default weighting for both tables is synchronized, even
    though they use entirely different formats to express the
    weighting.

    >
    > Well I must admit that we can postprocess the "allkeys.txt",
    > but this is hardly possible without "reverse engineering" it,
    > i.e. analyzing how it is structured. I hope that you are not
    > saying that such "reverse engineering" of the table is not legal

    Nope. Feel free to reverse engineer away to your heart's content.

    Just don't expect to be congratulated for "discovering" things
    about the table that are well-known to the maintainers of the
    two standards and which are reflected in the input to the sifter
    and in the weighting algorithms used by the sifter itself.
     
    > (because of an "implicit" restriction by ISO/IEC 14651 which
    > was not referred to in the UCA document), because the UCA reference
    > gives many hints to allow implementors to produce a
    > compressed form of the table published by Unicode according
    > to its royalty-free usage terms.
    >
    > My intent was not to formulate criticisms about the UCA
    > algorithm itself, but about the way TR10 describes the DUCET
    > table (possibly because its wording is ambiguous and does not
    > seem to specify clearly that DUCET should be normative, given
    > that the whole text of UCA clearly speaks about a more general
    > algorithm, with variable weights that can be easily changed in
    > many places, and provides a lot of tuning parameters for
    > implementations as well as for language-specific tailoring,

    Correct. And the intent is that implementers are free to tailor
    the table to get the results they need for particular languages.
    And they can also implement the various shortcuts and tricks
    indicated, to keep the generated keys more compact, and so on.
     
    > confirmed also in the fact that the text of TR10 does not match
    > with the DUCET table content).

    Some of which doesn't matter at all. But I agree that you
    turned up a confusing mismatch in Section 7.3, where the collation
    element values should be updated. That should be corrected in
    the v10 Proposed Update to the UCA.

    >
    > So I really read the description of DUCET as ONE possible
    > implementation of the UCA algorithm, and not as THE reference.

    Perhaps the language of UCA needs to be updated as well, to make
    that clearer.

    > Also I have read posts in this newsgroup about a candidate
    > v10 for an updated version of the existing UCA TR10 v9
    > reference document. I thought that after posting this revision,
    > you expected comments about it,

    We do...

    > and that's why I wrote about it, i.e.
    > the way I had read it.

    but instead of sending long, rambling analyses to the unicode
    list, including many mistaken assertions of fact about the
    standard, you can contact the authors of the document directly
    (our email addresses are in the header of the document) for
    clarifications of intent about the document, and then
    provide (succinct) feedback through the Unicode reporting form:

    http://www.unicode.org/reporting.html

    noting your feedback as a "Technical Report or Tech Note issue",
    so that the feedback can be properly archived, routed, and
    attended to by the UTC and the authors of the document.

    --Ken


