Re: Computing default UCA collation tables

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 20 2003 - 16:11:04 EDT

    Following up on Philippe Verdy's responses to Mark Davis:

    > That's why I wondered how the "allkeys.txt" was really produced,

    As Mark Davis indicated, it is generated by a program (which I
    maintain) called "sifter". It *is* automatically generated
    and not manually edited.
     
    > because it uses some weight ordering that is not documented in
    > the UCA collation rules specification

    *None* of the weights in the allkeys.txt table are documented
    in the UCA collation rules specification per se. All of the
    weights for the Default Unicode Collation Element Table are
    defined in allkeys.txt, which the UCA specification refers to
    by reference. Specific *examples* cited in the text of UCA
    to illustrate the use of the weights are exemplary only -- not
    normative in value -- and the authors of UTS #10, UCA (that's
    Mark Davis and myself) deliberately don't try to update all
    those example numbers each time allkeys.txt is updated, since
    there is no point, and since such manual editing probably would
    induce errors and inconsistencies into the examples.

    > (the only thing that is
    > normative, with "allkeys.txt" being just informative and a
    > correct implementation of the specified rules).

    As Mark indicated, this statement is just flat wrong. A claim
    of conformance to UTS #10 includes a requirement to abide
    by conformance clause C4. That requires specifying a particular
    version of the UTS, and that, in turn, via Clause 3.2, points
    to a particular, associated version of allkeys.txt. And that
    table, in turn, is required to meet the requirements of
    conformance clause C1.

    So while the UCA allows any tailoring you desire, to meet particular
    language requirements, it is still quite clear that the data
    table itself is a normative part of the standard.

    Philippe also seems to have missed the fact that the allkeys.txt
    table is maintained in conjunction with and in synchrony with the
    Common Template Table of the ISO international string ordering
    standard, ISO/IEC 14651. That table is also generated by the
    "sifter" program, and the CTT is clearly labeled normative in
    ISO/IEC 14651.

    > Yes, my message was long, but I wanted to show the many points
    > coming from the analysis of the "allkeys.txt" proposed as an
    > informative reference,

    It is not an "informative reference", but a normative part of
    the standard.

    > and wondered how to simply create a conforming collation,
    > without importing the full text file (which is not only very
    > large for an actual implementation,

    One can preprocess it, as Mark indicated. But one cannot ignore
    it and be compliant with the standard.
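
    For concreteness, here is a minimal sketch of such preprocessing
    (in Python), assuming the published allkeys.txt line syntax of
    that era: up to four hexadecimal weights per collation element,
    with '*' marking variable elements. The function name and the
    in-memory layout are illustrative only, not any particular
    implementation:

    import re

    # One collation element, e.g. [.06D9.0020.0008.0041] or
    # [*0000.0021.0002.0332]; '*' marks a "variable" element.
    CE_RE = re.compile(r'\[([.*])((?:[0-9A-Fa-f]{4}\.?)+)\]')

    def load_allkeys(path):
        """Parse allkeys.txt into a mapping from a tuple of code points
        to a list of collation elements (variable flag plus weights)."""
        table = {}
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.split('#', 1)[0].strip()   # drop comments
                if not line or line.startswith('@'):   # skip blanks, @version
                    continue
                chars, elements = line.split(';', 1)
                key = tuple(int(cp, 16) for cp in chars.split())
                ces = []
                for variable, weights in CE_RE.findall(elements):
                    ws = tuple(int(w, 16) for w in weights.split('.'))
                    ces.append((variable == '*',) + ws)
                table[key] = ces
        return table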

    > but also incomplete with respect to Unicode 4,

    This is known and being addressed for the next revision.

    > and implements some custom tailorings that are NOT described
    > in the UCA reference,

    This reflects a fundamental misunderstanding of the role of the
    allkeys.txt table, and is also simply wrong.

    > still incomplete and probably contains a few inconsistencies,

    Incomplete, yes. But as Mark indicated, the file is regularly
    tested for a large number of consistency issues, each time it
    is updated.

    > proving the fact that this file was edited manually,

    It was not. The "proof" is fallacious.

    What Philippe has demonstrated is a fact well known to those
    who develop, maintain, or review the UCA standard and allkeys.txt:
    the primary order definition and a number of other required
    quirks in ordering depend on a specific input data file, and
    cannot be derived automatically from UnicodeData.txt or any
    other of the UCD data files.

    If Philippe had inquired about its derivation, instead of
    trumpeting his discoveries from reverse engineering, it
    might have been possible to short-circuit a lot of the
    FUD involved in the questions he has raised.

    > and may contain errors or other omissions).

    This is certainly possible.

    >
    > However, analyzing how the table was produced makes it possible
    > to create a simpler "meta"-description of its content, where this
    > file could be generated from a much simpler file (or set of files),

    Ta da! It *is* generated from a simpler set of files.

    > so that such a large table could be more easily maintained
    > (even if there is some manual tailoring for specific scripts,

    Tailoring is a process of changing *from* the default table.
    It does not describe the definition of the primary order (and the
    other particular orders) that go into the generation of the default
    table itself.

    > So although I think that this table MAY be useful for some
    > applications, I still think that it is not usable in the
    > way it is presented.

    Demonstrably false, since it *is* used as presented, by ICU
    and by other implementers of UCA.

    >
    > Also, my previous message clearly demonstrated that this
    > collation table uses some sort of "collation decomposition"
    > which includes some collation elements that can be thought of
    > as "variants" or "letter modifiers" for which there is no
    > corresponding encoding in the normative UCD with an
    > associated normative NFD or NFKD decomposition.

    Again, this demonstration was a "discovery" of things that
    are well known about the input file used for generating
    allkeys.txt. Here's an example piece of the input data:

    <quote>

    # To make the spacing ypogegrammeni work best, it should be
    # equated to the regular iota, rather than to the combining
    # mark.

    037A;GREEK YPOGEGRAMMENI;Lm;<sort> 03B9;;;;;
    # 037A;GREEK YPOGEGRAMMENI;Lm;<compat> 0020 0345;;;;;

    </quote>

    Such modifications of compatibility decompositions (or the
    addition of decompositions for which none exist in
    UnicodeData.txt) are a required and reviewed part of creating
    the input which the sifter then manipulates to generate
    allkeys.txt (and the CTT table for ISO/IEC 14651).
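
    To illustrate the general idea only (this is NOT the sifter; the
    weight values below are made up, and the tertiary adjustment
    simply mirrors the way UCA treats compatibility decompositions):

    # Toy illustration: a character whose input line reads
    # "037A;...;<sort> 03B9" inherits the collation elements of its
    # sort-decomposition target, differing only in tertiary weight.

    base_elements = {
        0x03B9: [(0x1F3C, 0x0020, 0x0002)],   # hypothetical weights for iota
    }

    def elements_for_sort_decomposition(targets, tertiary=0x0004):
        """Reuse each target's primary and secondary weights and
        substitute a distinguishing tertiary weight."""
        result = []
        for cp in targets:
            for (p, s, _t) in base_elements[cp]:
                result.append((p, s, tertiary))
        return result

    # 037A GREEK YPOGEGRAMMENI then sorts together with iota at the
    # primary and secondary levels, differing only at the tertiary level.
    print(elements_for_sort_decomposition([0x03B9]))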

    >
    > The current presentation of this table (with 4 collation
    > weights per collation element) does not ease its implementation,
    > and a simpler presentation with a single weight (selected in a
    > range that clearly indicates which collation level it
    > belongs to) would have been much more useful and much simpler
    > to implement as well.

    As Mark and I have both stated, anyone is free to preprocess
    the allkeys.txt table into whatever form they choose for
    their implementation. However, the current format of the table
    is the result of consensus decision by the Unicode Technical
    Committee, and is unlikely to be changed, since that would
    destabilize it for implementers -- including those who have
    tools to preprocess the current format into whatever format
    they prefer to use.
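
    As one illustration of such preprocessing, here is a minimal
    sketch of the sort key formation step described in UTS #10
    (ignoring variable weighting and the fourth level), working from
    collation elements as parsed in the earlier sketch (a variable
    flag followed by the weights); mapping a string to its collation
    elements (contraction lookup, normalization) is omitted:

    def sort_key(collation_elements, levels=3):
        """Append the non-zero weights of each level in turn,
        with a 0000 separator between levels."""
        key = []
        for level in range(levels):
            if level:
                key.append(0x0000)        # level separator
            for ce in collation_elements:
                w = ce[1 + level]         # ce[0] is the variable flag
                if w:
                    key.append(w)
        return tuple(key)

    # Two strings then compare by ordinary tuple comparison of their
    # sort keys: primary differences dominate secondary differences,
    # and so on.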

    --Ken


