Re: Computing default UCA collation tables

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 20 2003 - 10:55:23 EDT

    From: "Mark Davis" <mark.davis@jtcsv.com>
    To: "Philippe Verdy" <verdy_p@wanadoo.fr>; <unicode@unicode.org>
    Sent: Tuesday, May 20, 2003 2:14 AM
    Subject: Re: Computing default UCA collation tables

    > This is a very long document; I only have a few brief comments.
    >
    > 1. Much of the UCA is derivable from a simpler set of orderings and
    > rules. The format of the table, however, is intended to make it usable
    > without having a complex set of rules for derivation.
    >
    > 2. However, you or anyone else could make a modified version which was
    > simplified in one way or another. For example, ICU preprocesses the
    > table to reduce the size of sort keys (see the ICU design docs if you
    > are curious: oss.software.ibm.com/icu/). There are other ways that
    > someone could preprocess the table. For example, you could also drop
    > all those characters whose weights are computable from their NFKD
    > form, for example, and then compute them at runtime.
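
As a rough illustration of the runtime approach mentioned in point 2, the sketch below stores weights only for a few base characters and derives the collation elements of any other character from its NFKD form. The table `BASE_WEIGHTS`, its weight values, and the function name `collation_elements` are assumptions made up for this example; they are not taken from "allkeys.txt". The sketch also ignores the tertiary-weight adjustments that a real implementation would likely need for compatibility decompositions.

```python
import unicodedata

# Hypothetical mini-table: (primary, secondary, tertiary) weights per character.
# The values are illustrative only and are NOT copied from allkeys.txt.
BASE_WEIGHTS = {
    "a": [(0x0A15, 0x0020, 0x0002)],
    "e": [(0x0A5D, 0x0020, 0x0002)],
    "\u0301": [(0x0000, 0x0032, 0x0002)],  # COMBINING ACUTE ACCENT
}


def collation_elements(ch):
    """Return the collation elements for one character.

    Characters present in the table are looked up directly; any other
    character is decomposed with NFKD and the elements of its parts are
    concatenated. A KeyError is raised if the character cannot be resolved.
    """
    if ch in BASE_WEIGHTS:
        return list(BASE_WEIGHTS[ch])
    decomposed = unicodedata.normalize("NFKD", ch)
    if decomposed == ch:
        raise KeyError(f"no weights for U+{ord(ch):04X}")
    elements = []
    for part in decomposed:
        elements.extend(BASE_WEIGHTS[part])
    return elements


# U+00E9 (e with acute) is absent from the table, so it is computed at runtime.
print(collation_elements("\u00e9"))
```

With such a scheme, only characters that have no usable decomposition need an explicit entry in the table.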

    As far as I know, ICU does not preprocess "allkeys.txt"; it uses its own "FractionalUCA.txt" file, probably edited manually after some intermediate parsing. That file is also currently derived from Unicode 3.2, because "allkeys.txt" had not yet been created for Unicode 4, probably because such processing is difficult or impossible to automate.

    That is why I wondered how "allkeys.txt" was really produced: it uses a weight ordering that is not documented in the UCA collation rules specification (the only normative part, "allkeys.txt" being merely informative and a correct implementation of the specified rules).

    > 3. Scattered in and among your analysis are points where you believe
    > there is an error. I'd like to emphasize again that the UTC does not
    > consider arbitrary email on the mailing lists on its agenda. If there
    > are items that you would like to see considered, you can extract them
    > (and their justification) from this document, and use the feedback
    > mechanism on the Unicode site to submit them.

    Yes, my message was long, but I wanted to show the many points that come out of analyzing "allkeys.txt", which is proposed as an informative reference, and I wondered how to create a conforming collation simply, without importing the full text file. That file is very large for an actual implementation and incomplete with respect to Unicode 4. It also implements some custom tailorings that are NOT described in the UCA reference, and it probably contains a few inconsistencies, which shows that the file was edited manually and may contain errors or other omissions.

    However, analyzing how the table was produced makes it possible to create a simpler "meta"-description of its content, so that the file could be generated from a much simpler file (or set of files) and such a large table could be maintained more easily (even if some manual tailoring remains for specific scripts, or for scripts that still have no coherent collation order, such as Han).

    So although I think that this table MAY be useful for some applications, I still think that it is not usable in the form in which it is presented.

    Also, my previous message clearly demonstrated that this collation table uses a sort of "collation decomposition" that includes some collation elements which can be thought of as "variants" or "letter modifiers", and for which there is no corresponding encoding in the normative UCD with an associated normative NFD or NFKD decomposition.

    The current presentation of this table (with 4 collation weights per collation element) does not ease its implementation. A simpler presentation with a single weight (selected in a range that clearly indicates which collation level it belongs to) would have been much more useful and much simpler to implement.
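
To make the single-weight idea above concrete, here is a minimal sketch in which each collation element is one integer and the numeric range it falls in tells which level it belongs to. The range boundaries `TERTIARY_MAX` and `SECONDARY_MAX`, the sample weights, and the function names are all hypothetical choices for this example, not values defined by the UCA or by "allkeys.txt".

```python
# Hypothetical range boundaries, chosen only for illustration:
#   0x0001..0x001F  -> tertiary weights
#   0x0020..0x01FF  -> secondary weights
#   0x0200 and up   -> primary weights
TERTIARY_MAX = 0x001F
SECONDARY_MAX = 0x01FF


def level_of(weight):
    """Return the collation level (1, 2 or 3) implied by a single weight."""
    if weight <= 0:
        raise ValueError("weights must be positive")
    if weight <= TERTIARY_MAX:
        return 3
    if weight <= SECONDARY_MAX:
        return 2
    return 1


def sort_key(weights):
    """Build a sort key from a flat list of single weights.

    Weights are grouped by level; a 0 separator (lower than any weight)
    is placed between the levels, so primary differences dominate
    secondary ones, which in turn dominate tertiary ones.
    """
    levels = {1: [], 2: [], 3: []}
    for w in weights:
        levels[level_of(w)].append(w)
    return tuple(levels[1]) + (0,) + tuple(levels[2]) + (0,) + tuple(levels[3])


# Illustrative weight lists: a plain letter, and the same letter plus an accent.
plain = [0x0A15, 0x0020, 0x0002]
accented = [0x0A15, 0x0020, 0x0002, 0x0032, 0x0002]
print(sort_key(plain) < sort_key(accented))
```

With this layout a table entry is just a flat list of integers, and no per-element structure of several fields has to be stored or parsed.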


