Fw: Computing default UCA collation tables

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 20 2003 - 17:22:16 EDT


    From: "Kenneth Whistler" <kenw@sybase.com>
    > Following up on Philippe Verdy's responses to Mark Davis:
    > > That's why I wondered how the "allkeys.txt" was really produced,
    > As Mark Davis indicated, it is generated by a program (which I
    > maintain) called "sifter". It *is* automatically generated
    > and not manually edited.

    Thanks for your information.

    I still wonder why some entries are commented out in the file and replaced by another definition on the next line. Does "sifter" have known limitations that are corrected by hand after generation from the undescribed source file containing those simple hidden decompositions?

    The format you quote in a small fragment closely resembles the one in the UCD. Maybe you don't want to disclose it because it uses some internal decompositions to private characters, but then why did I not find, in the description of DUCET, any reference to the ISO/IEC 14651 standard you mention? Maybe Unicode does not define this default UCA collation order itself, and does not have authorization from ISO/IEC to reproduce their ongoing work. This may explain why Unicode.org only publishes a derived file...

    Well, I must admit that we can postprocess "allkeys.txt", but this is hardly possible without "reverse engineering" it, i.e. analyzing how it is structured. I hope you are not saying that such "reverse engineering" of the table is not legal (because of an "implicit" restriction from ISO/IEC 14651, which is not referred to in the UCA document), because the UCA reference gives many hints that allow implementors to produce a compressed form of the table published by Unicode under its royalty-free usage terms. (Maybe I can find reference data from ISO/IEC 14651 somewhere else; I will search, but I won't be able to pay the hundreds of dollars needed to get a normative printed document from ISO...)
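
    As an illustration of the kind of postprocessing discussed here, the following is a minimal Python sketch, not the way "sifter" or any existing implementation works. It assumes entries of the general shape "code points ; [.w1.w2...] # comment", with '*' instead of '.' marking a variable element, '#' starting a comment, and '@' starting a directive line. The names Entry and parse_line are hypothetical, and the sample line below is made up for illustration rather than copied from a published table.

```python
import re
from typing import NamedTuple, Optional, Tuple

# One collation element, e.g. "[.0A15.0020.0008.0041]" or "[*0209.0020.0002.0020]".
# Assumption: weights are hexadecimal fields separated by '.'.
_ELEMENT = re.compile(r"\[([.*])([0-9A-Fa-f.]+)\]")


class Entry(NamedTuple):
    code_points: Tuple[int, ...]  # more than one code point means a contraction
    elements: Tuple[Tuple[bool, Tuple[int, ...]], ...]  # (is_variable, weights)


def parse_line(line: str) -> Optional[Entry]:
    """Parse one table line; return None for comments, directives and blanks."""
    # Everything after '#' is a comment, so commented-out entries are skipped.
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@") or ";" not in data:
        return None
    left, right = data.split(";", 1)
    code_points = tuple(int(cp, 16) for cp in left.split())
    elements = tuple(
        (marker == "*", tuple(int(w, 16) for w in weights.split(".")))
        for marker, weights in _ELEMENT.findall(right)
    )
    if not code_points or not elements:
        return None
    return Entry(code_points, elements)


if __name__ == "__main__":
    sample = "0041 ; [.0A15.0020.0008.0041] # LATIN CAPITAL LETTER A"
    print(parse_line(sample))
```

    Once the entries are in this form, a compressed table can be built from the tuples without keeping the text file around. Whether a given commented-out entry should be ignored or restored is a separate question that this sketch does not try to answer.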

    My intent was not to criticize the UCA algorithm itself, but the way TR10 describes the DUCET table. Its wording seems ambiguous to me and does not clearly specify that DUCET should be normative: the whole text of UCA clearly speaks about a more general algorithm, with variable weights that can easily be changed in many places, and it provides a lot of tuning parameters for implementations as well as for language-specific tailoring. This reading is also supported by the fact that the text of TR10 does not match the content of the DUCET table.

    So I really read the description of DUCET as ONE possible implementation of the UCA algorithm, and not as THE reference.

    Also, I have read posts in this newsgroup about a candidate v10 to update the existing UCA TR10 v9 reference document. I thought that after posting this revision you expected comments about it, and that's why I wrote about it, i.e. about the way I had read it.

    Sorry if these all seem like "newbie" questions to you. Maybe there will be other questions like this in the future. I just hope that ICU is not the only way to go to implement Unicode, even if it's an open-source implementation. (For simpler projects, I am still looking for other good implementations that are more "modular" than the huge ICU library, without suffering from the limitations or bugs in standard libraries from OS vendors such as Microsoft, or from Sun in Java.)

    I am convinced that simpler implementations are possible that just fit the needs and expectations of users, but still adhere strictly to the Unicode conformance rules, without using outdated tables or algorithms.
