Re: Computing default UCA collation tables

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 20 2003 - 19:27:24 EDT

    From: "Kenneth Whistler" <kenw@sybase.com>
    > Philippe Verdy stated:
    > > After reading the proposed update (version 10) of TR10
    > > (which describes UCA algorithm), I note that it still
    > > contains the same error in section 7.3 (Compatibility decomposition)
    > > when describing the way L3 weights are assigned to the decomposed sequence.
    >
    > Noted for correction.

    Thanks. That at least resolves one ambiguity (though the wording of section 7.3 could probably be made more precise)...

    > > I also note that the allkeys.txt file contains a comment field
    > > starting with "QQ", whose meaning is not
    > > completely clear. It appears when there are canonical or
    > > compatibility decompositions (QQC or QQK), but there is often
    > > a fourth character added whose meaning is not clear (here QQKN).
    >
    > The QQKN entries are compatibility decompositions involving expansions.
    > The QQKM entries are compatibility decompositions with combining
    > sequences involved (just a few Arabic positional forms with harakat).

    Although the comments are only "informative", it would make sense to document them and to verify that they can be relied on, so that we can trust that the tagged entries really do involve such expansions and combining sequences. (This may, however, be only a temporary arrangement that could change later without notice if other requirements arise from ISO14651.)
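
    To check how consistently those tags are applied, one could tally them over the comment field. The following is only a rough sketch: the function name tally_qq_tags, the assumed comment layout ("# NAME; QQxx") and the sample lines are mine, and the sample uses private-use code points with placeholder weights, not real DUCET data.

```python
import re
from collections import Counter

# Assumed layout: the tag follows the character name in the "#" comment.
QQ_TAG = re.compile(r"\bQQ[A-Z]+\b")

def tally_qq_tags(lines):
    """Count the QQ diagnostic tags found in the comment part of each line."""
    counts = Counter()
    for line in lines:
        _, sep, comment = line.partition("#")
        if not sep:
            continue  # no comment field on this line
        match = QQ_TAG.search(comment)
        if match:
            counts[match.group(0)] += 1
    return counts

# Placeholder entries (private-use code points, made-up weights).
sample = [
    "E000 ; [.0A15.0020.0002.E000] # EXAMPLE BASE LETTER",
    "E001 ; [.0A15.0020.0004.E001][.0000.0153.0004.E001] # EXAMPLE EXPANSION; QQKN",
    "E002 ; [.0A20.0020.0018.E002][.0000.0040.0018.E002] # EXAMPLE COMBINING FORM; QQKM",
    "E003 ; [.0A15.0020.0004.E003][.0A16.0020.0004.E003] # EXAMPLE EXPANSION; QQKN",
]
tags = tally_qq_tags(sample)
```

    With the placeholder lines above, tags counts two QQKN entries and one QQKM entry, and the untagged line is ignored.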

    > > Was it for debugging purposes, to see which type of "compatible"
    > > decomposition is performed,
    >
    > Yes, although it is for diagnostics, rather than debugging, per se.
    > The "QQ" is an otherwise unused letter sequence that makes it easy
    > to grep through allkeys.txt to collect together the weighted keys
    > of various types to look for consistency issues.
    >
    > Input source of relevance:
    > 017F;LATIN SMALL LETTER LONG S;Ll;<sort> 0073 F8F1;;;0053;;0053
    > (...)
    > 00DF;LATIN SMALL LETTER SHARP S;Ll;<sort> 0073 F8F0 0073;;German;;;
    >
    > > Where can we find those extra decompositions that are clearly
    > > used in UCA?
    >
    > In the input source file I have been citing.

    You quote only a few entries from your file. Do you mean that I still need to discover the rest by careful analysis of "allkeys.txt"?
    Or is your input file published somewhere (even though it is not normative, since it contains "private use" characters)?

    I think I can easily find those decompositions by looking for every secondary weight that is used in collation elements with a non-null primary weight but that has no corresponding combining character (one with the same secondary weight and a null primary weight). For now, this covers all weights starting just after the last Unicode combining character in DUCET (between 0x153 and 0x200, though I think the range currently stops at around 0x16A in DUCET version 3.1.1).
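
    The search described above could be sketched roughly as follows. This is only one reading of it: I treat an entry as a "combining character" when all of its collation elements have a null primary weight, and I skip the common secondary weight, assumed here to be 0x0020. The function name orphan_secondaries and all weights in the sample table are placeholders, not values taken from DUCET.

```python
def orphan_secondaries(table, common_secondary=0x0020):
    """Map each secondary weight that has no combining character of its own
    to the set of code points whose keys use it.

    table maps a code point to a list of (primary, secondary, tertiary).
    """
    # Secondary weights owned by entries made only of null-primary elements.
    known = set()
    for elems in table.values():
        if all(p == 0 for p, _, _ in elems):
            known.update(s for _, s, _ in elems if s != 0)

    orphans = {}
    for cp, elems in table.items():
        if all(p == 0 for p, _, _ in elems):
            continue  # this entry is itself a combining character
        for _, s, _ in elems:
            if s in (0, common_secondary) or s in known:
                continue
            orphans.setdefault(s, set()).add(cp)
    return orphans

# Placeholder table: the weights are illustrative only.
table = {
    0x0300: [(0x0000, 0x0035, 0x0002)],                    # combining mark
    0x0041: [(0x0A15, 0x0020, 0x0008)],                    # plain letter
    0x00C0: [(0x0A15, 0x0020, 0x0008), (0x0000, 0x0035, 0x0002)],
    0x00DF: [(0x0B00, 0x0020, 0x0004), (0x0000, 0x0153, 0x0004),
             (0x0B00, 0x0020, 0x0004)],
}
found = orphan_secondaries(table)
```

    In this placeholder table only the weight 0x0153 is reported, attached to 00DF, because 0x0035 is already owned by the combining mark 0300.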

    This should yield no more than about 200 additional decompositions, I think (with DUCET 3.1.1), plus some still undocumented decompositions for scripts added in Unicode 3.2 and 4.0.

    Your input file appears simply to reuse the UCD format, and in fact I wanted to write a complementary file that could be processed by the same code I wrote to produce the NFD and NFKD tables. The difference is that I have to reverse engineer the table to produce a "template" file with unnamed characters. I started writing it using a custom tag <collate> with exactly the same function as the <sort> tag in your file...
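
    As an illustration of reading such tagged entries, here is a minimal sketch that picks out the first field of the form "<tag> XXXX XXXX ..." from a semicolon-separated line. It does not rely on a field position, since the quoted lines are abbreviated; the function name parse_tagged_decomposition is hypothetical, and the only test data are the two lines quoted earlier in this message.

```python
import re

# A tagged decomposition field: "<tag>" followed by hexadecimal code points.
TAGGED = re.compile(r"^<([A-Za-z]+)>((?:\s+[0-9A-Fa-f]{4,6})+)$")

def parse_tagged_decomposition(line, tags=("sort", "collate")):
    """Return (code point, tag, [decomposition]) or None if no tagged field."""
    fields = line.strip().split(";")
    code = int(fields[0], 16)
    for field in fields[1:]:
        match = TAGGED.match(field.strip())
        if match and match.group(1) in tags:
            mapping = [int(cp, 16) for cp in match.group(2).split()]
            return code, match.group(1), mapping
    return None  # untagged (canonical) or absent decomposition

long_s = parse_tagged_decomposition(
    "017F;LATIN SMALL LETTER LONG S;Ll;<sort> 0073 F8F1;;;0053;;0053")
sharp_s = parse_tagged_decomposition(
    "00DF;LATIN SMALL LETTER SHARP S;Ll;<sort> 0073 F8F0 0073;;German;;;")
```

    Here long_s is (0x017F, "sort", [0x0073, 0xF8F1]) and sharp_s is (0x00DF, "sort", [0x0073, 0xF8F0, 0x0073]); a line with no tagged field gives None.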

    So my question is: do your "ad hoc" decompositions use **ALL** NFKD decompositions (in addition to the "ad hoc" collation decompositions), or are there exceptions that need to be decomposed differently in your file? It's an important question because UCA currently states clearly only that it supports canonical (NFD) decompositions, and says nothing about compatibility decompositions (this is logical for localizable collations, but probably not for DUCET).


