Re: Computing default UCA collation tables

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 20 2003 - 18:50:39 EDT


    From: "Kenneth Whistler" <kenw@sybase.com>
    > Philippe Verdy asked:
    > <quote>
    > 0433;CYRILLIC SMALL LETTER GHE;Ll;;;;0413;;0413
    > 0413;CYRILLIC CAPITAL LETTER GHE;Lu;;;;;0433;
    >
    > # Add a user-defined diacritic, to force treatment as a variant of ghe
    > 0491;CYRILLIC SMALL LETTER GHE WITH UPTURN;Ll;<sort> 0433 F8F1;;;0490;;0490
    > 0490;CYRILLIC CAPITAL LETTER GHE WITH UPTURN;Lu;<sort> 0413 F8F1;;;;0491;
    > </quote>
    >
    > The "<sort>" tag for these decompositions is an ad hoc addition,
    > understood only by the sifter program, to accomplish the weighting
    > as desired. The use of a user-defined character, U+F8F1, to
    > represent the "phantom" diacritic, is also an internal
    > convention for the sifter.

    This confirms what I was trying to do myself by "reverse engineering" the DUCET table, so that it could be represented much more compactly. The idea is to use an extended decomposition table that works like the NFKD decomposition table (which itself extends the NFD table), so that the only thing left to encode is the single weight associated with each decomposed collation element.
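
    As a rough sketch of that idea in Python (this is not the sifter's actual code): the field positions are an assumption read off the quoted lines, where the fourth field carries the tagged decomposition, and the names parse_line, build_table and expand are made up for illustration. Each entry is expanded recursively until only undecomposable elements remain, and those are the only ones that would need a weight.

```python
# The four quoted lines; field index 3 is ASSUMED to hold the tagged decomposition.
LINES = [
    "0433;CYRILLIC SMALL LETTER GHE;Ll;;;;0413;;0413",
    "0413;CYRILLIC CAPITAL LETTER GHE;Lu;;;;;0433;",
    "0491;CYRILLIC SMALL LETTER GHE WITH UPTURN;Ll;<sort> 0433 F8F1;;;0490;;0490",
    "0490;CYRILLIC CAPITAL LETTER GHE WITH UPTURN;Lu;<sort> 0413 F8F1;;;;0491;",
]


def parse_line(line):
    """Return (code point, tag or None, list of decomposition code points)."""
    fields = line.split(";")
    code = int(fields[0], 16)
    parts = fields[3].split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts[0]          # e.g. "<sort>", the ad hoc tag from the quote
        parts = parts[1:]
    return code, tag, [int(p, 16) for p in parts]


def build_table(lines):
    """Keep only the entries that actually decompose."""
    table = {}
    for line in lines:
        code, _tag, decomp = parse_line(line)
        if decomp:
            table[code] = decomp
    return table


def expand(code, table):
    """Recursively expand a code point into undecomposable elements."""
    if code not in table:
        return [code]
    out = []
    for part in table[code]:
        out.extend(expand(part, table))
    return out


TABLE = build_table(LINES)
print([f"{c:04X}" for c in expand(0x0491, TABLE)])  # ['0433', 'F8F1']
```

    With such a table, only U+0433, U+0413 and the phantom diacritic U+F8F1 need weights of their own; the two "ghe with upturn" letters are derived from them.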

    Then I can deduce the level (primary, secondary, ...) of a decomposed collation element directly from its allocated weight, because the weights of each level belong to non-overlapping ranges. These ranges can be chosen so that simple bit operations make the UCA table easier to tailor, for example when using transliterators, or when customizing collation rules for particular non-UCA usages, such as giving letter case a primary level so that uppercase letters always sort after lowercase letters.
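
    To make the range idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the 16-bit layout (top 2 bits tag the level, low 14 bits order weights within the level), the weight values, and the element names "ghe", "de", "upturn" and "upper" are chosen only for illustration and are not DUCET values. Real UCA sort keys also use level separators, which this toy key replaces with one tuple per level.

```python
# HYPOTHETICAL layout: top 2 bits = level tag, low 14 bits = order within level.
LEVEL_SHIFT = 14
VALUE_MASK = (1 << LEVEL_SHIFT) - 1
TERTIARY, SECONDARY, PRIMARY = 1, 2, 3


def make_weight(level, value):
    return (level << LEVEL_SHIFT) | (value & VALUE_MASK)


def level_of(weight):
    """The level is read straight from the weight's range (its top bits)."""
    return weight >> LEVEL_SHIFT


def retag(weight, level):
    """Tailoring by bit operation: move a weight into another level's range."""
    return make_weight(level, weight & VALUE_MASK)


# Illustrative weights for decomposed collation elements (not real data).
WEIGHTS = {
    "ghe": make_weight(PRIMARY, 0x0100),
    "de": make_weight(PRIMARY, 0x0101),
    "upturn": make_weight(SECONDARY, 0x0020),
    "upper": make_weight(TERTIARY, 0x3FFF),
}


def sort_key(elements, weights):
    """One tuple of weights per level, primary first."""
    all_weights = [weights[e] for e in elements]
    return tuple(
        tuple(w for w in all_weights if level_of(w) == level)
        for level in (PRIMARY, SECONDARY, TERTIARY)
    )


CAP_GHE_GHE = ["ghe", "upper", "ghe"]   # capital ghe, small ghe
GHE_DE = ["ghe", "de"]                  # small ghe, small de

# Default: case is tertiary, so the primary letters decide.
print(sort_key(CAP_GHE_GHE, WEIGHTS) < sort_key(GHE_DE, WEIGHTS))    # True

# Tailored: promote the case marker to the primary range.
TAILORED = dict(WEIGHTS, upper=retag(WEIGHTS["upper"], PRIMARY))
print(sort_key(CAP_GHE_GHE, TAILORED) > sort_key(GHE_DE, TAILORED))  # True
```

    In the tailored table the uppercase marker becomes the highest primary weight, so a word whose first letter is uppercase sorts after one whose first letter is the same letter in lowercase, whatever follows.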

    Such an approach may also benefit complex scripts, but I think it has other uses too, notably for searches or localized regular expressions, when NFKD decomposition and case folding are not enough to satisfy user expectations (this is where collation makes sense, independently of its secondary role for ordering).

    Your "sifter" program, and more importantly its "ad hoc" decompositions file, are very interesting, as they give a more synthetic representation of how UCA could ideally be implemented.

    Thanks a lot.



    This archive was generated by hypermail 2.1.5 : Tue May 20 2003 - 19:40:19 EDT