Proposing NCC and NCD (Normalized Collation (De)Composition) forms aligned with UCA

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 15 2003 - 11:52:55 EDT


    Another note: I looked further into the UCA table, and the way it is computed suggests that it really uses a decomposition that has no equivalent in the NFD or even NFKD decompositions, as if variant selectors were suffixed after some characters.
    These assumed variant selectors come in three subsets, whose usage varies per script type:

    For Latin:
    - [0.153.04] is used to append a variant selector to Latin or Runic letters. For Latin, it creates a ligating variant of O in the French ligature oe or OE, or of long s in the German ligature of long s with final s. Note that the isolated long s uses [0.154.04].
    - No such ligating variant selector is inserted for the Latin ae or AE ligatures, even though their UCA collating order sorts them just after a or A with a primary difference, rather than as a ligated form with a secondary difference (and even though few languages consider it a separate letter with a primary difference; most handle it with a tertiary difference, as in French, where it sorts like accents and is ignored when considering only the secondary case difference).
    - [0.154.04] is used for the isolated long s variant (even though in the traditional Latin script the long s form is used in initial or medial position, while the standard s is used in isolated and final position). It is also used for long s with dot above, and for the long s + t ligature (which HAS an NFKD decomposition, unlike the German sharp s, despite its semantics in German).

    For Runic:
    - [0.153.04] through [0.158.04] are used for alternate letter forms (which have distinct names but collate together with no primary difference; they do not play the role of a secondary difference for letter case, but act like accents for tertiary differences).

    For Cyrillic, Arabic, Syriac, Tibetan:
    - [0.153.04] is mostly used as in Runic to create form variants: upturn forms (Cyrillic), Sindhi forms (Arabic), Garshuni and reversed forms (Syriac), fixed forms (Tibetan).

    For Bopomofo:
    - [0.153.04] plays a role similar to the Hiragana/Katakana voiced sound marks, creating "soft consonant" variants, and it is clearly similar to an accent (tertiary) difference.

    For CJK:
    - [0.154.1F] is used after the implicit collation elements, mostly for CJK radicals and some CJK compatibility radicals. This clearly seems to play the role recently given to variant selectors, and Unicode has already started publishing NFKD decompositions for them (a rough sketch of this weight pattern follows).
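
    To make this concrete, here is a minimal Python sketch of how such collation-element sequences could be modeled, with a pseudo variant-selector element appended after the base letter. The weights given to O and E below are invented placeholders, not the real DUCET values; only the trailing [0.153.04] element reproduces the pattern described above.

        # Collation elements modeled as (primary, secondary, tertiary) triples.
        O_ELEMENT   = (0x0F82, 0x0020, 0x0008)  # placeholder for LATIN CAPITAL LETTER O
        E_ELEMENT   = (0x0E8B, 0x0020, 0x0008)  # placeholder for LATIN CAPITAL LETTER E
        LIGATING_VS = (0x0000, 0x0153, 0x0004)  # the "[0.153.04]" pseudo variant selector

        # Hypothetical mapping: the OE ligature collates as O, then the pseudo
        # variant selector, then E, exactly as if a selector were suffixed after O.
        COLLATION_ELEMENTS = {
            "\u0152": [O_ELEMENT, LIGATING_VS, E_ELEMENT],  # LATIN CAPITAL LIGATURE OE
            "O": [O_ELEMENT],
            "E": [E_ELEMENT],
        }

        def collation_elements(text):
            """Concatenate the collation elements of each character (toy model)."""
            out = []
            for ch in text:
                out.extend(COLLATION_ELEMENTS.get(ch, [(0, 0, 0)]))
            return out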

    Then another set of variants is used for decimal digits: to sort them together with the Western (Arabic) digits used in the Latin, Cyrillic and modern Greek scripts, to allow changing only the font style for the Dingbats (but regrettably not for the mathematical font forms in the newly assigned mathematical alphanumeric blocks), or to select a script-specific variant, in a way similar to the "prime" letter modifier in traditional Greek (a small sketch of these values follows the list):

    - [0.159.06.] is a variant selector appended for Dingbat Negative Circled digits
    - [0.15A.06.] is a variant selector appended for Dingbat Circled Sans-Serif digits
    - [0.15B.06.] is a variant selector appended for Dingbat Negative Circled Sans-Serif digits
    - [0.15C.02.] is a variant selector appended for Arabic-Indic digits
    - [0.15D.02.] is a variant selector appended for Extended Arabic digits
    - [0.15E.02.] is a variant selector appended for Ethiopic digits
    - [0.15F.02.] is a variant selector appended for Devanagari digits
    - [0.160.02.] is a variant selector appended for Bengali digits
    - [0.161.02.] is a variant selector appended for Bengali currency numerator digits
    - [0.162.02.] is a variant selector appended for Gurmukhi digits
    - [0.163.02.] is a variant selector appended for Gujarati digits
    - [0.164.02.] is a variant selector appended for Oriya digits
    - [0.165.02.] is a variant selector appended for Tamil digits
    - [0.166.02.] is a variant selector appended for Telugu digits
    - [0.167.02.] is a variant selector appended for Kannada digits
    - [0.168.02.] is a variant selector appended for Malayalam digits
    - [0.169.02.] is a variant selector appended for Thai digits
    - [0.16A.02.] is a variant selector appended for Lao digits
    - [0.16B.02.] is a variant selector appended for Tibetan digits
    - [0.16C.02.] is a variant selector appended for Myanmar digits
    - [0.16D.02.] is a variant selector appended for Khmer digits
    - [0.16E.02.] is a variant selector appended for Mongolian digits
    - [0.16F.02.] is a variant selector appended for Ideographic/Hangzhou digits
    - [0.170.02.] is a variant selector appended for a few Old Italic digits
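
    Here is a small sketch of how these digit variants could be tabulated, reusing the secondary/tertiary pairs listed above; the shared primary weights for the digit values 0..9 are placeholders of my own, not real DUCET values.

        # Pseudo variant selector appended to each digit family, from the list
        # above, as (secondary, tertiary) pairs. Only a few families are shown.
        DIGIT_VARIANT = {
            "DINGBAT NEGATIVE CIRCLED": (0x159, 0x06),
            "ARABIC-INDIC":             (0x15C, 0x02),
            "DEVANAGARI":               (0x15F, 0x02),
            "THAI":                     (0x169, 0x02),
        }

        # Placeholder primary weights shared by all digits of the same value.
        DIGIT_PRIMARY = [0x0CA0 + n for n in range(10)]

        def digit_elements(value, family=None):
            """Collation elements of a decimal digit: the shared primary weight of
            its numeric value, then the per-family variant element (toy model)."""
            elements = [(DIGIT_PRIMARY[value], 0x0020, 0x0002)]
            if family is not None:
                secondary, tertiary = DIGIT_VARIANT[family]
                elements.append((0x0000, secondary, tertiary))
            return elements

        # Example: THAI DIGIT FIVE gets the same primary weight as DIGIT FIVE and
        # differs from it only through the trailing (0, 0x169, 0x02) element.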

    Note that Roman numerals are sorted as letters with a secondary difference, but they do not use the positional decimal system, rather an additive/subtractive system... And the Tibetan "half" digits sort exactly equal to the corresponding non-"half" digits even though they have a lower value; this would merit a supplementary variant decomposition for collation purposes, or setting the third weight to 03 instead of 02 in the variant selector appended for the standard Tibetan digits, or setting the third weight to 01 for the Tibetan half digits (but there is probably a problem in the NFKD decomposition of the half digits).

    Why not, then, define compatibility variant selectors in the UCD for those decompositions, so that sequences using them after a base character collate correctly, but are recomposed when creating a normalized NF*C or NF*D string form?

    If such new variant selector characters cannot be defined now, the same effect could be achieved by creating an extended version of the NFKD decomposition tables, using the standard VS1...VS256 variant selectors already defined in the UCD.

    For example, the Normalized Collation Decomposition (NCD) of <LATIN CAPITAL LIGATURE OE> would be <LATIN CAPITAL LETTER O, VS1, LATIN CAPITAL LETTER E>.
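
    A minimal Python sketch of what such an extended decomposition table could look like, assuming it is applied on top of NFKD; only the OE entries reflect the example above, and VS1 stands for the existing VARIATION SELECTOR-1 (U+FE00).

        import unicodedata

        VS1 = "\uFE00"  # VARIATION SELECTOR-1, already defined in the UCD

        # Extra "collation decompositions" that NFD/NFKD do not provide.
        NCD_EXTRA = {
            "\u0152": "O" + VS1 + "E",  # LATIN CAPITAL LIGATURE OE
            "\u0153": "o" + VS1 + "e",  # LATIN SMALL LIGATURE OE
        }

        def to_ncd(text):
            """Normalized Collation Decomposition: NFKD first, then the extra
            collation decompositions applied character by character (toy model)."""
            return "".join(NCD_EXTRA.get(ch, ch)
                           for ch in unicodedata.normalize("NFKD", text))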

    This "NCD" form would not (unlike NF* forms which have been freezed, despite there are some incoherences that still allow multiple encodings for the same character with exactly the same semantic) be fixed but would evolve with Unicode UCA versions. It would really simplify the implementation of UCA collation, because such decompositions allow assigning a ***single*** weight (instead of 3 or 4) in nonoverlapping ranges (so that their weight level can be implied) for all characters of strings in NCD form.

    With such a definition of NCD, one can deduce an NCC (Normalized Collation Composition) form by recomposing strings: you get a string which is also in NFC (and NFKC) form, but not necessarily equal to the NFC (resp. NFKC) form of the original string (example: the THAI or LAO letter SARA AM).
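
    A companion sketch for the NCC form, recomposing an NCD string with the inverse of the extended table and then ordinary canonical composition; the table entries are the same illustrative ones as above.

        import unicodedata

        VS1 = "\uFE00"  # VARIATION SELECTOR-1

        # Inverse of the extended NCD table from the previous sketch (illustrative).
        NCC_COMPOSE = {
            "O" + VS1 + "E": "\u0152",  # -> LATIN CAPITAL LIGATURE OE
            "o" + VS1 + "e": "\u0153",  # -> LATIN SMALL LIGATURE OE
        }

        def to_ncc(ncd_text):
            """Recompose an NCD string: apply the extended compositions, then
            canonical composition. The result is in NFC form, but may differ from
            NFC(original) when a compatibility decomposition was applied on the way."""
            for decomposed, composed in NCC_COMPOSE.items():
                ncd_text = ncd_text.replace(decomposed, composed)
            return unicodedata.normalize("NFC", ncd_text)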

    The idea behind it is a better collation of strings that are linguistically equivalent, and this would allow correcting the NF* forms after they are frozen in a Unicode release. Some of those new "equivalences" could be independent of the language (when they just correct an error initially frozen into the NF* canonical or compatibility decompositions of the released UCD), while still allowing language-specific tailoring of such decompositions.

    To make this possible, there could be new reserved characters assigned outside the UCD, in plane 14 (U+Exxxx), explicitly for special collation purposes. They would be non-characters (like surrogates), but the new file would have a format similar to the UCD, with additional decompositions for already defined characters that are otherwise not decomposable by the NFD or NFKD rules.

    The proposed decompositions would supply corrections that should have been present in the UCD but were frozen out of the NF* forms, plus decompositions of existing characters using the collation non-characters defined in the same file.

    With this file, we could then give a single default UCA weight to every character, in ranges specific to its collation order in UCA, with values that do not overlap. If needed, this single weight could be decomposed into a symbolic base weight (representing the script for which the weights are defined and the collating level) plus an offset. The list of symbolic base weights for fully decomposed collation elements could be defined in a separate file, one per range, providing several classes of collation whose base ordering could be overridden by user preferences or by language-specific tailoring (where the whole range of characters for a script can simply be reordered).
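
    Finally, a toy sketch of the single-weight idea: every character of an NCD string (including the collation selectors) is looked up in exactly one non-overlapping range, expressed as a symbolic base plus an offset, so that a tailoring only has to reorder the bases. All numbers below are invented for illustration.

        VS1 = "\uFE00"

        # Symbolic base weights: one non-overlapping range per (script, level) class.
        BASES = {
            ("LATIN",    "PRIMARY"):   0x10000,
            ("LATIN",    "SECONDARY"): 0x02000,  # e.g. the ligating variant selectors
            ("CYRILLIC", "PRIMARY"):   0x20000,
            ("DIGITS",   "PRIMARY"):   0x08000,
        }

        # Per-character single weights, stored as (class, offset) pairs so that a
        # tailoring or user preference only needs to reorder the class bases.
        SINGLE_WEIGHT = {
            "O": (("LATIN", "PRIMARY"), 0x0E),
            "E": (("LATIN", "PRIMARY"), 0x04),
            VS1: (("LATIN", "SECONDARY"), 0x01),
        }

        def sort_key(ncd_text, bases=BASES):
            """One integer per character of a string already in NCD form; only the
            few characters listed above are handled in this toy model."""
            return [bases[cls] + offset
                    for cls, offset in (SINGLE_WEIGHT[ch] for ch in ncd_text)]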


