Re: Computing default UCA collation tables

From: Mark Davis (
Date: Mon May 19 2003 - 20:14:15 EDT

  • Next message: Jim Allan: "Re: Decimal separator with more than one character?"

    This is a very long document; I only have a few brief comments.

    1. Much of the UCA is derivable from a simpler set of orderings and
    rules. The format of the table, however, is intended to make it usable
    without having a complex set of rules for derivation.

    2. However, you or anyone else could make a modified version which was
    simplified in one way or another. For example, ICU preprocesses the
    table to reduce the size of sort keys (see the ICU design docs if you
    are curious: There are other ways that
    someone could preprocess the table. For example, you could also drop
    all those characters whose weights are computable from their NFKD
    form, for example, and then compute them at runtime.

    3. Scattered in and among your analysis are points where you believe
    there is an error. I'd like to emphasis again that the UTC does not
    consider arbitrary email on the mailing lists on its agenda. If there
    are items that you would like to see considered, you can extract them
    (and their justification) from this document, and use the feedback
    mechanism on the Unicode site to submit them.

    Märk Davis
    IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    (408) 256-3148
    fax: (408) 256-0799

    ----- Original Message -----
    From: "Philippe Verdy" <>
    To: <>
    Sent: Thursday, May 15, 2003 02:36
    Subject: Computing default UCA collation tables

    > After studying how the default UCA collation keys are assigned in
    the existing (3.1.1) allkeys.txt file, I can figure out that this huge
    table (used as a base for all further tailorings) can really be
    simplified a lot (note that the 4th additional key contains the code
    for the base character and can always be dropped and generated
    > Note above, my comments or suggestions for improvements are shown
    within *** pairs.
    > 1) First there's a set of fully ignorable characters, with no
    specific order (the 4th key is always 0): this designates all
    characters that can be safely deleted from the input string for
    comparison purpose. They could be represented by just listing the
    characters or some general character classes like Cc for control
    characters in ISO636 (0-1F, 7F) or ISO6429 (80-9F), and format
    characters like ZeroWidthSpace and Bidi Controls like
    LeftToRightOverride (200B-200F, FEFF), Shaping controls (206A-206F),
    Interlinear Annotations (FFF9-FFFC), LanguageTags (E0001,
    E0020-E007F), and very few script-specific formatters in Syriac and
    Mongolian (70F, 180B-180E). These can be handled at run-time in UCA
    applications by exceptions to general table lookups for each script
    (using the small list of Unicode allocation blocks to determine the
    > All of them use [.], and could be represented simply by a
    single [.00] byte in collation keys for strings (other bytes implied).
    > 2) Then there are a set of optionally ignorable characters (which
    are ignored at level 3 or below), for scripts in this order: Cyrillic,
    Hebrew, Arabic, Thai, Tibetan, Musical combining symbols. Note that
    this order corresponds also to the default order, and correspond to
    extended versions of these scripts, with characters that are not
    needed (most often) to recognize words (such as Hebrew points).
    > All of them use key [.0.0.0.UUUUU] where UUUUU is the original
    character. Here also the key is implicit, and can be represented by a
    single [.00] byte in collation keys for strings (other bytes implied).
    > Note that on all the other collation keys discussed after, the L4
    key is always the codepoint for the original (not canonically
    decomposed) codepoint, even if it's a compatibility character with a
    singleton canonical decomposition. So we do not need to store the L4
    key which can be computed algorithmically by a simple identity
    > 3) Then there are all combining characters. They are ordered by
    script (but I think ***they should be first ordered by combining class
    to also match directly the NFD order or grouped matching when sorting
    NFC or NFKC strings***). All their L1 key are 0, and these characters
    arrange in each category mainly on L2 key, where L3 key is almost
    always 02, and L4 is infered implicitly as indicated above.
    > 3.1) general combining diacritics for Latin/Cyrillic/Greek in block
    U+03xx. Here also, note that a few are canonically decomposable or
    equivalents, and in this case, the NFD equivalent indicates the real
    characters for which the collation keys are infered. These are marked
    with a QQC or QQK field in the comment line, and could be removed from
    the table as they are computable. Matching exactly the existing UCA
    order requires writing a few tailoring rules for the begining of this
    set, for example the Combining LowLine, CommaAbove, ReversedCommaAbove
    are given lowest L2 priority, despite they have a higher combining
    class than the following diacritics (probably because their language
    usage carries very little importance, or they are considered pedantic
    and their use is almost optional in modern written language and do not
    even change the intonation, or are defined for completeness but
    actually not used in existing standardized languages). Another
    inversion is the solidus overla!
    > y (I don't know why), the Cedilla and Ogonek which come before the
    Macron, and a mix of diacritics starting at CombiningXAbove
    > 3.2) Then there are "ligature halves" for diacritics that should be
    displayed on the grapheme cluster of the last two characters
    > 3.3) Then there are script specific diacritics, ordered by their
    leading compatibility decomposition: Cyrillic, Hebrew points, Arabic
    (many compatibility equivalent deduced from NFKD, are inserted for
    initial, medial, final, or ligated forms, and when this occurs the L3
    is modified from the default 02 value to the compatibility value
    described in the UCA technical report), Syriac, Brahmic diacritics
    (Devanagari to Lao: nukta, candrabindu, anusvara, visarga, anusvara),
    Tibetan, Myanmar (similar to other Indic scripts), Khmer,
    Ideographic/Hangul/Kana tone marks, The Unicode codepoint of these
    characters determine their order in the L1 key. No tailoring needed,
    L3 is always 02,
    > 3.4) Then there are the combining technical characters (in U+20D0 to
    U+20E3). The L3 key is implied (02).
    > Then comes non ignorale characters. the NFKD decomposition drives
    the ordering of collation keys, which are assigned with a non zero L1
    key starting at [.0201.], and grouped in distinct ranges. If strings
    are already decomposed to NFKD, then combining diacritics are
    separated and handled by the previous rules, and there remains only
    base characters, which are distinct by their L1 key, L2 is always
    [..20.] and L3 is always [..02.] unless this is a compatibility
    character which is assigned a L3 key higher than [..02.] according to
    the UCA technical report for each class (this means that only the L1
    key needs to be stored for canonical characters without compatibility
    decomposition for the UCA table, as other collation values are
    > 4) The first range covers all standard Unicode 3 scripts with the
    exception of Han.
    > 4.1) First the standard spacing controls (e.g. TAB or LF), ordered
    by codepoint.
    > 4.2) Then starts the subsection for variable collation keys. This
    section is quite complex andshould be given by tailoring them in a
    lookup table:
    > 4.2.1) First are space characters (and their compatibility
    equivalent), all collated to the same L1 key, and then non-blank
    characters that play the role of a spacing mark (Ogham and Arabic).
    > 4.2.2) Then there are the spacing version of diacritics, simply
    ordered by codepoint (***I think that it's a shame that their order do
    not match the order of non-spacing combining diacritics, and this
    creates confusion***)
    > 4.2.3) Then comes the soft hyphens or syllable boundary "markers"
    (***it's a shame that some "soft" hyphens are placed after the real
    hyphen-minus which always has a visible glyph in any context***), and
    then other hypens or dashes, and KanaMiddleDots (all these are placed
    in the codepoint order, with a single tailoring for the generic
    Soft-Hyphen placed at the beginning of this set)
    > 4.2.4) Then comes generic in-sentence punctuation (and their
    compatibility equivalents): comma, semicolon, colon, followed by very
    script specific equivalent punctuations for groups in the same
    sentence (ordered by script, using the standard script order, then
    ordered by codepoint)
    > 4.2.5) Then are sentence termination punctuation: exclamation,
    question, full stop, leaders/ellipsis. I think that the small fullstop
    is not at the right place and should come before leaders just after
    fullwidth full stop. I also think that leaders should all have the
    same L3 key in the decomposition (not true for the ellipsis which is
    three leaders, but is decomposed with a distinct L3 key for the third
    one). Other script-specific full stop follow, ordered by script.
    > 4.2.6. Then comes subsets of apostrophe, quotation marks. Some
    tailoring is present here with some inversions to make similar signs
    closer (angle quotation marks are placed after strait and comma-shaped
    quotation marks).
    > 4.2.7. Then comes parentheses (note that all compatibility
    decomposable parenthesized characters are ordered at the opening
    parenthesis, according to their decomposition, so all these could be
    dropped from the UCA table and computed), and their compatibility
    equivalent; same thing for other pairs: square bracket, curly bracket,
    tibetan and ogham similar marks, square bracket with quill...
    > 4.2.8. Then comes general technical/typographical symbols: section
    marks, commercial (copyright, registered, at), mathematical (asterisk,
    solidus, ampersand, number sign, percent, permille, per ten thousand),
    daggers and bullets, and prime (***I think that prime characters
    should be placed after apostrophes in 4.2.6***), ditto, insertion
    marks (caret, reference mark, character tie...).
    > 4.2.9. Then comes script-specific punctuation, in the common script
    order then sorted by codepoint: Armenian, Hebrew, Syriac, Mongolian,
    Brahmic (Devanagari to Thai), Tibetan, Myanmar, Khmer. (*** I think
    that some of them, such as Armenian apostrophe, should be reordered
    after some general punctuation groups above in 4.2.4 to 4.2.8 ***)
    > 4.2.10) Then comes some problematic sections for modifier letters
    and some signs: this is where prime is defined as a modifier (*** but
    should'nt it be ordered with the technical primes in 4.2.8, which
    themseles should sort after apostrophes in 4.2.6?***), other modifier
    letters (***which are very near from spacing diacritics in 4.2.2 above
    and should better e placed there for collation purpose***), and the
    degree sign (***should better be sorted just the spacing ring in
    4.2.2***). Tailoring may correct this, but the UCA table should work
    in this.
    > 4.2.11) Then comes some script-specific technical signs or marks
    (not punctuation), simply ordered by codepoints, starting at the
    cyrillic thousands sign (most of them are Tibetan, but there are one
    for Arabic, Bengali, Oriya, Thai, and Canadian Syllabics).
    > 4.2.12) Then are all the arrow symbols (a few are canonically or
    compatibly decomposable and marked with a QQC and QQK field in the
    comment, canonical decompositions using a combining overlay solidus)
    > 4.2.13) Then are all the mathemetical symbols. (***Some of them
    should be reordered with Greek characters or with characters with non
    mathematical semantics but with identical script name, to match them
    with usage notaly for searches where many confusions are possible***).
    The current ordering is strange: look for example classical 4
    mathematical operators (plus sign, division sign,multiplication sign)
    not grouped with minus sign,division slash, divides. Some of them will
    definitely match poorly in searches such as tilde, angle, union, and
    the remaining ones should be grouped by function (identical to,
    equivalent,...). Some operators also exist with circled variants that
    should be grouped with their non circled variants, like for circled
    digits and letters. Specific tailoring is complex, and UCA should
    evolve to map them more logically. For now, this block just uses the
    codepoint ordering.
    > 4.2.14) Then comes technical symbols, ordered by codepoints. ***Some
    of them may be "unifiable" by tailoring with other generic characters
    for search purpose (for example the diameter symbol may be
    decomposable non compatibly as a circle and solidus overlay, and some
    APL functional symbols could be ordered with Greek letters or
    mathematical symbols, or decomposed non compatibly such as symbols
    with dieresis)***.
    > 4.2.15) Then comes symbols for control characters (there are not
    ignorable, like their control counterparts, so it's best to keep them
    ordered as they are by codepoint order.), OCR characters, box drawing
    characters, block and geometric characters, dingbats and wingdings
    (***the placement of asterisk-like and bullet-like symbols, or
    ornamented punctuations and arrows could be discussed***), Braille
    patterns (assuming that they cannot be confused with punctuation in
    any context, they howerver represent characters according to
    locale-specific transliteration rules, and they would never be
    searched as is, so sorting them does not seem relevant, and it's best
    to keep them sorted as is by their pattern codepoint), byzantine
    musical symbols, and venitian/western musical symbols (***some
    dingbats above could be rearranged to this subrange***).
    > 4.2.16) Then comes Asian symbols: ideographic description characters
    couldbe used to create decomposed ideographs by radical, within a Han
    rendering engine to limit the size of required fonts. It is followed
    by some CJK symbols (***the ordering of the ideographic variation
    indicator is questionable***)
    > 4.2.17) Here are the character and object replacement characters
    (***I don't think this is the right place to put them, as they
    probably should be at end of the "variable" collation elements before
    modifier letters***)
    > 4.2.18) The last part in variable collation elements is for
    international non-base10 numbers or digits (***shouldn't they sort
    after other base10 digits in the non variable section?***)
    > This terminates the variable block, which may still need some
    reworks, to have more "similar" characters tailored to match usage,
    notably for punctuations. There's no gap here before the non variable
    > 4.3) Then comes modifier letters for length or repeat marks (***may
    be they should sort in the variable block like other spacing
    > 4.4) Then comes all currency signs.
    > 4.5) Then letter-like symbols (***shouldn't they sort with their
    corresponding letters?***)
    > 4.6) Then non-base10 Roman numeral (***why don't they sort like
    characters in section 4.2.18?***)
    > 4.7) Then music signs (*** why don't they sort with musical
    characters insection 4.2.15?***)
    > 4.8) Then base10 digits (composite compatibility characters are
    sorted with their decomposed digit value, but with a distinct L3
    value). They are not ordered by script, as they all use the same L1
    key, but they are decomposed as if there was a diacritic after the
    ASCII decimal digit, in a supplementary NFKD-like decomposition. This
    pseudo-diacritic is simply acting as a script modifier, one for each
    script, allocated in the standard script order after those declared in
    section 3 above with a zero L1 key, and a new L2 key value per script,
    and a default (implied) L3 key value 02. (This decomposition is quite
    similar to similar annotations used in some scripts like Greek that
    combine a letter and a numeric modifier diacritic to create
    traditional numbers, except that Unicode UCD does not define such
    pseudo-diacritic for script transliteration in scripts that have
    distinct characters for digits, and so the UCD cannot use it in NFD or
    NFKD decompositions). There's such pseudo-di!
    > acritic defined in UCA decompositions for Arabo-Indic, Extended
    Arabo-Indic, Brahmi scripts (from Devanagari to Thai and Lao),
    Tibetan, Myanmar, Ethiopic, Khmer, Mongolian, Hangzhou (traditional
    Chinese written dialect).
    > In this section, ideographic symbols for months, days, and hours are
    decomposed as per their decomposed NFKD value (which includes one or 2
    digits and a ideographic character), and sorted as per their first
    digit, only changing the L3 key of digits to 04, and appending a HAN
    collator for the ideographic character for month/day/hour (splitted in
    two parts, one collating in [L1=FB40+high bits, L2=default 20, L3=max
    1F], the second one collating in [8000+low bits, L2=0, L3=0]); other
    digit variants are also added using the L3 key described in the tale
    of the UCA reference, as well as fractions decomposed the standard way
    like other NFKD decompositions. So this large set of UCA entries can
    be reduced to just 10, for the base digits, with help of NFKD and
    script-specific pseudo-decomposition.
    > 4.9) Then comes all Latin letters (accented versions are also
    decomposed with their NFD, then NFKD only when creating UCA collation
    keys), from A to Z. All font variants (such as mathematical) collate
    to the same distinct L3 value, but share the same L1/L2 keys than the
    base letters. Capital versions are just changing the L2 key from the
    base small letter.
    > Some letters are inserted in the list, to collate specially at some
    place (for example the ligated "ae" letter is placed between "a" and
    "b", "dz" is placed between "d" and "e", "ij" is placed between i and
    j, german sharp s is treated like a pair of s, as if all those were
    ligatures, Eth is somewhere after "dz" but before "e"), but ***there
    are some places to tailor these extended Latin letters (for example
    the Latin alpha, or turned A or turned alpha)***.
    > Some square symbols used in ideographic scripts to represent
    international units are decomposed to NFKD before generating UCA
    collation keys, as well as symbols like "c/o, and roman numerals
    independantly of their numeric value (but why only roman numerals 1 to
    9, and not higher ones?
    > ***In my opinion, this order should also interleave transliterated
    Greek and Cyrillic characters, notably those that are highly similar
    like A and Alpha, look for example such usage in Serbian that can be
    written either with Latin or Cyrillic scripts. The UCA would mostly
    match the Latin sort order, matching the traditional Cyrillic or Greek
    sort order is a question of language, so of application level
    tailoring that should be done too in sync for Latin***
    > 4.10) Then comes Greek letters, in traditional Greek sort order as
    per the basic Greek codepoints (diacritics decomposed, and case marked
    in L2 key), plus Coptic letters ***See my comment on this in section
    4.9 (for example look how Latin alpha is sorted: why shouldn't it sort
    like Greek alpha, and not only as a Latin A ?). APL functional symbols
    should also be sorted there (see my comment in 4.2.14)***. Here also
    their mathematical font variants are sorted along the corresponding
    base Greek letter from the NFKD decomposition.
    > 4.11) Then comes Cyrillic letters, in traditional Cyrillic sort
    order as per th basic Cyrillic codepoints (diacritics decomposed, and
    case marked in L2 key). ***See my comment on this in section 4.9 (for
    example look how Serbian sorts in both Latin or Cyrillic scripts ?).
    Some extended letters at end of the basic cyrillic alphabet don't have
    a clear sort order (so tailoring may be needed in applications).
    > 4.12) Then comes Georgian and then Armenian (but case difference is
    only a L3 variant, as if it was a diacritic).
    > 4.13) Then comes Hebrew (no special problem here, as hebrew points
    are decomposed canonically, ***however some lines are commented out,
    disabling the canonical decomposition of dagesh diacritics on
    compatibility final forms***), and then Arabic (and its many
    compatibility forms or ligatures). Here also, most lines could be
    dropped from the table using simply the NFD and NFKD decompositions.
    > 4.14) Then comes Syriac (note that some letters are decomposed with
    the Garshuni diacritic, which is still not defined in Unicode as a
    separate diacritic with a codepoint, so the decomposition appears only
    in UCA), Thaana,
    > 4.15) Ethiopic syllables (could use a pseudo-decomposition as
    pseudo-consonnant + pseudo-vowel sign, similar to Brahmic scripts with
    an implied A)
    > 4.16) Brahmic scripts are sorted individually, ***despite they have
    a common structure, and could be combined (using official
    transliteration rules), at least Devanagari, Bengali, Gurmukhi,
    Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam***
    > 4.17) Then comes the more complex Sinhala alphabet. ***However some
    lines in the UCA table are decomposing sequences that represent the
    same non-decomposed letter with identical collation; some are
    commented out, some aren't I just wonder if it's an error or temporary
    edit in the file that was forgotten***
    > 4.18) Then comes Thai and Lao characters ***(with ambiguities coming
    from multiple encodings of the same letter, either decomposed or
    precomposed, such as SARA AM, and this may be an error in the
    canonical decomposition table, and could cause problems in
    applications that expect a unique canonical encoding for the same
    > 4.19) The Tibetan script is more simple (subjoined letters are
    however sorted after the corresponding normal letter and not in their
    codepoint value). ***However it also uses multiple source encodings
    for the same vowel signs, and this may be ommissions in the NFD
    decomposition tables, notably for long vowels ***
    > 4.20) The Myanmar script is sorted like Thai and Lao ***(with the
    same problem on some long vowels)***
    > 4.21) Khmer comes after using simple codepoint ordering.
    > 4.22) Mongolian use simple codepoint ordering on L1 ***(except that
    TODO, SIBE, MANCHU, and ALI GALI versions of some vowels or
    consonnants are tailored after the standard corresponding letter,
    using L1 keys instead of being considered as L2 or L3 variants).***
    > 4.23) Cherokee syllables are then sorted simply by codepoints (that
    logically use a consistant ordering of their final vowel, and grouping
    characters by similar soft and hard similar leading consonnants with
    same final vowel)
    > 4.24) Canadian Syllabics is quite similar to Cherokee with an
    already consistant ordering of codepoints ***In my opinion the R-Cree,
    West-Cree, Y-Cree, N-Cree, Naskapi, Athapascan, Carrier or Sayisi or
    medial variants that sort just after the corresponding standard
    syllable should use a L2 distinction, similar to case distinction, and
    not a L1 distinction. Some other codepoints, separated in the
    allocation space should also be grouped such as RE or THE syllables
    > 4.25) Ogham letters are simply ordered by codepoint.
    > 4.26) Runic uses precomputed tailoiring and does not follow strict
    codepoint ordering
    > 4.27) Hangul is computable algorithmically for syllables by
    decomposition, and compatibility letters are sorted from their NFKD
    equivalent using L3 differences.
    > 4.28) Hiragana and Katakana are mixed in the collation table by L2
    equivalence (using L3 differences). Simple derivation from NFKD
    removes halfwidth or circled variants by creating L3 differences or by
    creating equivalent sequences for squared words. However small
    variants sorting before the standard form with L3 difference does not
    require tailoring, unlike grouping Katakana with Hiragana with only a
    L3 difference. Apart from these L3 differences, there's no L2
    difference. Softened/Voiced consonnants are sorted by appending the
    collation element for the voice mark diacritic to the "similar" hard
    consonnant (Hira/Kata differences are more influent)
    > 4.29) Bopomofo is mostly sorted with L1 differences from codepoints
    order (only final versions use L3 difference and don't follow the
    codepoint order) with some tailored exceptions like LETTER NGG.
    > 4.30) New standardized scripts for Old Italic and Gothic are sorted
    by codepoints
    > 4.31) Deseret is just rearranged to sort capital letters with their
    base small letter with L3 difference, and fixed L2 value.
    > This terminates the first contiguous range of non null L1 keys. Then
    > 5) CJK ideographs, CJK compatibility ideographs and CJK radicals are
    allocated with implicit weights, using pairs of collation keys, the
    first one allocated at [L1=FB40+high bits, with implicit L2=20, L3=2],
    and the second at [L1=8000+low bits, L2=L3=0]. Compatibility
    characters are simply remapped to their canonical equivalent before
    collating. All elements in this area can be simply computed though
    NFD/NFKD decomposition with implicit weights (some of these include
    codepoints allocated in the extension ideographs, but are decomposable
    to basic CJK ideographs). ***So all this large table can be removed.
    So codepoints do not seem to have any tailored sorting and use a
    default weight from codepoint values.***
    > 6) A second range for extension CJK ideographs is similar except
    that it uses L1=FB80+high bits for the first collation key in the
    generated pair. ***Here also the table could be avoided as only
    canonically decomposable characters are present.***
    > ====
    > What I want to demonstrate here is that the large default UCA table
    contains a lot of dedundance and does not help creating a good
    implementation. A simplified version should be designed and documented
    in the UCA algorithm reference, so that the existing table could be
    automatically derived.
    > As it is, there are only three cases:
    > a) fully ignorable, and optionally fully ignorale codepoints that
    could be deduced from character properties L1=L2=L3=0, found in UCD.
    > b) combining characters, and pseudo-diacritics for script overrides:
    L1=0, L2 in [00..1F], L3=02 (other values of L3 can be generated
    automatically from compatibility equivalences found in UCD)
    > c) all other scripts, ordered individually with L1>=0201, L2=20
    (other values can be generated from case properties), L3=02 (other L3
    values generated automatically from compatibility euivalences found in
    > I do think that a new much shorter table format should be designed
    which lists only ordering of non decomposable characters per script,
    and removing all characters which have same case folding as their base
    character. The interest of this approach would be less works by
    Unicode to update this table, and easier generation of binary weights
    in applications (notably if there are user or language preferences
    when rearranging the relative scripts, instead of using the default
    UCA ordering of scripts, which roughly imitates the ordering of
    Unicode blocks).
    > The implementation of UCA collation should reuse the work already
    performed to implement NFD and NFKD decomposition, (possibly by adding
    a few pseudo-decompositions to the standard NFKD decompositions to
    handle the case UCA decompositions of digits), and a few character
    properties such as existing and standardized case folding tables.
    > So the standard UCA table should not override the above standardized
    tables, but should only list some "tailored" L1 weights in each script
    (needed only because codepoints of base characters do not match the
    common UCA ordering). I can see that the allkeys.txt was manually
    edited, and sometimes includes probable errors. The only interesting
    part of the default UCA collation table (DUCET) is the primary (L1)
    ordering of base characters within each script.
    > The current complication of the existing "allkeys.txt" table may
    explain why the updated UCA table could not be prepared when Unicode 4
    was released, and it probably demonstrates that the UCA collation
    algorithm is still not complete, and it should not be considered as a
    full standard as it is.
    > -- Philippe.

    This archive was generated by hypermail 2.1.5 : Wed May 21 2003 - 21:13:40 EDT