L2/00-141

Re: Proposed Changes to UTR#10: Unicode Collation Algorithm
Date: 2000.04.24
By: M. Davis, K. Whistler

The following are changes that should be made to the text of UTR#10.

3.2 Default Unicode Collation Element Table

1. Fixes for version numbering for consistency with all other "commands"

Old

New

<version> := <major>.<minor>.<variant> <eol>

@<version> := <major>.<minor>.<variant> <eol>

2. Fixes for contracting characters, add pointer to charts

Old

New

The Default Unicode Collation Element Table is available in two parts:

http://www.unicode.org/unicode/reports/tr10/basekeys.txt

http://www.unicode.org/unicode/reports/tr10/compkeys.txt

The file basekeys.txt lists all the Unicode characters that correspond to a single collation element. The file compkeys.txt lists all the Unicode characters that correspond to a sequence of collation elements.

The Default Unicode Collation Element Table is available in three parts:

http://www.unicode.org/unicode/reports/tr10/basekeys.txt

http://www.unicode.org/unicode/reports/tr10/compkeys.txt

http://www.unicode.org/unicode/reports/tr10/ctrckeys.txt

  • The file basekeys.txt lists all the Unicode characters that correspond to a single collation element.
  • The file compkeys.txt lists all the expanding Unicode characters, those that correspond to a sequence of collation elements.
  • The file ctrckeys.txt lists all of the weights for contractions, where a single collation element corresponds to a combining character sequence. These are provided for those instances where a precomposed character had to be given a distinct primary weight in the main weight table, which implied that the canonically equivalent combining character sequences should also be given the same weights. Entries in ctrckeys.txt currently deal with Indic two-part vowels and with Cyrillic accented characters, to match the expected collating behavior for those scripts.

A visual representation of the complete table can be found at http://www.unicode.org/unicode/reports/tr10/charts

3.2.2 Alternate Weighting

1. There was a missing case in the formulation of Shifted, necessary to process the completely ignorables correctly.

Old

New

Shifted: Variable collation elements are set to ignorable at levels one through three. In addition, a new final-level weight is appended, one that is equal to the old primary weight. All non-variable collation elements get a fourth-level weight equal to FFFF.
  • SPACE would have the value [.0000.0000.0000.0209]
  • Capital A is changed to the value [.06D9.0020.0008.FFFF]
Shifted: Variable collation elements are set to ignorable at levels one through three. In addition, a new final-level weight is appended, whose value depends on the type:
Type L4 Examples
Completely Ignorable 0000 NULL
[.0000.0000.0000.0000]
 
Variable old L1 SPACE
[.0000.0000.0000.0209]
Neither FFFF Capital A
[.06D9.0020.0008.FFFF]

2. To allow for POSIX-style positions:

Blanked

Non-
ignorable

ShiftLow

ShiftHigh

death
de luge
de-luge

deluge
de‐luge
de Luge
de-Luge

deLuge
de‐Luge
demark

de luge
de Luge
de-luge
de-Luge
de‐luge
de‐Luge

death
deluge
deLuge
demark

death
de luge
de-luge
de‐luge

deluge
de Luge
de-Luge
de‐Luge
deLuge

demark

death
deluge
de luge
de-luge
de‐luge

deLuge
de Luge
de-Luge
de‐Luge

demark

4.1 Step 1: Normalize

1. Fix a typo: the text says "three" but there are actually only two bullets after it!

Old

New

Step 1. Produce a normalized form of each input string, applying the following three steps: Step 1. Produce a normalized form of each input string, applying the following two steps:

2. In step 1.1, there is a degenerate case that needs to be caught to make the algorithm more complete. We could have multiple rearranging characters in a row, e.g. suppose that R and are each rearranging.

Old

New

1.2. If any character is marked as rearranging (see §3.1.3 Rearrangement), swap it and the succeeding character. 1.2. If any character is marked as rearranging (see §3.1.3 Rearrangement), swap it and the succeeding character (if there is one). In practice, rearranging characters should never appear adjacent to one another. If for some reason they do, then successive pairs in the sequence will be swapped.

For example, if digits 1-4 were rearranging, then
"1x...12...123y" =>  "x1...21...21y3"

4.2 Step 2: Produce Array

1. Section 3.2 describes that weights are derived if they are not in the table, but this point should be added explicitly in the text of the algorithm in Step 2.4 to avoid confusion.

Old

New

2.4. Fetch the corresponding collation element(s) from the table

2.4. Fetch the corresponding collation element(s) from the table if there is a match. If there is no match, synthesize a weight as described in §7.1 Derived Collation.

7.1 Derived Collation Elements

1. Add out-of-range values, make clear implementations can throw exceptions, and do minor editing on that text for clarity.

Old

New

Non-characters (values of the form xxFFFE or yyFFFF) and unpaired surrogates are ignored: that is, they are mapped to [.0000.0000.0000.].

lllegal codepoints

Certain codepoints are illegal in a data stream. These include non-characters (codepoints with hex values ending in FFFF or FFFE), unpaired surrogates (codepoints between D800 and DFFF), and out-of-range values (< 0 or > 10FFFF). Implementations may also choose to treat these as error conditions and respond appropriately, such as by throwing an exception.

If they are not treated as an error condition, they must be mapped to [.0000.0000.0000.], and thus ignored.

2. Fix the formulation of the paired code values. The problem is that you can't let the second primary weight have a primary weight of 0000! So you have to use the bottom 15 bits instead of 16 to make sure you can turn a bit on to avoid 0000. Also added some more explanation, and changed from the minimal values to simply 0001 (also avoids the zero problem).

Old

New

Unassigned code points (including surrogate pairs) are mapped to a sequence of two collation elements based on their Unicode code point value. (These will include private use characters that are not assigned in the implementation environment.) These code points can range from 0 to 10FFFF16, and are given first collation elements with a range of 17 primary values, and a second collation element with a primary consisting of the remaining bits in the code value:

00XXXX => [.FFE0.0020.0002.][.XXXX.0000.0000.]

...

10XXXX => [.FFF0.0020.0002.][.XXXX.0000.0000.]

 

Legal codepoints

A legal codepoint that is not explicitly mentioned in the table is mapped a sequence of two collation elements in the following way. The result of this process are collation elements that are sorted in codepoint order, that do not collide with any explicit values in the table, and that can be placed anywhere (e.g. at UNDEFINED) with respect to the explicit collation element mappings (by default, they go after all explicit collation elements).

To derive the collation elements, the codepoint is separated into two parts, chosen for the right numerical properties. First, separate off the top 6 bits of the codepoint. Since codepoints can go from 0 to 10FFFF, this will have values from 0 to 2116 (= 3310). Add this to the UNDEFINED.

AAAA = UNDEFINED + (CP >> 15);

Now take the bottom 15 bits of the code point. Turn the top bit on, so that the value is non-zero.

BBBB = (CP & 0x7FFF) | 0x8000;

The mapping given to CP is then given by:

CP => [.AAAA.0001.0001.][.BBBB.0000.0000.]

If a fourth or higher weights are used, then the same pattern is used: they are set to 0001, etc. in the first collation element and 0000 in the second. (Since all distinct codepoints have different AAAA/BBBB combination, the exact values don't matter as long as they are non-zero in the first collation element.)

The value for UNDEFINED for the Default Default Unicode Collation Element Table is FF80. This is larger than any explicit primary weight, and thus will not collide with explicit weights. It is not generally necessary to tailor this value to be within the range of explicit weights. However if this is done, the explicit primary weights must be shifted so that none are between UNDEFINED and UNDEFINED + 34.

3. Right now there is a mismatch between Hangul Syllables and Jamo. This text fixes that:

Old

New

CJK Ideographs and Hangul Syllables are not explicitly mentioned in the default table, and are mapped to collation elements that are derived from their Unicode code point value. The primary weight is equal to their Unicode character value. Their secondary and tertiary values are the respective minimal values for those levels: in the Default Unicode Collation Element Table, these are 0020 and 0002 respectively.

CJK Ideographs and Hangul Syllables are not explicitly mentioned in the default table. CJK ideographs are mapped to collation elements that are derived from their Unicode code point value. The primary weight is equal to their Unicode character value. Their secondary and tertiary values are the respective minimal values for those levels: in the Default Unicode Collation Element Table, these are 0020 and 0002 respectively. All other levels, if any, are 0001. (Since all distinct codepoints have different primary weights, the exact values don't matter as long as they are non-zero.)

The collation algorithm requires that Hangul Syllables be decomposed. However, if the table is tailored so that the primary weights for Hangul Jamo (and all related characters) are adjusted, then the Hangul Syllables can be left precomposed and treated in the same way as CJK ideographs. That will provide a collation which is approximately the same as UCA, and may be sufficient in environments where individual jamo are not expected.

The adjustment is to move each initial jamo (and related characters) to have a primary weights corresponding to the first syllables starting with that jamo, and make all non-initial jamo (and related characters) be ignorable at a primary level.

Data

There are a number of issues with the data that need to be fixed. Some of these are listed in tables in http://www.macchiato.com/unicode/CollationMismatch.html. The format of that table will be discussed in the meeting.

  1. Omission of the contractions for Tibetan two-part vowels
  2. Consistency with Normalization form C (need to handle the composition exclusions listed in CollationMismatch.html under Mismatches with Form C)
  3. Consistency with Normalization form D (need to add decompositions that changed from 2.1.9 to 3.0, fixing the mismatches in CollationMismatch.html under Mismatches if Normalization D is OFF)
  4. Addition of collation elements for the new 3.0 characters

We plan to do #1 in the immediate future, but do the others all together later on.

Wordsmithing

The text may need additional changes for clarity beyond what is noted above. These would be approved by the editorial committee, as usual.