Draft Unicode Technical Report #10

Unicode Collation Algorithm

Revision 4
Authors Mark Davis and Ken Whistler
Date 1999-06-23
This Version http://www.unicode.org/unicode/reports/tr10/tr10-4.html
Previous Version http://www.unicode.org/unicode/reports/tr10/tr10-2.html
Latest Version http://www.unicode.org/unicode/reports/tr10/
Unicode Technical Reports http://www.unicode.org/unicode/reports/

This report presents the specifications of the Unicode Collation Algorithm. It is divided into the following sections:

Status of this document

This document has been considered and approved by the Unicode Technical Committee for publication as a Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the Standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative or as normative specification. Please mail corrigenda and other comments to errata@unicode.org.

§1. Scope

The Unicode Collation Algorithm describes how to compare two Unicode strings, using the data in a Collation Element Table, and consistent with the linguistic requirements of The Unicode Standard, Version 3.0, Section 5.XX Sorting and Searching. (Readers should be familiar with that section before proceeding.) It also supplies a Default Unicode Collation Element Table as the basis for collation. (Note: this file is currently limited to the Unicode 2.1.9 repertoire. It will be updated to Unicode 3.0 in the future.)

§1.1 Goals

The algorithm is designed to satisify the following goals.

  1. A complete, unambiguous, specified ordering for all characters in Unicode.
  2. A complete resolution of the handling of canonical and compatibility equivalences as relates to the default ordering.
  3. A complete specification of the meaning and assignment of collation levels, including whether a character is ignorable by default in collation.
  4. A complete specification of the rules for using the level weights to determine the default collation order of strings of arbitrary length.
  5. A mechanism (tailoring) to override the default order to create language-specific orderings in detail.
  6. An algorithm that can be efficiently implemented, both in terms of performance and in terms of memory requirements.

Given the standard ordering and the tailoring for any particular country, any two companies or individuals--with their own proprietary implementations--could take any arbitrary Unicode input and produce exactly the same sorted output. In addition, when given a tailoring specifying French accents this algorithm passes the Canadian (and ISO DIS 14651) benchmark, as mentioned in the section on Processing French Accents.

§1.2 Non-Goals

The Default Unicode Collation Element Table explicitly does not provide for the following features:

  1. reversibility: from a Collation Element you are not guaranteed that you can recover the original character.
  2. numeric formatting: numbers composed of a string of digits or other numerics will not necessarily sort in numerical order.
  3. API: no particular API is specified or required for the algorithm.
  4. title sorting: for example, removing articles such as a and the during biblipgraphic sorting is not provided.

§1.4 Summary

Briefly stated, the Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared to give the correct comparison between the strings for which they were generated.

The Unicode Collation Algorithm assumes multiple-level key weighting, along the lines widely implemented in IBM technology, and as described in the Canadian sorting standard (CAN/CSA Z243.4.1) and the proposed International String Ordering standard (ISO/IEC 14651).

By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to:

  1. alphabetic ordering
  2. diacritic ordering
  3. case ordering.

A fourth level may be used for tie-breaking between strings not distinguished at the first three levels.

This design allows implementations to produce culturally acceptable collation, while putting the least burden on implementations in terms of memory requirements and performance. In particular, Collation Element Tables only require storage of 32 bits per significant character.

However, implementations of the Unicode Collation Algorithm are not limited to supporting only 3 levels. They are free to support a fully customizable 4th level (or more levels), as long as they can produce the same results as the basic algorithm, given the right Collation Element Tables. For example, an application which uses the algorithm, but which must treat some collection of special characters as ignorable at the first 3 levels and must have those specials collate in non-Unicode order (as, for example to emulate an existing EBCDIC-based collation), may choose to have a fully customizable 4th level. The downside of this choice is that such an application will require more storage, both for the Collation Element Table and in constructed sort keys.

The Collation Element Table may be tailored to produce particular culturally required orderings for different languages or locales. As in the algorithm itself, the tailoring can provide full customization for three (or more) levels.

§2. Conformance

There are many different ways to compare strings, and the Unicode Standard does not restrict the ways in which implementations can do this. However, any Unicode-conformant implementation that purports to implement the Unicode Collation Algorithm must do so as described in this document.

The algorithm is a logical specification, designed to be straightforward to describe. Actual implementations of the algorithm are free to change any part of the algorithm so long as any two strings compared by the implementation are ordered the same as they would be by the algorithm. They are also free to use a different format for the data in the Collation Element Table. The sort key is also a logical intermediate object: so long as an implementation produces the same results in comparison of strings, the sort keys can differ in format from what is specified here. (See Implementation Notes.)

The requirements for conformance on implementations of the Unicode Collation Algorithm are as follows:

C1 

Given a well-formed Unicode Collation Element Table, a conformant implementation shall replicate the same comparisons of Unicode strings as those produced by §4 Main Algorithm.

In particular, a conformant implementation must be able to compare any two canonical equivalent strings as being equal, for all Unicode characters supported by that implementation.

C2 

A conformant implementation shall support at least three levels of collation.

A conformant implementation is only required to implement three levels. However, it may implement four (or more) levels if desired.

C3 

A conformant implementation that supports backward levels, alternate weighting or rearrangement shall do so in accordance with this specification.

A conformant implementation is not required to support these features; however, if it does so, it must interpret them properly.

§3. Collation Element Table Format

A Collation Element Table contains a mapping from one (or more) characters to one (or more) collation elements. A collation element is an ordered list of three 16-bit weights. (Implementations can produce the same result without using 16-bit weights--see Implementation Notes.)

The first weight is called the Level 1 weight (or primary weight), the second is called the Level 2 weight (secondary weight), and the third is called the Level 3 weight (tertiary weight). For a collation element X, these can be abbreviated as X1, X2, and X3.Given two collation elements X and Y, we will use the following notation:

Notation

Reading

Meaning

X <1 Y X is primary less than Y X1 < Y1
X =1 Y X is primary equal to Y X1 = Y1
X <2 Y X is secondary less than Y X =1 Y and X2 < Y2
X =2 Y X is secondary equal to Y X =1 Y and X2 = Y2
X <3 Y X is tertiary less than Y X =2 Y and X3 < Y3
X =3 Y X is tertiary equal to Y X =2 Y and X3 = Y3
X <4 Y X is quarternary less than Y X =3 Y and X4 < Y4
X =4 Y X is quarternary equal to Y X =3 Y and X4 = Y4

The collation algorithm results in a similar ordering among characters and strings, so that for two strings A and B we can write A <2 B, meaning that A is less than B and there is only a secondary difference between them. If two strings have no primary, secondary or tertiary difference according to a given Collation Table, then we write A =3 B. If two strings are equivalent (equal at all levels) according to a given Collation Table, we write A = B. If they are bit-for-bit identical, we write A == B.

Note: Where subscripts are not available, forms such as "X <1 Y" or "X1" can be written as "X <[1] Y" and "X[1]" respectively.

If a weight is 0000, than that collation element is ignorable at that level: the weight at that level is not taken into account in sorting. A collation element that is ignorable at Level 1, but not at Level 2 is called a Level 1 ignorable; similarly, a Level 2 ignorable is ignorable at Levels 1 and 2, but not Level 3; a Level 3 ignorable is ignorable at Levels 1, 2, and 3 but not Level 4; and so on. A collation element that is not ignorable at Level 1 is called a non-ignorable. A collation element with zeros at every level is called completely ignorable.

No collation element can have a zero weight at Level N and a non-zero weight at Level N-1. Any collation elements that violate this rule are called ill-formed. The reason for this will be explained below, under Step 4 of the main algorithm.

For a given Collation Element Table, MINn is the least weight in any collation element at level n, and MAXn is the maximum weight in any collation element at level n.

The following are sample collation elements that are used in the examples illustrating the algorithm. Unless otherwise noted, all weights are in hexadecimal format.

Sample Table:

Character

Collation Element

Name

0300 "`" [0000.0021.0002] COMBINING GRAVE ACCENT
0061 "a" [06D9.0020.0002] LATIN SMALL LETTER A
0062 "b" [06EE.0020.0002] LATIN SMALL LETTER B
0063 "c" [0706.0020.0002] LATIN SMALL LETTER C
0043 "C" [0706.0020.0008] LATIN CAPITAL LETTER C
0064 "d" [0712.0020.0002] LATIN SMALL LETTER D

§3.1 Linguistic Features

Linguistic requirements of collation are covered in more detail in The Unicode Standard, Version 3.0, Section 5.XX Sorting and Searching.

§3.1.1 Multiple Mappings

The mapping from characters to collation elements may not be a simple mapping from one character to one collation element: in general, it may map from one to many, from many to one, or from many to many. For example:

Expansions

The Latin letter æ is treated as an independent letter by default. Collations such as English, which may require treating it as equivalent to an <a e> sequence, can tailor the letter to map to the following collation elements, such as in the following example:

Character

Collation Element

Name

00E6 [06D9.0020.0002], [073A.0020.0002] LATIN SMALL LETTER AE; "æ"

In this example, the collation element [06D9.0020.0002] gives the weight values for a, and the collation element [073A.0020.0002] gives the weight values for e.

Contractions

Similarly, where ch is treated as a single letter as in traditional Spanish, it is represented as a mapping from two characters to a single Collation Element, such as in the following example

Character

Collation Element

Name

0063
0068
[0707.0020.0002] LATIN SMALL LETTER C,
LATIN SMALL LETTER H;
"ch"

In this example, the collation element [0707.0020.0002] has a primary value one greater than the primary value for the letter c by itself, so that the sequence ch will collate after c and before d. The above example shows the result of a tailoring of collation elements to weight sequences of letters as a single unit. The Default Unicode Collation Element Table does not contain any contractions. (However, tailored tables may have contractions, and contractions may be added in a future version of the Default Unicode Collation Element Table.)

Any character (such as soft hyphen) between two characters of a contraction will cause them to sort as separate characters.

Note: The Unicode Collation Algorithm handles surrogate pairs as contractions. It thus provides the same level of capability for those characters as for non-surrogates.

Other Multiple Mappings

Certain characters may both expand and contract: see Section 5.XX Sorting and Searching.

§3.1.2 French Accents

In some languages (notably French), accents are sorted from the back of the string to the front of the string. This behavior is not marked in the Default Unicode Collation Element Table, but may occur in tailored tables. In such a case, the collation elements for the accents and their base characters are marked as being backwards at Level 2.

§3.1.3 Rearrangement

Certain characters are not coded in logical order. In Unicode 2.1, these are the Thai vowels 0E40 through 0E44 and the Lao vowels 0EC0 through 0EC4 (this list may be extended in the future, and so is included in the Default Unicode Collation Element Table). For collation, they are rearranged by swapped with the following character before further processing, since logically they belong afterwards. For example, here is a string processed by rearrangement:

input string: 0E01 0E40 0E02 0E03
normalized string: 0E01 0E02 0E40 0E03

§3.1.4 Default Values

Both in the Default Unicode Collation Element Table and in typical tailorings, most unaccented letters differ in the primary weights, but have secondary weights (such as a1) equal to MIN2. The primary ignorables will have secondary weights greater than MIN2. Characters that are compatibility or case variants will have equal primary and secondary weights (e.g. a2 = A2), but have different tertiary weights (e.g. a3 < A2). The unmarked characters will have tertiary weights (such as a3) equal to MIN3.

However, a well-formed Unicode Collation Element Table does not guarantee that the meaning of a secondary or tertiary weight is uniform across tables. For example, a capital A and katakana ta could both have a tertiary weight of 3.

§3.2 Default Unicode Collation Element Table

The Default Unicode Collation Element Table is available in two parts:

http://www.unicode.org/unicode/reports/tr10/basekeys.txt

http://www.unicode.org/unicode/reports/tr10/compkeys.txt

The file basekeys.txt lists all the Unicode characters that correspond to a single collation element. The file compkeys.txt lists all the Unicode characters that correspond to a sequence of collation elements.

This table is constructed to be consistent with the Unicode Canonical Equivalence algorithm, and to respect the Unicode character properties. It is not, however, merely algorithmically derivable from those data, since the assignment of levels does take into account characteristics of particular scripts. For example, in general the combining marks are Level 1 ignorables; however, the Indic combining vowels are given non-zero Level 1 weights, since they are as significant in sorting as the consonants.

Any character may have variant forms or applied accents which affect collation. Thus, for FULL STOP there are three compatibility variants, a fullwidth form, a compatibility form, and a small form. These get different tertiary weights, accordingly. For more information on how the table was constructed, see Weight Derivation.

The default table contains no French accents or expansions. For many languages, some degree of tailoring is required to match user expectations.

§3.2.1 File Format

Each of the files consists of a version line followed by an optional alternative-weight line, optional rearrangement lines, optional backwards lines, and a series of entries, all separated by newlines. A '%' and any following characters on a line are comments. Whitespace between literals is ignored. The following is an extended BNF description of the format, where "x+" indicates one or more x's, "x*" indicates zero or more x's, "x?" indicates zero or one x, and <char> is a hexadecimal Unicode code value.

<collationElementTable> := <version> 
                           <alternate>?
                           <backwards>*
                           <rearrangement>*
                           <entry>+

The version line is of the form:

<version> := <major>.<minor>.<variant> <eol>

The rearrangement line is of the following form. It specifies the characters that are to be rearranged in collation, as discussed above in §3.1.3 Rearrangement.

<rearrangement> := '@rearrange' <charList> <eol>
<charList> := <item> ("," <item> )*

In general, the rearrangement lines are determined by the version of the Unicode Standard, since these are properties always associated with certain characters. For Unicode 3.0, they are the following:

@rearrange 0E40,0E41,0E42,0E43,0E44
@rearrange 0EC0,0EC1,0EC2,0EC3,0EC4

The alternate-weight line has three possible values that may change the weights of collation elements in processing (see §3.2.2 Alternate Collation Elements).

<alternate> := '@alternate ' <alternateChoice> <eol>
<alternateChoice> := 'default' | 'non-ignorable' | 'shifted'

A backwards line lists a level that is to be processed in reverse order.

<backwards> := '@backwards' <levelNumber> <eol>

Each entry is a mapping from character(s) to collation element(s), and is of the following form:

<entry> := <charList> ';' <collElement>+ <eol>
<collElement> := "[" <var> <char> "." <char> "." <char> ("." <char>) "]"
<var> := "*" | "."

In the Default Unicode Collation Element Table, the comment may contain informative tags: CANONSEQ indicates a canonical equivalence, and COMPATSEQ a compatibility equivalence. Here are some selected entries:

0020 ; [*0209.0020.0002.0020] % SPACE
02DA ; [*0209.002B.0002.02DA] % RING ABOVE; COMPATSEQ
0041 ; [.06D9.0020.0008.0041] % LATIN CAPITAL LETTER A
3373 ; [.06D9.0020.0017.0041] [.08C0.0020.0017.0055] % SQUARE AU; COMPATSEQ
00C5 ; [.06D9.002B.0008.00C5] % LATIN CAPITAL LETTER A WITH RING ABOVE; CANONSEQ
212B ; [.06D9.002B.0008.212B] % ANGSTROM SIGN; CANONSEQ
0042 ; [.06EE.0020.0008.0042] % LATIN CAPITAL LETTER B
0043 ; [.0706.0020.0008.0043] % LATIN CAPITAL LETTER C
0106 ; [.0706.0022.0008.0106] % LATIN CAPITAL LETTER C WITH ACUTE; CANONSEQ
0044 ; [.0712.0020.0008.0044] % LATIN CAPITAL LETTER D

The entries in each file are ordered by collation element, not by character. This makes it easy to see the order in which characters would be collated. Although this document describes collation elements as three levels, the file contains a fourth level (as in [.0712.0020.0008.0044]) which is identical with the character (or ignorable). For more information on that, see Avoiding Normalization under Implementation Notes.

Implementations can also add more customizable levels, as discussed above under conformance. For example, an implementation might want to be capable not only of handling the standard Unicode Collation, but also capable of emulating an EBCDIC multi-level ordering (having a fourth-level EBCDIC binary order).

Note: The precise values of the collation elements for the characters may change over time as new characters are added to the Unicode Standard. If a reference to the precise values is required, the version number must be referenced.

§3.2.2 Alternate Weighting

Collation elements that are marked with an asterisk in a Unicode Collation Element Table are known as variable collation elements.

Character

Collation Element

Name

0020 " " [*0209.0020.0002] SPACE

Based on the setting of the alternate weighting tag, collation elements can be reweighted in the following ways:

Note:

The shifted tag provides for improved orderings when the variable collation elements are ignorable, while still using only requiring three fields to be stored in memory for each collation element. It does result in somewhat longer sort keys, although they can be compressed (see §6.1 Reducing Sort Key Lengths and §6.3 Reducing Table Sizes).

Here are three examples of orderings using the different options, where '-' is either a hyphen-minus or a hyphen:

default non-ignorable shifted
a b
a-b
ab
a-b
a B
a-B
aB
a-B
a b
a B
a-b
a-B
a-b
a-B
ab
aB
a b
a-b
a-b
ab
a B
a-B
a-B
aB

§4 Main Algorithm

The main algorithm has four steps. First is to normalize each input string, second is to produce an array of collation elements for each string, and third is to produce a sort key for each string from the collation elements. Two sort keys can then be compared with a binary comparison; the result is the ordering for the original strings.

Step 1. Normalize each input string

Produce a normalized form of each input string, applying the following three steps:

Example:

input string: càb
normalized string: ca`b

Note: Step 1 is transforming the string into its canonical decomposition, as specified in The Unicode Standard, Version 3.0. The canonical decomposition is also called Normalization Form D (Unicode Technical Report #15).

Note:

If you are only handling a restricted subset of 10646/Unicode which does not include combining marks, you can omit Step 1.1. See the Implementation Notes.

To allow for such omission, the Default Unicode Collation Element Table does define values for characters (such as accented characters or Hangul Syllables) which would otherwise be normalized to a sequence of characters. For example,

Character

Collation Element

Name

00E0 "à" [06D9.0021.0002] LATIN SMALL LETTER A WITH GRAVE

Step 2. Produce an array of collation elements for each string

First we define a combining mark in a string to be blocked if there is another combining mark of the same canonical combining class or zero between it and its base character.

The collation element array is built by sequencing through the normalized form as follows:

Example:

normalized string: ca`b
collation element array: [0706.0020.0002], [06D9.0020.0002], [0000.0021.0002], [06EE.0020.0002]

Note: On some browsers the grave (`) will display as a single left quote. It should be read as a grave accent for the purposes of this document.

Note:

The reason for considering the extra combining marks C is that otherwise irrelevant characters could interfere with matches in the table. For example, suppose that the contraction <a, combining_ring> (= å) is ordered after z. If a string consists of the three characters <a, combining_ring, combining_cedilla>, then the normalized form is <a, combining_cedilla, combining_ring>, which separates the a from the combining_ring. If we didn't have the step of considering the extra combining marks, this string would compare incorrectly as after a and not after z.

If the desired ordering treats <a, combining_cedilla> as a contraction which should take precedence over <a, combining_ring>, then an additional mapping for the combination <a, combining_ring, combining_cedilla> can be introduced to produce this effect.

Note:  For conformance to Unicode canonical equivalence, only unblocked combining marks are matched. For example, <a, combining_macron, combining_ring> would compare as after a-macron, and not after z. As in the previous note, additional mappings can be added to customize behavior.

Step 3. Form a sort key for each string

The sort key is formed by successively appending weights from the collation element array. The weights are appended from each level in turn, from 1 to 3. (Backwards weights are inserted in reverse order.)

An implementation may allow the maximum level to be set to a smaller level than the available levels in the collation element array. For example, if the maximum level is set to 2, then level 3 and higher weights (including the normalized Unicode string) are not appended to the sort key. Thus any differences at levels 3 and higher will be ignored, leveling any such differences in string comparison.

Here is a more detailed statement of the algorithm:

Note:

The level separator is zero (0000), which is guaranteed to be lower than any weight in the resulting sort key. This guarantees that when two strings of unequal length are compared, where the shorter string is a prefix of the longer string, the longer string is always sorted after the shorter (in the absence of special features like contractions). For example:

"cab" < "cabX" where "X" can be any character(s)

Note: Even if two different Unicode strings do not differ in level 1, 2 or 3 weights, the addition of the normalized string at the end will distinguish between them, unless they normalize to precisely the same string. This is how the normalized string defines a fourth level.

Example:

collation element array: [0706.0020.0002], [06D9.0020.0002], [0000.0021.0002], [06EE.0020.0002]
sort key: 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002

Step 4. Compare the sort keys

Compare the sort keys for each of the imput strings, using a binary comparison. This means that:

Example:

String

Sort Key

cab 0706 06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002
Cab 0706 06D9 06EE 0000 0020 0020 0020 0000 0008 0002 0002
càb 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
dab 0712 06D9 06EE 0000 0020 0020 0020 0000 0002 0002 0002

In this example, "cab" <3 "Cab" <2 "càb" <1 "dab". The differences that produce the ordering are shown by the bold underlined items:

Note:

At this point we can explain the reason for disallowing ill-formed weights. If ill-formed weights were allowed, the ordering of elements can be incorrectly reflected in the sort key. For example, suppose the secondary weights of the Latin characters were zero (ignorable) and that (as normal) the primary weights of case-variants are equal: that is, a1 = A1. Then the following incorrect keys would be generated:

  1. "áe" = <a, acute, e> => [a1 e1 0000 acute2 0000 a3 acute3 e3...]
  2. "Aé" = <A, e, acute> => [a1 e1 0000 acute2 0000 A3 acute3 e3...]

Since the secondary weights for a, A, and e are lost in forming the sort key, the relative order of the acute is also lost, resulting in an incorrect ordering based solely on the case of A vs a. With well-formed weights, this does not happen, and you get the following correct ordering:

  1. "áe" = <A, e, acute> => [a1 e1 0000 a2 e2 acute2 0000 a3 acute3 e3...]
  2. "Aé" = <a, acute, e> => [a1 e1 0000 a2 acute2 e2 0000 A3 acute3 e3...]

However, there are circumstances--typically in expansions--where higher-level weights in collation elements can be zeroed (resulting in ill-formed collation elements) without consequence (see Implementation Notes). Implementations are free to do this as long as they produce the same result as with well-formed tables.

§5 Tailoring

Tailoring is any well-defined syntax that takes the Default Unicode Collation Element Table and produces another well-formed Unicode Collation Element Table. Such syntax will usually allow for the following capabilities:

  1. Reordering any character (or contraction) with respect to others in the standard ordering. Such a reordering can represent a Level 1 difference, Level 2 difference, Level 3 difference, or identity (in levels 1 to 3). Since such reordering include sequences, arbitrary multiple mappings can be specified.
  2. Setting levels to be backwards (French) or forwards (normal). Typically this is only needed for the secondary level.
  3. Set alternate weighting options.
  4. Customizing the exact list of rearranging characters.
  5. Customizing the exact list of variable collation elements.

§5.1 Preprocessing

In addition to tailoring, some implementation may choose to preprocess the text for special purposes. Once such preprocessing is done, the standard algorithm can be applied.

Examples include:

Such preprocessing is outside of the scope of this document.

§6 Implementation Notes

As noted above for efficiency, implementations may vary from this logical algorithm so long as they produce the same result. The following items discuss various techniques that can be used for reducing sort key length, and customizing for addtional environments. The topics include:

§6.1 Reducing Sort Key Lengths

Eliminating level separators. Level separators are not needed between two levels in the sort key, if the weights are properly chosen. For example, if all L3 weights are less than all L2 weights, then no level separator is needed between them. The actual numerical values chosen for weight values in the Default Unicode Collation Element Table meet this criterion, so that by default no level separators are required between Level 1 and Level 2 or between Level 2 and Level 3 to produce proper sort keys. For example, here is a sort key with these level separators removed.

String

Sort Key

càb (0) 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
càb (1) 0706 06D9 06EE 0020 0020 0021 0020 0002 0002 0002 0002

L2/L3 in 8 bits. The L2 and L3 weights commonly are small values. Where that condition occurs for all possible values, they can then be represented as single 8-bit quantities. Here is the above example with both these changes (and grouping by bytes).

String

Sort Key

càb (0) 07 06 06 D9 06 EE 00 00 00 20 00 20 00 21 00 20 00 00 00 02 00 02 00 02 00 02
càb (1,2) 07 06 06 D9 06 EE 20 20 21 20 02 02 02 02

Machine words. The sort key can be represented as an array of different quantities depending on the machine architecture. For example, comparisons as arrays of 32-bit quantities may be much faster on some machines. If this is done, the orginal is to be padded with trailing zeros as necessary.

String

Sort Key

càb (1,2) 07 06 06 D9 06 EE 20 20 21 20 02 02 02 02
càb (1,2,3) 070606D9 06EE2020 21200202 02020000

Run-length Compression. Generally sort keys don't differ much in the secondary or tertiary weights, so you tend to end up with keys with a lot of repetition. This also occurs with quarternary weights generated with the shifted parameter. By the structure of the collation element tables, there are also many weights that are never assigned at a given level in the sort key. For example, no tertiary values are greater than 18 in the Default Unicode Collation Element Table. You can take advantage of these regularities in these sequences to compact the length--while retaining the same sort sequence--by using the following technique. This is a logical statement of the process: the actual implementation can be much faster and performed as the sort key is being generated.

Examples: low weights 01, 02; high weights FE, FF; common weight 77

Original Weights

Compressed Weights

01
02
77 01
77 02
77 77 01
77 77 02
77 77 77 01
77 77 77 02
...
77 77 77 FE
77 77 77 FF
77 77 FE
77 77 FF
77 FE
77 FF
FE
FF
01
02
03 01
03 02
04 01
04 02
05 01
05 02
...
FB FF
FB FF
FC FF
FC FF
FD FF
FD FF
FF
FF

The result of this process are keys that are never greater than the original, are generally much shorter, and result in the same comparisons.

§6.2 Large Weight Values

If a collation sequence requires more than 65,535 weight values (or 65,024 values where zero bytes are avoided), this can still be accomodated by using multiple collation elements for a single character. For example, suppose that 50,000 UTF-16 (surrogate) private use characters are assigned in a particular implementation, and that these are to be sorted after MAX1. Simply assign them all dual collation elements of the form [(MAX1+1).0000.0000], [yyyy.zzzz.wwww]. They will then sort properly with respect to each other and to the rest of the characters. (The first collation element is one of the instances where ill-formed collation elements are allowed. Since the second collation element is well-formed and the first element will only occur in combination, ordering is preserved.)

§6.3 Reducing Table Sizes

Contiguous weight ranges. The Default Unicode Collation Element Table has secondary weights that are greater than 00FF. This is the result of the derivation described in Weight Derivation. However, these values can be compacted to a range of values that don't exceed 00FF. Whenever collation elements have different primary weights, the ordering of their secondary weights is immaterial. Thus all of the secondaries that share a single primary can be renumbered to a contiguous range without affecting the resulting order.

For example, for the primary value 0820 (for the letter O), there are 31 distinct secondary values ranging from 0020 to 012D. These can be renumbered to the contiguous range from 0020 to 003F, which is less than 00FF.

Escape hatch. Although the secondary and tertiary weights for the Default Unicode Collation Element Table can both fit within one byte, of course, any particular tailored table could conceivably end up with secondary or tertiary weights that exceed what can be contained in a single byte. However, the same technique used for large weight values can also be used for implementations that do not want to handle more than 00FF values for a particular weight.

For example, the Java collation implementation only stores 8-bit quantities in level 2 and level 3. However, characters can be given L2 or L3 weights with greater values by using a series of two collation elements. For example, with characters requiring 2000 weights at L2, then 248 characters can be given single keys, while 1792 are given 2 collation keys of the form [yyyy.00zz.00ww] [0000.00nn.0000]. (The 248 can be chosen to be the higher frequency characters!)

Leveraging Unicode tables. Since the collation element(s) of a decomposable character in the standard table can be algorithmically generated from the decomposition, no collation elements need to be stored for decomposable characters: the collation elements can be generated on the fly. The collation elements for the Han characters (unless tailored) can also be algorithmically derived; no collation elements need to be stored for them either. This means that only a fraction of the total number of Unicode characters needs to have an explicit collation element associated with them.

Memory table size. Applying the above techniques, an implementation can thus safely pack all of the data for a collation element into a single 32-bit quantity: 16 for the primary, 8 for the secondary and 8 for the tertiary. Then applying techniques such as the Two-Stage table approach described in Section 5.7 of The Unicode Standard, Version 2.0, the mapping table from characters to collation elements can both fast and small. This has very important implications for implementations.

§6.4 Avoiding Zero Bytes

If the resulting sort key is to be a C-string, then zero bytes must be avoided. This can be done by:

§6.5 Avoiding Normalization

Implementations that do not handle separate combining marks can map precomposed characters (such as "à") to collation elements with different Level 2 weights for the different accents. This can give the same results as long as no separate combining marks are in the string.

§6.6 Case Comparisons

In some languages, it is common to sort lowercase before uppercase; in other languages this is reversed. Often this is more dependent on the individual concerned, and is not standard across a single language. It is strongly recommended that implementations provide parameterization that allow uppercase to be sorted before lowercase, and provide information as to the standard (if any) for particular countries. This can easily be done to the Default Unicode Collation Element Table before tailoring by remapping the L3 weights (see Weight Dervation). It can be done after tailoring by finding the case pairs and swapping the collation elements.

§6.7 Incremental Comparison

Implementations do not actually have to produce full sort keys. Collation elements can be generated as needed from two strings, and compared with an algorithm that produces the same results as sort keys would have.

However, it is very tricky to produce an incremental comparison that produces correct results. For example, some implementations have not even been transitive! Be sure to test any code for incremental comparison thoroughly.

§6.8 Searching

The collation elements can also be used for searching, so that a proper native-language match is produced. For example, "ß" will properly match against "ss". Notice that the lengths of the matching strings may differ! Users of search algorithms should be allowed to modify the comparison strength, thus excluding differences at less significant levels. This is especially useful for searching, but can also apply to comparison.

Excluding differences at Level 3 has the effect of ignoring case and compatibility format distinctions between letters when searching. Excluding differences at Level 2 has the effect of ignoring accentual distinctions when searching.

§6.9 Catching Mismatches

Sort keys from two different tailored collations cannot be compared, since the weights may end up being rearranged arbitrarily. To catch this case, implementations can produce a hash value from the collation data, and prepend it to the sort key. Except in extremely rare circumstances, this will distinguish the sort keys. The implementation then has the opportunity to signal an error.

§6.10 Tailoring Example: Java

Java 2 implements a number of the tailoring features described in this document. The following summarizes these features (for more information, see Collator on http://java.sun.com/products/jdk/1.2/docs/api/).

  1. Java doesn't use a default table in the Unicode Collation Element format: instead it always uses a tailoring syntax. Here is a description of the entries:
    Syntax Description Action: set CEx,1 so that:
     & y < x Make x primary-greater than y CEx,1 = CEy,1 + 1;
    CEx,2 = 1;
    CEx,3 = 1;
     & y ; x Make x secondary-greater than y CEx,1 = CEy,1;
    CEx,2 = CEy,2 + 1;
    CEx,3 = 1;
     & y , x Make x tertiary-greater than y CEx,1 = CEy,1;
    CEx,2 = CEy,2;
    CEx,3 = CEy,3 + 1;
     & y = x Make x equal to y CEx,1 = CEy,1;
    CEx,2 = CEy,2;
    CEx,3 = CEy,3;

    Either x or y can be more than one character, to handle contractions and expansions. In that case, the actions become more complicated. NULL is completely ignorable, so by using the above operations, various levels of ignorable characters can be specified. This is used instead of the variable weight mechanism.
     
  2. Entries can be abbreviated in a number of ways. They do not need to be separated by newlines. Characters can be specified directly, instead of using their hexadecimal Unicode values. Wherever you have rules of the form "x < y & y < z", you can omit "& y", leaving just "x < y < z". These can be done successively, so the following are equivalent in ordering.

    Java

    Unicode Collation Element Table

     a, A ; à, À < b, B
    0061 ; [.0001.0001.0001] % a
    0040 ; [.0001.0001.0002] % A
    00E0 ; [.0001.0002.0001] % à
    00C0 ; [.0001.0002.0002] % à
    0042 ; [.0002.0001.0001] % b
    0062 ; [.0002.0001.0002] % B

     
  3. In the current release version of Java, rearrangement for Thai is not supported, and the default ordering is a subset of the Unicode Default Collation Table, with some additional differences. Characters not in the default ordering are sorted by Unicode code value. A fourth level is formed by appending the normalized Unicode string.

§7 Weight Derivation

This section describes the generation of the Unicode Default Unicode Collation Element Table. Certain weights of characters are algorithmically generated from the Unicode Character Database on ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt (documented in ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html).

§7.1 Ideographs

CJK Ideographs are given a primary weight equal to their Unicode character value. Their secondary and tertiary values are the respective minimal values for those levels: in the Default Unicode Collation Element Table, these are 0020 and 0002 respectively.

§7.2 Tertiary Weights for Non-Decomposibles

Characters without decompositions are given tertiary weights according to their General Category in the Unicode Character Database:

 General Category

 Weight

 Lu (uppercase Letter)  0008
 anything else  0002

§7.2 Secondary Weights for Decomposibles

Characters whose decompositions are of the form base character + multiple non-spacing mark(s) have primary and tertiary weights which are the same as those of the base character. They are given a secondary weight which is derived from the secondary weights of the sequence of non-spacing marks according to a Secondary Weight Table.

The Secondary Weight Table is derived in the following way.

For example, here is the derivation for 1F83 GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI in the Default Unicode Collation Element Table. Notice that the base character's original secondary weight (0020 = MIN2) is completely disregarded in this process.

Character

Collation Element

Resulting Collation Element

Name

03B1 [0961.0020.0002] [0961.0020.0002] GREEK SMALL LETTER ALPHA
+ 0314 + [0000.0034.0002]   COMBINING REVERSED COMMA ABOVE
= Dasia
map(0020,0034) => 0034 [0961.0034.0002]
+ 0300 + [0000.0021.0002]   COMBINING GRAVE ACCENT = Varia
map(0034,0021) => 011B [0961.011B.0002]
+ 0345 + [0000.0061.0002]   COMBINING GREEK YPOGEGRAMMENI
= Iota subscript
map(011B,0061) => 0131 [0961.0131.0002]

If a Collation Element Table has been tailored so that non-spacing marks are reordered, then this process would result in a different set of weights in the Secondary Weight Table.

§7.3 Other Tertiary Weights

Characters with other decompositions are given tertiary weights based upon their compatibility tag. The collation element is the one or more collation elements of the decomposition, with the tertiary weight reset according to the following table:

Tertiary Weight Table

Original Weight

Compatility Tag

Resulting Weight

 0002  NONE  0002
 0002  <wide>  0003
 0002  <narrow>  0004
 0002  <compat>  0005
 0002  <font>  0006
 0002  <circle>  0007
 0008  NONE  0008
 0008  <wide>  0009
 0008  <compat>  000A
 0008  <font>  000B
 0008  <circle>  000C
 0002  <super>  000D
 0002  <sub>  000E
 0002  <small>  000F
 0002  <vertical>  0010
 0002  <initial>  0011
 0002  <medial>  0012
 0002  <final>  0013
 0002  <isolated>  0014
 0002  <noBreak>  0015
 0002  <square>  0016
 0008  <square>  0017
 0002  <fraction>  0018

For example, from the Unicode Character Property database, SQUARE PA AMPS has a decomposition of 0070 0041, with a compatibility tag of <square>. The collation elements for this sequence is [.083C.0020.0002] [.06D9.0020.0008]. The tertiary weights are then reset according to the Tertiary Weight Table, resulting in the collation elements [.083C.0020.0016][.06D9.0020.0017].

§7.4 Digits

Digits are given special-case handling. The primary value is assigned based on the numerical value of the digit, and secondary values are derived from the script differences. Numerics that have compatibility equivalents are given weights based upon those equivalents; for example, ROMAN NUMERAL IX is given weights near those of the sequence "I", "X". Numerics that are neither digits nor have compatibility equivalents are treated as miscellaneous symbols.

§8 Frequently Asked Questions

Q: Does JIS require tailorings?

A: The Default Unicode Collation Element Table uses the Unicode order for CJK ideographs (Kanji). This represents a radical-stroke ordering for the characters in JIS levels 1 and 2. If a different order is needed, such as an exact match to binary JIS order for these characters, that can be achieved with tailoring.

Q: How are Hiragana readings handled for Kanji?

A: There is no algorithmic mapping from Kanji characters to the phonetic readings for those characters, because there is too much linguistic variation. The common practice for sorting in a database by reading is to store the reading in a separate field, and construct the sort keys from the readings.

Q: Is transitive consistency maintained?

A: Yes, for any strings A, B, and C, if A < B and B < C, then A < C. However, implementors must be careful to produce implementations that accurately reproduce the results of the Unicode Collation Algorithm as they optimize their own algorithms. It is easy to perform careless optimizations--especially with Incremental Comparison algorithms--that fail this test. Other items to check are the proper distinction between the bases of accents, such as in the following example:

<u-macron, u-diaeresis-macron> <2 <u-macron-diaeresis, u-macron>.

Q: How are mixed Japanese and Chinese handled?

A: The Unicode Collation Algorithm specifies how collation works for a single context. In this respect, mixed Japanese and Chinese are no different than mixed Swedish and German, or any other languages that use the same characters. Generally, the customers using a particular collation will want text sorted uniformally, no matter what the source. Japanese customers would want them sorted in the Japanese fashion, etc. There are contexts where foreign words are called out separately and sorted in a separate group with different collation conventions. Such cases would require the source fields to be tagged with the type of desired collation (or tagged with a language, which is then used to look up an associated collation).

Q: Are the half-width katakana properly interleaved with the full-width?

A: Yes, the Default Unicode Collation Element Table properly interleaves half-width katakana, full-width katakana, and full-width hiragana. They also interleave the voicing and semi-voicing marks correctly, whether they are precomposed or not.

Q: How are names in a database sorted properly?

A: An important feature of international sorting is that it makes a big difference whether strings are considered in isolation or not. Suppose that your database is sorted first by last name, then by first name. Since they are sorted first, a secondary or tertiary difference in the last name will completely swamp a primary difference in the first name. So "Zelda Casares" will sort before "Albert Cásares". If this behavior is not desired, then the database should be sorted by a constructed field which contains last name + ',' + first name. This will end up sorting the record with "Cásares, Albert" before the one with "Casares, Zelda".

Q: Can the katakana length mark be handled properly, as described in Section 5.XX of the Unicode Standard?

A: Yes, by using a combination of contraction and expansion, the length mark can be sorted differently, according to the vowel of the previous katakana character.

Q: Is this done in the Default Unicode Collation Element Table?

A: No, the Default Unicode Collation Element Table does not currently have any contractions.


Copyright

Copyright © 1998-1999 Unicode, Inc. All Rights Reserved.

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions. Java is a trademark of Sun Microsystems, Inc.