
Collation

One of the most common things that processes implementing the Unicode Standard need to do is compare Unicode strings. As with ASCII or any other character encoding, Unicode strings can simply be compared by their binary code values. Linguistically relevant string comparison, however, requires a more sophisticated approach: one that takes casing and accents into account, ignores certain characters, and so on.
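The difference between the two kinds of comparison can be illustrated with a minimal Python sketch. The helper `naive_key` is a hypothetical stand-in that merely strips accents and case; it is not the Unicode Collation Algorithm and is shown only to contrast with plain code point order.

```python
import unicodedata

words = ["Zebra", "apple", "\u00c9clair", "banana"]  # "\u00c9" is É

def naive_key(text):
    # Hypothetical helper, NOT the UCA: decompose, drop combining marks,
    # then casefold, so that accents and case are ignored entirely.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

# Binary comparison: Python orders str values by code point.
by_code_point = sorted(words)
# A crude approximation of a linguistically motivated order.
by_naive_key = sorted(words, key=naive_key)

print(by_code_point)  # ['Zebra', 'apple', 'banana', 'Éclair']
print(by_naive_key)   # ['apple', 'banana', 'Éclair', 'Zebra']
```

In code point order, every uppercase ASCII letter precedes every lowercase one, and accented letters sort after all of ASCII, which is rarely what a reader of a sorted list expects.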

The Unicode Consortium has published a separate standard devoted specifically to this issue of string comparison, or collation: UTS #10, Unicode Collation Algorithm (UCA). That algorithm provides a complete specification of how to generate collation keys for Unicode strings. Those collation keys can then be compared directly to determine the relative order of the strings they were generated from. Collation keys can also be used in string matching and searching operations.
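The idea of a collation key can be sketched with a toy example. The weight table `TOY_WEIGHTS` and the function `toy_sort_key` below are hypothetical and the weights are invented for illustration; they are not taken from the DUCET, and the sketch omits normalization, contractions, variable weighting, and the other steps that UTS #10 specifies. It only shows the general shape: all primary weights first, then a level separator, then secondary weights, then tertiary weights.

```python
# Hypothetical toy table: (primary, secondary, tertiary) weight per character.
# Invented values for illustration only; NOT the DUCET.
TOY_WEIGHTS = {
    "a": (1, 1, 1),
    "A": (1, 1, 2),       # same letter as "a", differs at the tertiary (case) level
    "\u00e1": (1, 2, 1),  # "á": same letter as "a", differs at the secondary (accent) level
    "b": (2, 1, 1),
    "B": (2, 1, 2),
}

def toy_sort_key(text):
    """Build a multi-level key: primaries, 0, secondaries, 0, tertiaries."""
    elements = [TOY_WEIGHTS[ch] for ch in text]
    key = []
    for level in range(3):
        if level > 0:
            key.append(0)  # level separator, lower than any real weight
        key.extend(element[level] for element in elements)
    return tuple(key)

words = ["b", "A", "\u00e1", "a", "ab", "Ab"]
print(toy_sort_key("ab"))               # (1, 2, 0, 1, 1, 0, 1, 1)
print(sorted(words, key=toy_sort_key))  # ['a', 'A', 'á', 'ab', 'Ab', 'b']
```

Because the keys are plain sequences of numbers, comparing two keys needs no knowledge of the original strings. Base-letter differences dominate, and accent and case differences only break ties.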

Committees Responsible for Collation

The Unicode Technical Committee is responsible for the maintenance of both the Unicode Collation Algorithm and the Default Unicode Collation Element Table (DUCET), which provides all the basic collation key weighting information used by the algorithm.

The CLDR Technical Committee is responsible for maintaining information about language-specific tailoring of the Unicode Collation Algorithm, for example a Swedish-specific collation, a Czech-specific collation, and so forth. Such information is specified, using the CLDR tailoring syntax, in the Common Locale Data Repository (CLDR).
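What a tailoring changes can be sketched in a few lines of Python. The sketch relies on one fact about Swedish, namely that å, ä, and ö are separate letters sorted after z. Everything else is an assumption made for illustration: the names `untailored_like_key` and `swedish_like_key` are hypothetical, the code does not use the CLDR tailoring syntax or CLDR data, and it handles only lowercase-foldable letters of the Swedish alphabet.

```python
import unicodedata

# Assumed alphabet for this sketch: a-z followed by å, ä, ö.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def untailored_like_key(word):
    # Crude stand-in for an untailored order: treat accented letters
    # as variants of their base letters (NOT the real DUCET order).
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def swedish_like_key(word):
    # Hypothetical tailored order: rank each letter by its alphabet position.
    composed = unicodedata.normalize("NFC", word).casefold()
    return [SWEDISH_RANK[ch] for ch in composed]

words = ["zebra", "\u00e4rm", "apa", "\u00f6ga"]  # "ärm", "öga"
print(sorted(words, key=untailored_like_key))  # ['apa', 'ärm', 'öga', 'zebra']
print(sorted(words, key=swedish_like_key))     # ['apa', 'zebra', 'ärm', 'öga']
```

The same four words end up in a different order once the language-specific rule is applied, which is the kind of difference a tailoring records.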

Policies Regarding Collation

The Unicode Technical Committee (UTC) has defined detailed policies that it uses in the maintenance of the DUCET for the Unicode Collation Algorithm.

The first set of policies covers constraints on how the existing DUCET can be changed. Those can be found in Change Management for the Unicode Collation Algorithm.

The second set of policies specifies criteria by which initial collation weights are assigned to characters newly added to the Unicode Standard. Those can be found in UCA Default Table Criteria for New Characters.