Unicode Frequently Asked Questions

Collation

Q: My script does not sort right because the characters were assigned to Unicode code points in the wrong order. What can I do about that?

A: There is a misunderstanding here: Linguistically meaningful sorting is done not by comparing code point values (an approach which would fail even for English), but by assigning multi-level weights to characters or sequences of characters and then comparing those weights on each level. There are many algorithms and implementations for this; the standard Unicode Collation Algorithm (UCA) comes with a default weight table for all assigned characters as well as a tailoring mechanism that describes how this table can be modified to conform to local conventions, where necessary. [MS]
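The difference between code point order and multi-level comparison can be illustrated with a small Python sketch. The key function below is a toy and not the UCA: it assumes that base letters form the primary level, accents the secondary level, and case the tertiary level. The name toy_sort_key is invented for this illustration.

```python
import unicodedata

def toy_sort_key(text):
    """Toy three-level key (NOT the UCA): letters, then accents, then case."""
    primary = []    # base letters, ignoring case and accents
    secondary = []  # accents attached to each base letter
    tertiary = []   # 0 for lowercase, 1 for uppercase
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Associate the accent with the most recent base letter.
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.lower())
        secondary.append(())
        tertiary.append(1 if ch.isupper() else 0)
    # Tuples compare level by level: a later level only matters
    # when all earlier levels are equal.
    return (primary, secondary, tertiary)

words = ["Zebra", "apple", "r\u00e9sum\u00e9", "resume", "Apple", "zebra"]

by_code_point = sorted(words)
by_levels = sorted(words, key=toy_sort_key)

print(by_code_point)  # all uppercase words come before all lowercase words
print(by_levels)      # words are grouped by letters first, then accents, then case
```

Code point order puts "Zebra" before "apple", which is why it fails even for English. A real implementation would take its weights from the UCA default table or a tailoring of it rather than from character properties.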

Q: How should collations be made available?

A: Ideally, people should be able to specify a collation order for any set of data returned by a database query and sorted by a SQL 'ORDER BY' clause. Actual database implementations may differ in how they surface the choice of collations to users. It should also be possible to specify a collation for any comparison of strings (e.g. s1 < s2), unless a strictly binary comparison is intended. People should also be able to use collations for loose matching and string searching. For more information, see: http://www.unicode.org/reports/tr10/

Q: Where can I find more information on how Java does collation?

A: Search for the RuleBasedCollator class at http://java.sun.com/j2se/1.4.2/docs/api/. For a C/C++ version, you can also look at: http://www.icu-project.org/.

Q: How should the collation to be used be specified, taking current implementations into account?

A: To specify a collation, clients should be able to specify either a locale (e.g. collate as in "de_DE"), or tailoring rules (as in Java or ICU), or both. Java and ICU also allow merging, e.g. French + Arabic + tailoring.

Q: UTS #10 Unicode Collation Algorithm is defined for a particular version of the Unicode Standard, but I am using characters from a later version of Unicode. What shall I do?

A: You can update to a later version of the Unicode Collation Algorithm, which will be synchronized with a later version of the Unicode Standard. The UTC is committed to ensuring that the Unicode Collation Algorithm is updated in a timely manner, so that the repertoire of characters in the Default Unicode Collation Element Table stays in sync with the Unicode Standard. However, if you need to stay with a particular version of the Unicode Collation Algorithm for any reason, such as maintaining binary compatibility of generated key weights, note that the algorithm does assign a default sorting order to every valid code point, assigned or unassigned. Any characters that are not assigned in the repertoire for that version will be given derived, implicit weights in code point order after all of the assigned characters. See 7.1 Derived Collation Elements for more details.
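As a rough illustration of how implicit weights preserve code point order, here is a Python sketch of the derivation as given in the implicit weights section of recent versions of UTS #10. The base value 0xFBC0 is the one used there for unassigned code points; the function names are invented for this example, and the constants should be checked against the version of UTS #10 you actually use.

```python
UNASSIGNED_BASE = 0xFBC0  # base for unassigned code points (assumed from UTS #10)

def implicit_primary_weights(code_point, base=UNASSIGNED_BASE):
    """Split a code point into two 16-bit primary weights."""
    aaaa = base + (code_point >> 15)
    bbbb = (code_point & 0x7FFF) | 0x8000
    return (aaaa, bbbb)

def implicit_collation_elements(code_point, base=UNASSIGNED_BASE):
    """Return the two derived collation elements as (primary, secondary, tertiary)."""
    aaaa, bbbb = implicit_primary_weights(code_point, base)
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]

# U+50000 is used here only as an example of a code point with no table entry.
for cp in (0x50000, 0x50001, 0x58000):
    print(f"U+{cp:04X}", [f"{w:04X}" for w in implicit_primary_weights(cp)])
```

Because the high bits go into the first weight and the low bits into the second, comparing the pairs gives the same result as comparing the code points themselves.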

Q: Is transitive consistency maintained by the UCA?

A: Yes, for any strings A, B, and C, if A < B and B < C, then A < C. However, implementers must be careful to produce implementations that accurately reproduce the results of the Unicode Collation Algorithm as they optimize their own algorithms. It is easy to perform careless optimizations (especially with Incremental Comparison algorithms) that fail this test. Another item to check is the proper distinction between the bases of accents. For example, the sequence <u-macron, u-diaeresis-macron> should compare as less than <u-macron-diaeresis, u-macron>; this is a secondary distinction, based on the weighting of the accents, which must be correctly associated with the primary weights of their respective base letters.
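One way to catch an optimization that breaks transitivity is a brute-force check over a set of test strings. The following Python sketch assumes a comparison function that returns a negative, zero, or positive number; the function names and the deliberately broken comparator are invented for the example.

```python
from itertools import permutations

def find_transitivity_violation(strings, compare):
    """Return the first (a, b, c) with a < b and b < c but not a < c, or None."""
    for a, b, c in permutations(strings, 3):
        if compare(a, b) < 0 and compare(b, c) < 0 and not compare(a, c) < 0:
            return (a, b, c)
    return None

def code_point_compare(a, b):
    """A transitive comparison (plain code point order)."""
    return (a > b) - (a < b)

def cyclic_compare(a, b):
    """A deliberately broken comparison: a < b, b < c, and c < a."""
    order = "abc"
    diff = (order.index(b) - order.index(a)) % 3
    return {0: 0, 1: -1, 2: 1}[diff]

print(find_transitivity_violation(["a", "b", "c"], code_point_compare))
print(find_transitivity_violation(["a", "b", "c"], cyclic_compare))
```

The check is cubic in the number of strings, so it is only practical for a modest test set, but such a set can include the accent sequences mentioned above.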

Q: Does JIS require tailorings?

A: The Default Unicode Collation Element Table uses the Unicode order for CJK ideographs (Kanji). This represents a radical-stroke ordering for the characters in JIS levels 1 and 2. If a different order is needed, such as an exact match to binary JIS order for these characters, that can be achieved with tailoring.

Q: How are Hiragana readings handled for Kanji?

A: There is no algorithmic mapping from Kanji characters to the phonetic readings for those characters, because there is too much linguistic variation. The common practice for sorting in a database by reading is to store the reading in a separate field, and construct the sort keys from the readings.
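The practice of sorting by a stored reading can be sketched in Python as follows. The record layout and field names are assumptions made for this example, and plain code point order of the Hiragana reading stands in for a real Japanese collation key.

```python
# Each record stores the name as written and, separately, its reading.
records = [
    {"name": "山田", "reading": "やまだ"},
    {"name": "佐藤", "reading": "さとう"},
    {"name": "鈴木", "reading": "すずき"},
    {"name": "田中", "reading": "たなか"},
]

def reading_sort_key(record):
    # Stand-in for a collation sort key built from the reading field.
    return record["reading"]

sorted_names = [r["name"] for r in sorted(records, key=reading_sort_key)]
print(sorted_names)
```

The sort key is built only from the reading, so the Kanji spelling plays no part in the ordering.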

Q: How are mixed Japanese and Chinese handled?

A: The Unicode Collation Algorithm specifies how collation works for a single context. In this respect, mixed Japanese and Chinese are no different from mixed Swedish and German, or any other languages that use the same characters. Generally, the customers using a particular collation will want text sorted uniformly, no matter what the source. Japanese customers, for example, would want everything sorted in the Japanese fashion. There are contexts where foreign words are called out separately and sorted in a separate group with different collation conventions. Such cases would require the source fields to be tagged with the type of desired collation (or tagged with a language, which is then used to look up an associated collation).

Q: Are the half-width katakana properly interleaved with the full-width?

A: Yes, the Default Unicode Collation Element Table properly interleaves half-width katakana, full-width katakana, and full-width hiragana. It also interleaves the voicing and semi-voicing marks correctly, whether they are precomposed or not.

Q: Can the katakana length mark be handled properly?

A: Yes, by using a combination of contraction and expansion, the length mark can be tailored to sort according to the vowel of the previous katakana character. For a description of the phenomenon involved and how to handle it, see Contextual Sensitivity.
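The effect of such a tailoring can be approximated by a preprocessing step, sketched here in Python. This is not how a collation tailoring is written: it simply replaces each length mark with the vowel of the preceding katakana, which loses the lower-level difference a real tailoring would keep between the length mark and the vowel itself. The function name is invented, and the vowel is read off the last letter of the Unicode character name.

```python
import unicodedata

LENGTH_MARK = "\u30FC"  # KATAKANA-HIRAGANA PROLONGED SOUND MARK
VOWEL_KANA = {
    "A": "\u30A2",  # KATAKANA LETTER A
    "I": "\u30A4",  # KATAKANA LETTER I
    "U": "\u30A6",  # KATAKANA LETTER U
    "E": "\u30A8",  # KATAKANA LETTER E
    "O": "\u30AA",  # KATAKANA LETTER O
}

def expand_length_marks(text):
    """Replace each length mark with the vowel of the preceding katakana."""
    out = []
    for ch in text:
        if ch == LENGTH_MARK and out:
            name = unicodedata.name(out[-1], "")
            # Names look like "KATAKANA LETTER KA"; the last letter is the vowel.
            if name.startswith("KATAKANA LETTER") and name[-1] in VOWEL_KANA:
                out.append(VOWEL_KANA[name[-1]])
                continue
        out.append(ch)  # anything else is kept unchanged
    return "".join(out)

print(expand_length_marks("\u30B3\u30FC\u30D2\u30FC"))  # input is KO, mark, HI, mark
```

A length mark with no preceding katakana, or one following a kana without a vowel such as the syllabic N, is left as it is.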

Q: How are names in a database sorted properly?

A: In international sorting, it will make a difference whether strings in one field are sorted first and strings in a second field are sorted subsequently, or whether a single sort is done considering both fields together. This is because international sorting uses multi-level comparison of differences in strings. Suppose that your database is sorted first by family name, then by given name. Since family names are sorted first, a secondary or tertiary difference in the family name will completely swamp a primary difference in the given name. So {field1=Casares, field2=Zelda} will sort before {field1=Cásares, field2=Albert}.

This is not the typically desired behavior. The database should be sorted by a constructed field which contains family name + <separator> + given name. Typical historical practice was to use a ',' as the separator. However, that does not work for collation sequences that ignore punctuation. A better option, which is in CLDR 1.9 or later, is to use U+FFFE as this separator. CLDR tailors this code point to sort before any other base character, for exactly this purpose, so that the record with {field1=Cásares, field2=Albert} sorts before the record with {field1=Casares, field2=Zelda}.

For more information on this topic, see Interleaved Levels.
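The difference between sorting field by field and sorting on a merged field can be seen in a short Python sketch. The key function is a two-level toy, not the UCA or CLDR: it assumes letters form the primary level and accents the secondary level, and it gives U+FFFE a primary weight lower than any letter, imitating the CLDR tailoring described above.

```python
import unicodedata

SEPARATOR = "\uFFFE"

def toy_key(text):
    """Toy two-level key (NOT the UCA): primary letters, then accents."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        # The separator gets the lowest primary weight of all.
        primary.append(0 if ch == SEPARATOR else ord(ch.lower()))
        secondary.append(())
    return (primary, secondary)

records = [("C\u00e1sares", "Albert"), ("Casares", "Zelda")]

# Family name first, given name only as a tie-breaker.
field_by_field = sorted(records, key=lambda r: (toy_key(r[0]), toy_key(r[1])))

# One constructed field: family name + separator + given name.
merged = sorted(records, key=lambda r: toy_key(r[0] + SEPARATOR + r[1]))

print(field_by_field)  # the accent difference in the family name decides
print(merged)          # the primary difference in the given name decides
```

In the merged key, all primary weights of both fields are compared before any accent difference is considered, which is what puts Albert ahead of Zelda.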

Q: How can I use the Unicode Collation Algorithm for a stable sort?

A: A stable sort is one where records that compare as equal come out in the same relative order they had originally. The easiest way to achieve this is to append an index number for each record to the sort key for that record. Whether that sort key comes from strings, other data, or a concatenation of sort keys, it will then produce a stable sort. Further information about stable sorts and related topics can be found in Deterministic Sorting.
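A minimal Python sketch of the index technique follows. Case folding stands in for a real collation sort key, and the list is reversed before sorting only to show that the result does not depend on the order in which the sorting routine happens to see the records.

```python
records = ["b", "A", "a", "B"]

def toy_sort_key(text):
    # Stand-in for a collation sort key; "A" and "a" get the same key.
    return text.casefold()

# Attach the original position to each key so that no two keys are equal.
keyed = [(toy_sort_key(r), index, r) for index, r in enumerate(records)]

keyed.reverse()  # simulate an arbitrary intermediate order
keyed.sort(key=lambda item: (item[0], item[1]))

stable_result = [r for _, _, r in keyed]
print(stable_result)
```

Python's built-in sort happens to be stable already; the appended index matters for sorting routines that make no such guarantee.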

Q: What are the differences between the UCA and ISO 14651?

A: Very broadly, the UCA includes the following features that are not part of ISO 14651. This is only a sketch; for details see http://www.unicode.org/reports/tr10/.

  • a much more thorough introduction to multilingual sorting issues

  • much more information about performance and implementation practices

  • how to apply collation to searching and matching

  • uniform handling of canonical equivalents

  • variable weighting (allowing punctuation to be ignored or not)

  • irrelevant combining characters don't interfere with contractions

  • well-formedness criteria for tables (disallowing tables that would produce peculiar results, e.g. where X and Y don't contract, X < Y and yet XY == YX)

Q: What can you tell me about searching and sorting with Braille?

A: The individual Braille patterns are not tied to specific characters. A pattern that represents an "A" for English might represent a completely different letter or symbol or ideograph for another language. Therefore, search and sort engines cannot assume that the underlying meaning of any individual Braille pattern is fixed. It can and will vary by language, greatly affecting how searching and sorting rules are defined, and how strings that contain Braille patterns are interpreted. [SO]

Q: In my language, "ch" usually sorts like a separate letter. If I want a foreign word to sort without this happening, how do I do it?

A: Insert the combining grapheme joiner (CGJ, U+034F) between the "c" and the "h", as described in Characters and Combining Marks.
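The blocking effect of the CGJ can be illustrated with a toy Python key function. It assumes a tailoring in which "ch" is a contraction that sorts after every plain "c" and before "d", and it treats the CGJ as contributing no weight of its own; the weights and the function name are invented for the example.

```python
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def toy_primary_key(word):
    """Toy primary key with a "ch" contraction (NOT a real tailoring)."""
    weights = []
    i = 0
    while i < len(word):
        if word[i] == CGJ:
            i += 1  # ignored for weighting, but it has already split "c" from "h"
            continue
        if word.startswith("ch", i):
            weights.append((ord("c"), 1))  # after plain "c", before "d"
            i += 2
        else:
            weights.append((ord(word[i]), 0))
            i += 1
    return weights

native = sorted(["chico", "cuna"], key=toy_primary_key)
foreign = sorted(["c" + CGJ + "hico", "cuna"], key=toy_primary_key)

print(native)   # "ch" acts as one letter, so "cuna" comes first
print(foreign)  # with the CGJ, "c" and "h" are weighted separately
```

Because the CGJ sits between the two letters, the contraction no longer matches, and the word sorts under plain "c" followed by "h".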

Q: What policies constrain allowable changes to UCA between versions?

A: The UTC has established a number of policies which help to keep the UCA and its associated data table (DUCET) stable, even as the UCA is updated to stay in sync with additions to the Unicode Standard. First, there are policies which define how collation weights should be established for newly assigned characters and scripts. Those can be found in UCA Default Criteria for New Characters. There are also policies which limit the kinds of changes which can be made for characters already in the DUCET, and which define how potential updates should be specified and tracked. Those can be found in Change Management for the Unicode Collation Algorithm.