Unicode Frequently Asked Questions

Collation

Q: My script does not sort right because the characters were assigned to Unicode code points in the wrong order. What can I do about that?

A: There is a misunderstanding here: Linguistically meaningful sorting is done not by comparing code point values (an approach which would fail even for English), but by assigning multi-level weights to characters or sequences of characters and then comparing those weights on each level. There are many algorithms and implementations for this; the standard Unicode Collation Algorithm (UCA) comes with a default weight table for all assigned characters as well as a tailoring mechanism that describes how this table can be modified to conform to local conventions, where necessary. [MS]
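
As a rough illustration of the difference, the sketch below sorts the same words twice: once by raw code unit values and once with a multi-level collator. It uses the JDK's java.text.Collator, which is rule-based but is not claimed here to be a conformant UCA implementation; the class and method names are made up for this example.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; only java.text.Collator and java.util are real APIs.
final class CodePointVsCollation {
    /** Sorts by raw UTF-16 code unit values, which is what String.compareTo does. */
    static String[] sortByCodeUnits(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Sorts with the JDK's multi-level collator for English. */
    static String[] sortByCollator(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(Locale.ENGLISH));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"banana", "Zebra", "apple", "Cherry"};
        // Uppercase letters have smaller code points than lowercase ones,
        // so binary order puts "Cherry" and "Zebra" ahead of "apple".
        System.out.println(Arrays.toString(sortByCodeUnits(words)));
        // The collator compares letters first and treats case as a lower-level difference.
        System.out.println(Arrays.toString(sortByCollator(words)));
    }
}
```

Even for plain English words, the binary order separates uppercase from lowercase, which is why the answer above says that approach fails.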

Q: How should collations be made available?

A: Ideally, people should be able to specify a collation order for any set of data returned by a database query and sorted by a SQL 'ORDER BY' clause. Actual database implementations may differ in how they surface the choice of collations to users. It should also be possible to specify a collation for any comparison of strings (e.g. s1 < s2), unless a strictly binary comparison is intended. People should also be able to use collations for loose matching and string searching. For more information, see: http://www.unicode.org/reports/tr10/ [MD]
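
One possible shape for loose matching and searching is sketched below with the JDK's java.text.Collator, comparing at primary strength so that case and accent differences are ignored. The class and method names are hypothetical, and the search is deliberately naive: it only tries windows with the same number of UTF-16 code units as the pattern, which a real collation-aware search would not assume.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class; a sketch, not a complete search implementation.
final class LooseMatching {
    /** Builds an English collator at the requested strength. */
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        // Decompose accented letters so base letter and accent are weighted separately.
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        c.setStrength(strength);
        return c;
    }

    /** True if the strings differ at most by accents or case (primary strength). */
    static boolean looselyEqual(String a, String b) {
        return collator(Collator.PRIMARY).compare(a, b) == 0;
    }

    /** Index of the first same-length window that loosely equals the pattern, or -1. */
    static int looseIndexOf(String text, String pattern) {
        Collator c = collator(Collator.PRIMARY);
        int n = pattern.length();
        for (int i = 0; i + n <= text.length(); i++) {
            if (c.compare(text.substring(i, i + n), pattern) == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String accented = "R\u00e9sum\u00e9"; // "Resume" with two e-acute letters
        System.out.println(looselyEqual("resume", accented));
        System.out.println(looseIndexOf("Send your " + accented + " today", "resume"));
    }
}
```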

Q: Where can I find out more information on how Java does collation?

A: Search for the RuleBasedCollator class at http://java.sun.com/j2se/1.4.2/docs/api/. For a C/C++ version, you can also look at ICU: http://www.icu-project.org/. [MD]

Q: How should the collation to be used be specified, taking current implementations into account?

A: To specify a collation, clients should be able to specify a locale (e.g. collate as in "de_DE"), tailoring rules (as in Java or ICU), or both. Java and ICU also allow merging: e.g. French + Arabic + tailoring. [MD]
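
A minimal sketch of the two options in Java follows: choosing a collation by locale, and starting from a locale's rules and appending tailoring rules. The class and method names are hypothetical. The locale is passed as a BCP 47 tag such as "de-DE" rather than "de_DE", and the cast to RuleBasedCollator assumes a JDK whose Collator.getInstance returns that class, as the OpenJDK implementation does.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical helper class illustrating "locale, tailoring rules, or both".
final class CollationChoice {
    /** Option 1: pick a collation by locale identifier (BCP 47 tag, e.g. "de-DE"). */
    static Collator forLocale(String languageTag) {
        return Collator.getInstance(Locale.forLanguageTag(languageTag));
    }

    /** Option 2: take a locale's rules as the base and append tailoring rules. */
    static Collator tailored(String languageTag, String extraRules) {
        // Assumption: the JDK returns a RuleBasedCollator for this locale.
        RuleBasedCollator base = (RuleBasedCollator) forLocale(languageTag);
        try {
            return new RuleBasedCollator(base.getRules() + extraRules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid tailoring rules: " + extraRules, e);
        }
    }

    public static void main(String[] args) {
        // Tailoring that makes "ch" a separate letter sorting after "c".
        String chAsLetter = "& C < ch, cH, Ch, CH";
        System.out.println(forLocale("en").compare("chico", "cuna") < 0);
        System.out.println(tailored("en", chAsLetter).compare("chico", "cuna") > 0);
    }
}
```

With the tailoring applied, "ch" is matched as one unit that sorts after every word beginning with plain "c", so "cuna" comes before "chico".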

Q: UTS #10 Unicode Collation Algorithm is defined with a particular base version of the Unicode Standard, but I am using characters from a later version of Unicode. What shall I do?

A: You can update to a later version of the Unicode Collation Algorithm, which will update its base version to the latest version of the Unicode Standard itself. The UTC is committed to ensuring that the Unicode Collation Algorithm is updated in a timely manner, so that the repertoire of characters in the Default Unicode Collation Element Table stays in sync with the Unicode Standard. However, if you need to stay with a particular version of the Unicode Collation Algorithm for any reason, such as maintaining binary compatibility of generated key weights, note that the algorithm assigns a default sorting order to every valid code point, assigned or unassigned. Any characters that are not defined in the base version repertoire will be given derived, implicit weights in code point order after all of the assigned characters. See 7.1 Derived Collation Elements for more details. [MD] & [KW]
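
The derivation can be sketched as follows. The formula and the base value FBC0 (used for code points that are not unified ideographs, which includes unassigned ones) are taken from the implicit weights section of UTS #10 rather than from this FAQ, so check them against the version of the algorithm you use. The class and method names are hypothetical.

```java
// Hypothetical class; a sketch of the general implicit-weight derivation in UTS #10.
final class ImplicitWeights {
    /** Base for code points that are not unified ideographs (assumption: per UTS #10). */
    static final int BASE_OTHER = 0xFBC0;

    /**
     * Returns {AAAA, BBBB}, the primary weights of the two derived collation
     * elements [.AAAA.0020.0002][.BBBB.0000.0000] for the given code point.
     */
    static int[] derive(int base, int codePoint) {
        int aaaa = base + (codePoint >> 15);          // high bits select the lead weight
        int bbbb = (codePoint & 0x7FFF) | 0x8000;     // low 15 bits, with the top bit set
        return new int[] {aaaa, bbbb};
    }

    public static void main(String[] args) {
        int[] w = derive(BASE_OTHER, 0x50005);
        System.out.printf("%04X %04X%n", w[0], w[1]);
    }
}
```

Because the pair (AAAA, BBBB) grows with the code point value, characters handled this way end up in code point order, as the answer above describes.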

Q: Is transitive consistency maintained by the UCA?

A: Yes, for any strings A, B, and C, if A < B and B < C, then A < C. However, implementers must take care that their optimized algorithms still reproduce the results of the Unicode Collation Algorithm exactly. It is easy to perform careless optimizations (especially with Incremental Comparison algorithms) that fail this test. Another item to check is the proper distinction between the bases of accents. For example, the sequence <u-macron, u-diaeresis-macron> should compare as less than <u-macron-diaeresis, u-macron>; this is a secondary distinction, based on the weighting of the accents, which must be correctly associated with the primary weights of their respective base letters. [MD]

Q: Does JIS require tailorings?

A: The Default Unicode Collation Element Table uses the Unicode order for CJK ideographs (Kanji). This represents a radical-stroke ordering for the characters in JIS levels 1 and 2. If a different order is needed, such as an exact match to binary JIS order for these characters, that can be achieved with tailoring. [MD]

Q: How are Hiragana readings handled for Kanji?

A: There is no algorithmic mapping from Kanji characters to the phonetic readings for those characters, because there is too much linguistic variation. The common practice for sorting in a database by reading is to store the reading in a separate field, and construct the sort keys from the readings. [MD]
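
A minimal sketch of that practice: each record carries the Kanji and its reading as separate fields, and the sort keys are built from the reading only. The sample names are 山田 (やまだ), 鈴木 (すずき), 田中 (たなか), and 青木 (あおき), written as Unicode escapes in the code. The class and field names are hypothetical, and the example relies on whatever Japanese collation data the JDK in use provides.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical record layout: the reading is stored alongside the Kanji, not derived from it.
final class SortByReading {
    static final class Entry {
        final String kanji;
        final String reading;

        Entry(String kanji, String reading) {
            this.kanji = kanji;
            this.reading = reading;
        }
    }

    /** Returns the Kanji fields ordered by collation keys built from the readings. */
    static List<String> sortedKanji(List<Entry> entries) {
        Collator collator = Collator.getInstance(Locale.JAPANESE);
        Comparator<Entry> byReading =
                Comparator.comparing((Entry e) -> collator.getCollationKey(e.reading));
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(byReading);
        List<String> out = new ArrayList<>();
        for (Entry e : copy) {
            out.add(e.kanji);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> names = Arrays.asList(
                new Entry("\u5c71\u7530", "\u3084\u307e\u3060"),  // Yamada
                new Entry("\u9234\u6728", "\u3059\u305a\u304d"),  // Suzuki
                new Entry("\u7530\u4e2d", "\u305f\u306a\u304b"),  // Tanaka
                new Entry("\u9752\u6728", "\u3042\u304a\u304d")); // Aoki
        System.out.println(sortedKanji(names)); // Aoki, Suzuki, Tanaka, Yamada
    }
}
```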

Q: How are mixed Japanese and Chinese handled?

A: The Unicode Collation Algorithm specifies how collation works for a single context. In this respect, mixed Japanese and Chinese are no different than mixed Swedish and German, or any other languages that use the same characters. Generally, the customers using a particular collation will want text sorted uniformly, no matter what the source. Japanese customers would want all of the text sorted in the Japanese fashion, and likewise for other languages. There are contexts where foreign words are called out separately and sorted in a separate group with different collation conventions. Such cases would require the source fields to be tagged with the type of desired collation (or tagged with a language, which is then used to look up an associated collation). [MD]

Q: Are the half-width katakana properly interleaved with the full-width?

A: Yes, the Default Unicode Collation Element Table properly interleaves half-width katakana, full-width katakana, and full-width hiragana. It also interleaves the voicing and semi-voicing marks correctly, whether they are precomposed or not. [MD]

Q: Can the katakana length mark be handled properly?

A: Yes, by using a combination of contraction and expansion, the length mark can be tailored to sort according to the vowel of the previous katakana character. For a description of the phenomenon involved and how to handle it, see Contextual Sensitivity. [MD]

Q: How are names in a database sorted properly?

A: International sorting depends on multi-level comparison of differences in strings, so it matters whether strings in one field are sorted first and strings in a second field are sorted subsequently, or whether a single sort considers both fields together. Suppose that your database is sorted first by last name, then by first name. Because the last names are compared first, a secondary or tertiary difference in the last name will completely swamp a primary difference in the first name. So "Zelda Casares" will sort before "Albert Cásares". If this behavior is not desired, then the database should be sorted by a constructed field which contains last name + ',' + first name. This will end up sorting the record with "Cásares, Albert" before the one with "Casares, Zelda". [MD]
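
The two approaches can be compared side by side in a short sketch. It uses the JDK's java.text.Collator at its default strength with canonical decomposition; the class, field, and method names are hypothetical, and the accented name is written with a Unicode escape.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo of field-by-field sorting versus sorting on one constructed field.
final class NameSorting {
    static final class Person {
        final String first;
        final String last;

        Person(String first, String last) {
            this.first = first;
            this.last = last;
        }

        String constructed() {
            return last + ", " + first;
        }
    }

    static Collator collator() {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c;
    }

    /** Compares last names on all levels first, and first names only on a tie. */
    static List<String> sortFieldByField(List<Person> people) {
        Collator c = collator();
        List<Person> copy = new ArrayList<>(people);
        copy.sort((a, b) -> {
            int byLast = c.compare(a.last, b.last);
            return byLast != 0 ? byLast : c.compare(a.first, b.first);
        });
        return constructedNames(copy);
    }

    /** Compares the single constructed field "last, first". */
    static List<String> sortByConstructedField(List<Person> people) {
        Collator c = collator();
        List<Person> copy = new ArrayList<>(people);
        copy.sort((a, b) -> c.compare(a.constructed(), b.constructed()));
        return constructedNames(copy);
    }

    private static List<String> constructedNames(List<Person> people) {
        List<String> out = new ArrayList<>();
        for (Person p : people) {
            out.add(p.constructed());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person("Zelda", "Casares"),
                new Person("Albert", "C\u00e1sares")); // a-acute in the last name
        System.out.println(sortFieldByField(people));       // Zelda's record first
        System.out.println(sortByConstructedField(people)); // Albert's record first
    }
}
```

In the constructed field, the accent is only a secondary difference, so the primary difference between "Albert" and "Zelda" decides the order.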

Q: How can I use the Unicode Collation Algorithm for a stable sort?

A: A stable sort is one where records that compare as identical come out in the same order as they had in the input. To achieve this, the easiest way is to append an index number for each record to the sort key for that record. Whether that sort key comes from strings, other data, or a concatenation of sort keys, it will then produce a stable sort. Further information about stable sorts and related topics can be found in Deterministic Sorting. [MD]
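
One way to sketch the idea in Java is to pair each record's collation key with its original index and use the index as the final tie-breaker, which has the same effect as appending the index to the key. The class names are hypothetical. The entries are shuffled before sorting only to show that the result does not depend on the sort routine preserving input order.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Random;

// Hypothetical demo: collation key plus record index gives a stable result.
final class StableCollation {
    static final class IndexedKey implements Comparable<IndexedKey> {
        final CollationKey key;
        final int index;

        IndexedKey(CollationKey key, int index) {
            this.key = key;
            this.index = index;
        }

        @Override
        public int compareTo(IndexedKey other) {
            int byKey = key.compareTo(other.key);
            // The index only matters when the collation keys are identical.
            return byKey != 0 ? byKey : Integer.compare(index, other.index);
        }
    }

    static List<String> stableSort(List<String> records, Collator collator) {
        List<IndexedKey> keys = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            keys.add(new IndexedKey(collator.getCollationKey(records.get(i)), i));
        }
        // Scramble the entries to stand in for a sort that is not itself stable.
        Collections.shuffle(keys, new Random(42));
        Collections.sort(keys);
        List<String> out = new ArrayList<>();
        for (IndexedKey k : keys) {
            out.add(records.get(k.index));
        }
        return out;
    }

    public static void main(String[] args) {
        Collator primaryOnly = Collator.getInstance(Locale.ENGLISH);
        primaryOnly.setStrength(Collator.PRIMARY); // "A" and "a" get identical keys
        System.out.println(stableSort(Arrays.asList("b", "A", "a", "B", "a"), primaryOnly));
    }
}
```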

Q: What are the differences between the UCA and ISO 14651?

A: Very broadly, the UCA includes the following features that are not part of ISO 14651. This is only a sketch; for details see http://www.unicode.org/reports/tr10/.

  • a much more thorough introduction to multilingual sorting issues

  • much more information about performance and implementation practices

  • how to apply collation to searching and matching

  • uniform handling of canonical equivalents

  • automatic rearrangement for Thai, Lao

  • variable weighting (allowing punctuation to be ignored or not)

  • the completely ignorable characters and irrelevant combining characters don't interfere with contractions

  • well-formedness criteria for tables (disallowing tables that would produce peculiar results, e.g. where X and Y don't contract, X < Y and yet XY == YX) [MD]

Q: What can you tell me about searching and sorting with Braille?

A: The individual Braille patterns are not tied to specific characters. A pattern that represents an "A" for English might represent a completely different letter or symbol or ideograph for another language. Therefore, search and sort engines cannot assume that the underlying meaning of any individual Braille pattern is fixed. It can and will vary by language, greatly affecting how searching and sorting rules are defined, and how strings that contain Braille patterns are interpreted. [SO]

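Q: In my language, "ch" usually sorts like a separate letter. If I want a foreign word to sort without this happening, how do I do it?

A: Use the COMBINING GRAPHEME JOINER (CGJ, U+034F), as described in Characters and Combining Marks. Placing it between the "c" and the "h" keeps the pair from being treated as a single unit for sorting.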