
UCA Default Table Criteria for New Characters

This page explains the criteria that the Unicode Technical Committee (UTC) uses when deciding how to create initial orderings for the large collections of new characters added to the Default Unicode Collation Element Table (DUCET) in each new minor or major version of the Unicode Standard. See Unicode Collation Algorithm for information about the algorithm itself and technical details regarding the format and use of the DUCET.

Criteria for Ordering New Scripts

1. When a new script is added to the standard, the establishment of its primary ordering should, as much as possible, be based on information provided with the Summary Proposal Form and other supporting documents for the proposed encoding.

2. Failing that, or given ambiguity in the proposal documentation, primary ordering should be based on whatever lexicographical evidence can be gathered for the language that is best documented, most widely used, or both for that script.

3. If a script is in multilingual use and has character extensions provided for specific languages, then following the choice of primary order for the first language (by criterion 2), weights for character extensions should be interpolated so as to get the ordering for other languages (if known) as correct as possible without requiring tailoring.

4. If characters with accents are included, then the accents should be given secondary weights unless overriding concerns based on established practice for primary letter weighting dictate otherwise.

5. If characters with distinctions comparable to case are included, then the case (or presentation form) differences should be given tertiary weights unless overriding concerns based on established practice for ordering dictate otherwise.

6. Weighting for digits, symbols, and punctuation in a new script should, as much as possible, follow the established patterns in the DUCET for other scripts, so as not to introduce idiosyncratic treatments of such characters on a script-by-script basis.

7. In some instances, particularly for historic scripts, there may be no established native lexicographical order, or none documented well enough to be usable. In such cases, a primary order based simply on code point order in the charts, or on a well-known academic catalog order for the characters, may be an acceptable way of placing the characters in the DUCET.

8. The impact on the overall size and complexity of the DUCET also needs to be considered when adding collation weights for a new script. Particularly complex approaches to the specific weighting for a new script should be avoided if they would have a significant impact on the table's use for all other scripts and languages, even if that approach might produce a marginally better default ordering for the new script.
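
Criteria 4 and 5 can be illustrated with a minimal sketch of multi-level comparison. The table below is a toy: the names TOY_TABLE and sort_key and all numeric weights are invented for illustration and are not taken from the DUCET, and the key construction is a simplification of what the Unicode Collation Algorithm actually specifies.

```python
# Toy collation elements: (primary, secondary, tertiary).
# All weights are invented for illustration; they are NOT real DUCET values.
TOY_TABLE = {
    "a": (1, 0, 0),
    "A": (1, 0, 1),        # case difference only -> tertiary (criterion 5)
    "\u00e1": (1, 1, 0),   # a with acute: accent difference -> secondary (criterion 4)
    "\u00c1": (1, 1, 1),   # A with acute
    "b": (2, 0, 0),
    "B": (2, 0, 1),
}


def sort_key(text):
    """Build a simplified three-level key: all primaries, then all
    secondaries, then all tertiaries."""
    elements = [TOY_TABLE[ch] for ch in text]
    primaries = tuple(e[0] for e in elements)
    secondaries = tuple(e[1] for e in elements)
    tertiaries = tuple(e[2] for e in elements)
    return (primaries, secondaries, tertiaries)


words = ["ab", "Ab", "\u00e1b", "b", "aB", "\u00e1a"]
sorted_words = sorted(words, key=sort_key)
print(sorted_words)
```

In this toy ordering, the accented word "\u00e1a" sorts before "ab" because the accent matters only at the secondary level, while the primary comparison (a, a versus a, b) has already decided the result. Case differences come into play only when the primary and secondary levels are both equal.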

Criteria for Addition of Small Numbers of Characters to Existing Collections

9. As much as possible, when adding characters to scripts (or other collections) already in the DUCET (for example, small numbers of additional Latin, Cyrillic, or Arabic characters), weights for such characters should be interpolated in the table following the predominant principles of ordering already established in the table for that script. This is to minimize the chances that such characters will simply get lost in the table by being ordered in some haphazard, ad hoc manner for the script. (Thus if a z-like character with some overlay diacritic is added to the table, it should be weighted as much as possible like other z-like characters with diacritics.)

10. In most instances, characters added after the fact for a script, in support of some small, minority language use or specialized orthography, will be added in full knowledge that a tailoring of the DUCET will be necessary in order to support ordering for that language or specialized orthography. However, in certain limited cases, it may be appropriate to give such an additional character a primary order different from the one criterion 9 would choose, if it is known that the character is used only for that language or specialized orthography. Such cases should remain exceptions rather than the usual practice.

11. When additional characters have formal decomposition mappings in the standard, their collation weight should simply be derived automatically from the decomposition, unless there is a clear, overriding reason to do otherwise. This is because overriding the decomposition always complicates, at least marginally, the process of regenerating the DUCET, may often introduce unanticipated edge cases or interactions with other weights, and is seldom sufficient to produce a "perfect" ordering.

12. Additional sets of punctuation or other symbols that fall into clear classes that have been grouped together in the DUCET should be grouped, as much as possible, with like characters already present in the DUCET. Thus if a new quotation mark of some sort is added, it should be grouped with the existing batch of quotation marks in the table. This eases maintenance and will make sense for some kinds of ordering, even though for most lexicographical sorting, punctuation and such symbols are basically ignored.

13. Other symbols should simply default to getting weights based on the code point order, along with the existing collection of otherwise unclassified symbols.
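
Criterion 11, and the z-like example in criterion 9, can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to generate the DUCET: TOY_WEIGHTS, derived_elements, and sort_key are hypothetical names, and the weights are invented. Only unicodedata.normalize from the Python standard library is a real API here; it supplies the canonical decomposition.

```python
import unicodedata

# Toy collation elements: (primary, secondary, tertiary).
# All weights are invented for illustration; they are NOT real DUCET values.
TOY_WEIGHTS = {
    "z": (26, 1, 1),
    "\u0301": (0, 2, 1),   # COMBINING ACUTE ACCENT: no primary weight
    "\u030c": (0, 3, 1),   # COMBINING CARON: no primary weight
}


def derived_elements(ch):
    """Derive collation elements for a character from its canonical
    decomposition instead of assigning it a weight by hand."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [TOY_WEIGHTS[part] for part in decomposed]


def sort_key(text):
    """Simplified three-level key; zero weights are skipped at each level."""
    elements = [e for ch in text for e in derived_elements(ch)]
    return tuple(
        tuple(e[level] for e in elements if e[level] != 0)
        for level in range(3)
    )


z_caron = derived_elements("\u017e")   # LATIN SMALL LETTER Z WITH CARON
ordered = sorted(["\u017e", "z", "\u017a"], key=sort_key)
print(z_caron, ordered)
```

Because the weights come from the decomposition, the accented letters automatically share the primary weight of plain z and differ from it only at the secondary level, so they stay grouped with the other z-like characters without any hand-assigned entry.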