
Unicode Technical Standard #35

Unicode Locale Data Markup Language (LDML)
Part 5: Collation

Version 23
Editors Mark Davis (markdavis@google.com) and other CLDR committee members
Date 2013-03-15
This Version http://www.unicode.org/reports/tr35/tr35-31/tr35.html
Previous Version http://www.unicode.org/reports/tr35/tr35-29.html
Latest Version http://www.unicode.org/reports/tr35/
Corrigenda http://unicode.org/cldr/corrigenda.html
Latest Proposed Update http://www.unicode.org/reports/tr35/proposed.html
Namespace http://cldr.unicode.org/
DTDs http://unicode.org/cldr/dtd/23/
Revision 31

Summary

This document describes parts of an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Unicode Common Locale Data Repository.

This is a partial document, describing only those parts of the LDML that are relevant for collation (sorting, searching & grouping). For the other parts of the LDML see the main LDML document and the links above.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].


1 CLDR Collation

Collation is the general term for the process and function of determining the sorting order of strings of characters, for example for lists of strings presented to users, or in databases for sorting and selecting records.

Collation varies by language, by application (some languages use special phonebook sorting), and other criteria (for example, phonetic vs. visual).

CLDR provides collation data for many languages and styles. The data supports not only sorting but also language-sensitive searching and grouping under index headers. All CLDR collations are based on the [UCA] default order, with common modifications applied in the CLDR root collation, and further tailored for language and style as needed.

2 Root Collation

The CLDR root collation order is based on the UCA default table defined in UTS #10: Unicode Collation Algorithm [UCA]. It is used by all other locales by default, or as the base for their tailorings. (For a chart view of the UCA, see Collation Chart [UCAChart].)

Starting with CLDR 1.9, CLDR uses modified tables for the root collation order.

The root locale ordering is tailored in the following ways:

Grouping classes of characters

As of Version 6.1.0, the DUCET puts characters into the following ordering:

(There are a few exceptions to this general ordering.)

The CLDR root locale modifies the DUCET tailoring by ordering the common characters more strictly by category:

The regrouping allows users to parametrically reorder the groups. For example, users can reorder numbers after all scripts, or reorder Greek before Latin.

The relative order within each of these groups still matches the DUCET. Symbols, punctuation, and numbers that are grouped with a particular script stay with that script. The differences between CLDR and the DUCET order are:

  1. CLDR groups the numbers together after currency symbols, instead of splitting them with some before and some after. Thus the following are put after currencies and just before all the other numbers.

    U+09F4 ( ৴ ) [No] BENGALI CURRENCY NUMERATOR ONE
    ...
    U+1D371 ( 𝍱 ) [No] COUNTING ROD TENS DIGIT NINE

  2. CLDR handles a few other characters differently
    1. U+10A7F ( 𐩿 ) [Po] OLD SOUTH ARABIAN NUMERIC INDICATOR is put with punctuation, not symbols
    2. U+20A8 ( ₨ ) [Sc] RUPEE SIGN and U+FDFC ( ﷼ ) [Sc] RIAL SIGN are put with currency signs, not with R and REH.

Non-variable symbols

There are multiple Variable-Weighting options in the UCA for symbols and punctuation, including non-ignorable and shifted. With the shifted option, almost all symbols and punctuation are ignored—except at a fourth level. The CLDR root locale ordering is modified so that symbols are not affected by the shifted option. That is, by default, symbols are not “variable” in CLDR. So shifted only causes whitespace and punctuation to be ignored, but not symbols (like ♥). The DUCET behavior can be specified with a locale ID using the "vt" keyword, to set the Variable section to include all of the symbols below it, or be set parametrically where implementations allow access.

See also:

Additional contractions for Tibetan

Ten contractions are added for Tibetan: Two to fulfill well-formedness condition 5, and eight more to preserve the default order for Tibetan. For details see UTS #10, Section 3.6.4, Well-Formedness of the DUCET.

Tailored noncharacter weights

U+FFFE and U+FFFF have special tailorings:

U+FFFF: This code point is tailored to have a primary weight higher than all other characters. This allows the reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings starting with "sch" or equivalent.

U+FFFE: This code point produces a CE with special minimal weights on all levels, regardless of alternate handling. This allows for Merging Sort Keys within code point space. For example, when sorting names in a database, a sortable string can be formed with last_name + '\uFFFE' + first_name. These strings would sort properly, without ever comparing the last part of a last name with the first part of another first name.
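The effect of the two tailored noncharacters can be illustrated with a toy model. The following is only a sketch: toy_key is a hypothetical stand-in for a real collator sort key (it simply gives U+FFFE the lowest weight and U+FFFF the highest, and is case-sensitive), and the names merged and in_prefix_range are illustrative, not part of CLDR or any library.

```python
SEP = "\uFFFE"  # tailored in the CLDR root to minimal weights on all levels
MAX = "\uFFFF"  # tailored in the CLDR root to the highest primary weight


def toy_key(s):
    """Hypothetical sort key: U+FFFE sorts lowest, U+FFFF highest,
    every other character by code point. Not real collation weights."""
    return tuple(
        0 if ch == SEP else 0x110001 if ch == MAX else ord(ch) + 1
        for ch in s
    )


def merged(last_name, first_name):
    """Build one sortable string from two fields, as described above."""
    return last_name + SEP + first_name


def in_prefix_range(s, prefix):
    """True if prefix <= s <= prefix + U+FFFF under the toy ordering."""
    return toy_key(prefix) <= toy_key(s) <= toy_key(prefix + MAX)


people = [("Ab", "c"), ("A", "d")]

# With the separator, the shorter last name "A" sorts first.
by_merged = sorted(people, key=lambda p: toy_key(merged(*p)))

# Plain concatenation compares "b" (end of a last name) with "d"
# (start of a first name) and puts "Ab" first.
by_concat = sorted(people, key=lambda p: toy_key(p[0] + p[1]))
```

In this toy ordering, in_prefix_range("Schmidt", "Sch") holds while in_prefix_range("Sci", "Sch") does not. A real implementation would obtain its keys from a collator that uses the CLDR root collation.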

In CLDR, so as to maintain the special collation elements, U+FFFE..U+FFFF are not further tailorable, and nothing can tailor to them. That is, neither can occur in a collation rule: for example, the following rules are illegal:

& \uFFFF < x

& x <\uFFFF

Note:

2.1 Root Collation Data Files

The CLDR root collation data files are in the CLDR repository and release, under the path common/uca/.

Comments with DUCET-style weights in files other than allkeys_CLDR.txt use the weights defined in allkeys_CLDR.txt.

2.2 Root Collation Data File Formats

The file formats may change between versions of CLDR. The formats for CLDR 23 and beyond are as follows. As usual, text after a # is a comment.

allkeys_CLDR.txt

This file defines CLDR’s tailoring of the DUCET, as described in Section 2, Root Collation.

The format is similar to that of allkeys.txt, although there may be some differences in whitespace.

FractionalUCA.txt

The format is illustrated by the following sample lines, with commentary afterwards.

[UCA version = 6.0.0]

Provides the version number of the UCA table.

0000; [,,]     # Zyyy Cc       [0000.0000.0000]        * <NULL>

Provides a weight line. The first element (before the ";") is a hex codepoint sequence. The second field is a sequence of collation elements. Each collation element has 3 parts separated by commas: the primary weight, secondary weight, and tertiary weight. The tertiary weight actually consists of two components: the top two bits (0xC0) are used for the case level, and should be masked off where a case level is not used.

A weight is either empty (meaning a zero or ignorable weight) or is a sequence of one or more bytes. The bytes are interpreted as a "fraction", meaning that the ordering is 04 < 05 05 < 06. The weights are constructed so that no weight is an initial subsequence of another: that is, having both the weights 05 and 05 05 is illegal. The above line consists of all ignorable weights.
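The "fraction" interpretation corresponds to ordinary lexicographic comparison of byte strings. The sketch below shows this, together with a check of the rule that no weight may be an initial subsequence of another. The function names are illustrative assumptions, not part of any CLDR tool.

```python
def parse_weight(text):
    """Turn a weight such as "05 05" into bytes; an empty string is ignorable."""
    return bytes.fromhex(text.strip())


def is_prefix_free(weights):
    """Return True if no non-empty weight is an initial subsequence of another.

    After sorting, a weight that is a prefix of some other weight is always
    a prefix of its immediate successor, so adjacent pairs suffice.
    """
    ordered = sorted({w for w in weights if w})
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


w04 = parse_weight("04")
w0505 = parse_weight("05 05")
w06 = parse_weight("06")
```

Here w04 < w0505 < w06 holds under Python's bytes comparison, and a set containing both 05 and 05 05 fails the prefix check.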

The vertical bar (“|”) character is used to indicate context, as in:

006C | 00B7; [, DB A9, 05]
This example indicates that if U+00B7 appears immediately after U+006C, it is given the corresponding collation element instead. This syntax is equivalent to the following contraction, but is more efficient.
006C 00B7; CE(006C) [, DB A9, 05]

Single-byte primary weights are given to particularly frequent characters, such as space, digits, and a-z. Most characters are given two-byte weights, while relatively infrequent characters are given three-byte weights. For example:

...
0009; [03 05, 05, 05] # Zyyy Cc       [0100.0020.0002]        * <CHARACTER TABULATION>
...
1B60; [06 14 0C, 05, 05]    # Bali Po       [0111.0020.0002]        * BALINESE PAMENENG
...
0031; [14, 05, 05]    # Zyyy Nd       [149B.0020.0002]        * DIGIT ONE

The assignment of 2 vs 3 bytes does not reflect importance, or exact frequency.
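A weight line of the form shown above could be read roughly as follows. This is a minimal sketch under stated assumptions: it handles only plain weight lines (no "|" context prefix and no bracketed table lines), and parse_weight_line is a hypothetical helper name.

```python
import re

# One collation element: "[primary, secondary, tertiary]"
CE_PATTERN = re.compile(r"\[([^\[\]]*)\]")


def parse_weight_line(line):
    """Parse e.g. '0031; [14, 05, 05]  # comment' into
    ([0x31], [(b'\\x14', b'\\x05', b'\\x05')])."""
    data = line.split("#", 1)[0]              # text after '#' is a comment
    chars, _, elements = data.partition(";")
    code_points = [int(cp, 16) for cp in chars.split()]
    ces = []
    for body in CE_PATTERN.findall(elements):
        # An empty field means a zero (ignorable) weight.
        primary, secondary, tertiary = (
            bytes.fromhex(part.strip()) for part in body.split(",")
        )
        ces.append((primary, secondary, tertiary))
    return code_points, ces


digit_one = parse_weight_line(
    "0031; [14, 05, 05]    # Zyyy Nd       [149B.0020.0002]        * DIGIT ONE"
)
null = parse_weight_line("0000; [,,]     # Zyyy Cc       [0000.0000.0000]        * <NULL>")
```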

# SPECIAL MAX/MIN COLLATION ELEMENTS
FFFE; [02, 02, 02]     # Special LOWEST primary, for merge/interleaving
FFFF; [EF FE, 05, 05]  # Special HIGHEST primary, for ranges

The two tailored noncharacters have their own weights.

# SPECIAL FINAL VALUES for Script Reordering
FDD0 0042; [05 FE, 05, 05]     # Special final value for reordering token
FDD0 0043; [0C FE, 05, 05]     # Special final value for reordering token

There are special values assigned to code point sequences FDD0+X. These sequences are simply used to communicate special values, and can be eliminated. For the reordering values, the purpose is to make sure that there is a "high" weight at the end of each reordering group.

...
# HOMELESS COLLATION ELEMENTS
FDD0 0063; [, 97, 3D]       # [15E4.0020.0004] [1844.0020.0004] [0000.0041.001F]    * U+01C6 LATIN SMALL LETTER DZ WITH CARON
FDD0 0064; [, A7, 09]       # [15D1.0020.0004] [0000.0056.0004]     * U+1DD7 COMBINING LATIN SMALL LETTER C CEDILLA
FDD0 0065; [, B1, 09]       # [1644.0020.0004] [0000.0061.0004]     * U+A7A1 LATIN SMALL LETTER G WITH OBLIQUE STROKE

The DUCET has some weights that don't correspond directly to a character. To allow implementations to have a character associated with each weight (necessary for certain implementations of tailoring), special sequences are constructed for those weights.

Next, a number of tables are defined. The function of each of the tables is summarized afterwards.

# VALUES BASED ON UCA
...
[first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT
[last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032
[first implicit [E0 04 06, 05, 05]] # CONSTRUCTED
[last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED
[first trailing [E5, 05, 05]] # CONSTRUCTED
[last trailing [E5, 05, 05]] # CONSTRUCTED
...

This table summarizes ranges of important groups of characters for implementations.

# Top Byte => Reordering Tokens
[top_byte     00      TERMINATOR ]    #       [0]     TERMINATOR=1
[top_byte     01      LEVEL-SEPARATOR ]       #       [0]     LEVEL-SEPARATOR=1
[top_byte     02      FIELD-SEPARATOR ]       #       [0]     FIELD-SEPARATOR=1
[top_byte     03      SPACE ] #       [9]     SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1
...

This table maps from the first bytes of the fractional weights to a reordering token. The format is "[top_byte " byte-value reordering-token "COMPRESS"? "]". The "COMPRESS" value is present when there is only one byte in the reordering token, and primary-weight compression can be applied. Most reordering tokens are script values; others are special-purpose values, such as PUNCTUATION.

# Reordering Tokens => Top Bytes
[reorderingTokens     Arab    61=910 62=910 ]
[reorderingTokens     Armi    7A=22 ]
[reorderingTokens     Armn    5F=82 ]
[reorderingTokens     Avst    7A=54 ]
...

This table is an inverse mapping from reordering token to top byte(s). In terms like "61=910", the first value is the top byte, while the second is informational, indicating the number of primaries assigned with that top byte.
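The reorderingTokens lines can be read into a simple mapping, and the top_byte view recovered by inverting it. The following is a sketch assuming lines exactly in the sample format above. The function names are hypothetical.

```python
import re

LINE = re.compile(r"\[reorderingTokens\s+(\S+)\s+(.*?)\s*\]")


def parse_reordering_tokens(lines):
    """Map each reordering token to {top_byte: number_of_primaries}."""
    table = {}
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # not a reorderingTokens line
        token, entries = m.groups()
        table[token] = {
            int(top_byte, 16): int(count)
            for top_byte, count in (e.split("=") for e in entries.split())
        }
    return table


def tokens_by_top_byte(table):
    """Invert the mapping: top byte -> list of tokens sharing that byte."""
    inverse = {}
    for token, top_bytes in table.items():
        for top_byte in top_bytes:
            inverse.setdefault(top_byte, []).append(token)
    return inverse


sample = [
    "[reorderingTokens     Arab    61=910 62=910 ]",
    "[reorderingTokens     Armi    7A=22 ]",
    "[reorderingTokens     Armn    5F=82 ]",
    "[reorderingTokens     Avst    7A=54 ]",
]
table = parse_reordering_tokens(sample)
inverse = tokens_by_top_byte(table)
```

In the sample, Armi and Avst share the top byte 7A, which is the situation where several scripts occupy one lead byte.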

# General Categories => Top Byte
[categories   Cc      03{SPACE}=6 ]
[categories   Cf      77{Khmr Tale Talu Lana Cham Bali Java Mong Olck Cher Cans Ogam Runr Orkh Vaii Bamu}=2 ]
[categories   Lm      0D{SYMBOL}=25 0E{SYMBOL}=22 27{Latn}=12 28{Latn}=12 29{Latn}=12 2A{Latn}=12...

This table is informational, providing the top bytes, scripts, and primaries associated with each general category value.

# FIXED VALUES
[fixed first implicit byte E0]
[fixed last implicit byte E4]
[fixed first trail byte E5]
[fixed last trail byte EF]
[fixed first special byte F0]
[fixed last special byte FF]

The final table gives certain hard-coded byte values. The "trail" area is provided for implementation of the "trailing weights" as described in the UCA.

UCA_Rules.txt

The format for this file uses the CLDR basic collation syntax, see Section 3, Collation Tailorings.

3 Collation Tailorings

<!ELEMENT collations (alias | (default*, collation*, special*)) >

This element of the LDML format contains one or more collation elements, distinguished by type. Each collation contains rules that specify a certain sort-order, as a tailoring of the root order.

To allow implementations in reduced memory environments to use CJK sorting, there are also short forms of each of these collation sequences. These provide for the most common characters in common use, and are marked with alt="short".

There are two syntaxes for specifying collation rules: the basic collation syntax and the XML collation syntax. Both have the same functionality. The LDML files use the XML format, but the basic format is simpler to read, and will often be used in examples. Implementations of LDML, such as [ICUCollation], may choose to use the basic collation syntax as their native syntax.

Note:

3.1 Version

The version attribute is used in case a specific version of the UCA is to be specified. It is optional, and is specified if the results are to be identical on different systems. If it is not supplied, then the version is assumed to be the same as the Unicode version for the system as a whole. In general, tailorings should be defined so as to minimize dependence on the underlying UCA version, by explicitly specifying the behavior of all characters used to write the language in question.

Note: For version 3.1.1 of the UCA, the version of Unicode must also be specified with any versioning information; an example would be "3.1.1/3.2" for version 3.1.1 of the UCA, for version 3.2 of Unicode. This was changed by decision of the UTC, so that dual versions were no longer necessary. So for UCA 4.0 and beyond, the version just has a single number.

3.2 Collation Element

<!ELEMENT collation (alias | (base?, settings?, suppress_contractions?, optimize?, rules?, special*)) >

The tailoring syntax is designed to be independent of the actual weights used in any particular UCA table. That way the same rules can be applied to UCA versions over time, even if the underlying weights change. The following illustrates the overall structure of a collation with the XML syntax:

<collation>
 <settings caseLevel="on"/>
 <rules>
  <reset>c</reset>
  <p>k</p>
 </rules>
</collation>

The basic syntax corresponding to this would be:

[caseLevel on]
& c < k

The optional base element <base>...</base>, contains an alias element that points to another data source that defines a base collation. If present, it indicates that the settings and rules in the collation are modifications applied on top of the respective elements in the base collation. That is, any successive settings, where present, override what is in the base as described in Setting Options. Any successive rules are concatenated to the end of the rules in the base. The results of multiple rules applying to the same characters is covered in Orderings.

3.3 Setting Options

In XML syntax, these are attributes of <settings>. For example, <settings strength="secondary"/> will only compare strings based on their primary and secondary weights. In basic syntax, these are of the form [keyword value].

If the attribute is not present, the CLDR default (or the default for the locale, if there is one) is used. That default is listed in bold italics. Where there is a UCA default that is different, it is listed in bold with (UCA default). Note that the default value for a locale may be different than the default value for the attribute, so the defaults here are not defaults for the corresponding keywords.

The Example cells include an LDML example followed by the same example in basic syntax.

Collation Settings
BCP47 Key Attribute BCP47 Value Options Example  Description
ks strength level1 primary (1)
strength = "primary"

[strength 1]
Sets the default strength for comparison, as described in the [UCA]. Note that a strength setting of greater than 3 may have the same effect as identical, depending on the locale and implementation.
level2 secondary (2)
level3 tertiary (3)
level4 quaternary (4)
identic identical (5)
ka alternate noignore non-ignorable
alternate = "non-ignorable"

[alternate non-ignorable]
Sets alternate handling for variable weights, as described in [UCA], where "shifted" causes certain characters to be ignored in comparison. The default for LDML is different than it is in the UCA. In LDML, the default for alternate handling is non-ignorable, while in UCA it is shifted. In addition, in LDML only whitespace and punctuation are variable.
shifted shifted (UCA default)
n/a blanked
kb backwards true on backwards = "on"

[backwards 2]
Sets the comparison for the second level to be backwards, as described in [UCA].
false off
kk normalization true on (UCA default) normalization = "off"

[normalization off]
If on, then the normal [UCA] algorithm is used. If off, then all strings that are in [FCD] and do not contain any composite combining marks will sort correctly, but others will not necessarily sort correctly.
So it should only be set off if the strings to be compared are in FCD and do not contain any composite combining marks. Composite combining marks are: { U+0344, U+0F73, U+0F75, U+0F81 } [[:^lccc=0:]&[:toNFD=/../:]] (These characters must be decomposed for discontiguous contractions to work properly. Use of these characters is discouraged by the Unicode Standard.)
Note that the default for CLDR locales may be different than in the UCA. The rules for particular locales have it set to on: those locales whose exemplar characters (in forms commonly interchanged) would be affected by normalization.
false off
kc caseLevel true on caseLevel = "off"

[caseLevel on]
If set to on, a level consisting only of case characteristics will be inserted in front of tertiary level, as a "Level 2.5". To ignore accents but take case into account, set strength to primary and case level to on. For details, see Section 3.13, Case Parameters.
false off
kf caseFirst upper upper caseFirst = "off"

[caseFirst off]
If set to upper, causes upper case to sort before lower case. If set to lower, causes lower case to sort before upper case. Useful for locales that have already supported ordering but require different order of cases. Affects case and tertiary levels. For details, see Section 3.13, Case Parameters.
lower lower
false off
kh hiraganaQuaternary true on hiragana­Quaternary = "on"

[hiraganaQ on]
Controls special treatment of Hiragana code points on quaternary level. If turned on, Hiragana code points will get lower values than all the other non-variable code points in shifted. That is, the normal Level 4 value for a regular collation element is FFFF, as described in [UCA], Section 3.6.2, Variable Weighting. This is changed to FFFE for [:script=Hiragana:] characters. The strength must be greater than or equal to quaternary if this attribute is to have any effect.
false off
kn numeric true on numeric = "on"

[numeric on]
If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UAX44]) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123". The computed primary weights are all at the start of the digit reordering group. Thus with an untailored UCA table, "a$" < "a0" < "a2" < "a12" < "a⓪" < "aa".
false off
vt variableTop See Appendix Q: Locale Extension Keys and Types. uXXuYYYY

(the default is set to the highest punctuation, thus including spaces and punctuation, but not symbols)
variableTop = "uXXuYYYY"

& \u00XX\uYYYY < [variable top]

The Option value is an encoded Unicode string, with code points in hex, leading zeros removed, and 'u' inserted between successive elements.

The BCP47 value is described in Appendix Q: Locale Extension Keys and Types.

Sets the string value for the variable top. All the code points with primary weights less than or equal to that of the string will be considered variable, and thus affected by the alternate handling. Variables are ignorable by default in [UCA], but not in CLDR. See below for more information.

kr reorder a sequence of one or more reorder codes: space, punct, symbol, currency, digit, or any BCP47 script ID reorder = "Grek digit"

[reorder Grek digit]
Specifies a reordering of scripts or other significant blocks of characters such as symbols, punctuation, and digits. For the precise meaning and usage of the reorder codes, see Section 3.12, Collation Reordering.
n/a match-boundaries: n/a none
match-boundaries = "whole-word"

n/a
Defined by Section 8, Searching and Matching of [UCA].
n/a whole-character
n/a whole-word
n/a match-style n/a minimal
match-style = "medial"

n/a
Defined by Section 8, Searching and Matching of [UCA].
n/a medial
n/a maximal
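The numeric (kn) setting in the table above can be approximated with a toy key that compares runs of decimal digits by value. This is only a sketch: it orders non-digit text by code point rather than by collation weights, so it reproduces the "A-21" < "A-123" example but not the full behavior of a collator. The name toy_numeric_key is hypothetical.

```python
import re


def toy_numeric_key(s):
    """Split into digit and non-digit runs; digit runs compare by value.

    In a str pattern, \\d matches Unicode decimal digits
    (General_Category = Nd), the same class that isdecimal() tests.
    """
    parts = re.split(r"(\d+)", s)
    return [
        (0, int(p)) if p.isdecimal() else (1, p)
        for p in parts
        if p
    ]


plain = sorted(["A-21", "A-123"])                         # code point order
numeric = sorted(["A-21", "A-123"], key=toy_numeric_key)  # digits by value
```

With plain string comparison "A-123" comes first because "1" precedes "2". With the numeric key, 21 sorts before 123.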

Variable Top (vt) bears more explanation. Users may want to include more or fewer characters as Variable. For example, someone could want to restrict the Variable characters to just include space marks. In that case, variableTop would be set to U+1680 (see UCA Variable chart). Alternatively, someone could want more of the Common characters in them, and include characters up to (but not including) '0' by setting variableTop to be U+20BA (in Unicode 6.2; see UCA Common chart).

The effect of these settings is to customize which sets of characters are ignored when comparing strings. For example, the locale identifier "de-u-ka-shifted-vt-0024" requests settings appropriate for German, including German sorting conventions, and specifies that '$' and characters sorting below it are ignored in sorting.
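A locale identifier such as the one above can be split into its base and its collation keywords. The following is a simplified sketch, not a full BCP 47 parser: it assumes a well-formed identifier, treats two-letter subtags after "u" as keys, and ignores attributes. The function name is hypothetical.

```python
def parse_collation_keywords(locale_id):
    """Return (base locale, {key: value}) for the -u- extension, if any."""
    subtags = locale_id.lower().split("-")
    if "u" not in subtags:
        return "-".join(subtags), {}
    start = subtags.index("u")
    base, rest = subtags[:start], subtags[start + 1:]
    keywords, key = {}, None
    for tag in rest:
        if len(tag) == 1:        # another singleton ends the -u- extension
            break
        if len(tag) == 2:        # two-character subtags are keys (ka, vt, ...)
            key = tag
            keywords[key] = []
        elif key is not None:    # longer subtags are values of the current key
            keywords[key].append(tag)
    return "-".join(base), {k: "-".join(v) for k, v in keywords.items()}


base, settings = parse_collation_keywords("de-u-ka-shifted-vt-0024")
```

For the example identifier this yields the base "de" with ka set to "shifted" and vt set to "0024", the hex code point of '$'.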

3.4 Collation Rule Syntax

<!ELEMENT rules (alias | ( reset, ( reset | p | pc | s | sc | t | tc | i | ic | x)* )) >

The goal for the collation rule syntax is to have clearly expressed rules with a concise format that parallels the basic syntax as much as possible. The rule syntax uses abbreviated element names for primary (level 1), secondary (level 2), tertiary (level 3), and identical, to be as short as possible. This is because the tailorings for CJK characters are quite large (tens of thousands of elements), and the extra overhead of longer names would be considerable. Other elements and attributes do not occur as frequently, and have longer names.

Note: The rules are stated in terms of actions that cause characters to change their ordering relative to other characters. This is for stability; assigning characters specific weights would not work, since the exact weight assignment in UCA (or ISO 14651) is not required for conformance — only the relative ordering of the weights. In addition, stating rules in terms of relative order is much less sensitive to changes over time in the UCA itself.

3.5 Orderings

The following are the normal ordering actions used for the bulk of characters. Each rule contains a string of ordered characters that starts with an anchor point or a reset value. The reset value is an absolute point in the UCA that determines the order of other characters. For example, the rule & a < g places "g" after "a" in a tailored UCA: the "a" does not change place. Logically, each subsequent relation after a reset indicates a change to the ordering (and comparison strength) of the characters in the UCA. For example, the UCA has the following sequence (abbreviated for illustration):

... a <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 á <3 Á <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

Whenever a character is inserted into the UCA sequence, it is inserted at the first point where the strength difference will not disturb the other characters in the UCA. For example, & a < g puts g in the above sequence with a strength of L1. Thus the g must go in after any lower strengths,  as follows:

... a <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 á <3 Á <1 g <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

The rule & a << g, which uses a level-2 strength, would produce the following sequence:

... a <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 g <2 á <3 Á <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

And the rule & a <<< g, which uses a level-3 strength, would produce the following sequence:

... a <3 g <3 a <3 ⓐ <3 A <3 A <3 Ⓐ <3 ª <2 á <3 Á <1 æ <3 Æ <1 ɐ <1 ɑ <1 ɒ <1 b <3 b <3 ⓑ <3 B <3 B <3 ℬ ...

Since resets always work on the existing state, the rule entries must be in the proper order. A character or sequence may occur multiple times; each subsequent occurrence causes a different change. The following shows the result of serially applying three rules.

  Basic Syntax  Result Comment 
1 & a < g ... a <1 g ... Put g after a.
2 & a < h < k ... a <1 h <1 k <1 g ... Now put h and k after a (inserting before the g).
3 & h << g ... a <1 h <2 g <1 k ... Now put g after h as a secondary difference (inserting before k).

Notice that characters can occur multiple times, and thus override previous rules.
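The insertion behavior described in this section can be modeled on a small ordered list. The sketch below is an illustration under simplifying assumptions, not the algorithm of any particular implementation: each entry stores the level of its difference from the preceding entry, a new item is placed after the anchor and after any following entries that differ at a weaker level, and an item that already exists is removed first. The names apply_rule and render are hypothetical.

```python
def apply_rule(order, anchor, strength, new):
    """Apply the atomic rule '& anchor <strength new' to order in place.

    order is a list of [item, level], where level is the strength of the
    difference from the preceding item (1 = primary, 2 = secondary, ...).
    """
    # An item that already occurs is moved: remove it, and let its successor
    # inherit the stronger (numerically smaller) of the two differences.
    for i, (item, level) in enumerate(order):
        if item == new:
            del order[i]
            if i < len(order):
                order[i][1] = min(order[i][1], level)
            break
    pos = next(i for i, (item, _) in enumerate(order) if item == anchor) + 1
    # Skip entries that differ from their predecessor at a weaker level.
    while pos < len(order) and order[pos][1] > strength:
        pos += 1
    order.insert(pos, [new, strength])


def render(order):
    """Format the list in the 'a <1 b' notation used in this section."""
    return order[0][0] + "".join(f" <{level} {item}" for item, level in order[1:])


order = [["a", 1], ["b", 1]]
apply_rule(order, "a", 1, "g")      # & a < g
apply_rule(order, "a", 1, "h")      # & a < h < k, first relation
apply_rule(order, "h", 1, "k")      # & a < h < k, second relation
after_rule_2 = render(order)
apply_rule(order, "h", 2, "g")      # & h << g
after_rule_3 = render(order)
```

Starting from a <1 b, the three rules give a <1 h <1 k <1 g <1 b after the second rule and a <1 h <2 g <1 k <1 b after the third.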

Except for the case of expansion sequence syntax, every sequence after a reset is equivalent in action to breaking up the sequence into an atomic rule: a reset + relation pair. The tailoring is then equivalent to applying each of the atomic rules to the UCA in order, according to the above description.

Example:

Basic Syntax                      Equivalent Atomic Rules
& b < q <<< Q                     & b < q
                                  & q <<< Q
& a < x <<< X << q <<< Q < z      & a < x
                                  & x <<< X
                                  & X << q
                                  & q <<< Q
                                  & Q < z
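The decomposition into atomic rules is mechanical for rules without expansions. A minimal sketch follows, assuming basic syntax with only the relations <, <<, <<< and = and no "/" extensions, starred forms, or quoting. The function name is hypothetical.

```python
import re

# Longest operators first, so that "<<<" is not read as "<" three times.
RELATION = re.compile(r"\s*(<<<|<<|<|=)\s*")


def atomic_rules(rule):
    """Break '& a < x <<< X' into ['& a < x', '& x <<< X']."""
    body = rule.strip()
    if not body.startswith("&"):
        raise ValueError("a rule must start with a reset (&)")
    parts = RELATION.split(body[1:].strip())
    anchor, rules = parts[0], []
    # parts alternates: anchor, operator, target, operator, target, ...
    for op, target in zip(parts[1::2], parts[2::2]):
        rules.append(f"& {anchor} {op} {target}")
        anchor = target  # each target becomes the reset of the next pair
    return rules


first = atomic_rules("& b < q <<< Q")
second = atomic_rules("& a < x <<< X << q <<< Q < z")
```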

In the case of expansion sequence syntax, the equivalent atomic sequence can be derived by first transforming the expansion sequence syntax into normal expansion syntax. (See Expansions.)

<!ELEMENT reset ( #PCDATA | cp | ... )* >
<!ELEMENT p ( #PCDATA | cp | last_variable )* >
(Elements pc, s, sc, t, tc, i, and ic have the same structure as p.)

Specifying Collation Ordering
Basic Symbol   Basic Example   XML Symbol   XML Example                  Description
&              & Z             <reset>      <reset>Z</reset>             Do not change the ordering of Z, but place subsequent characters relative to it.
<              & a < b         <p>          <reset>a</reset><p>b</p>     Make 'b' sort after 'a', as a primary (base-character) difference
<<             & a << ä        <s>          <reset>a</reset><s>ä</s>     Make 'ä' sort after 'a' as a secondary (accent) difference
<<<            & a <<< A       <t>          <reset>a</reset><t>A</t>     Make 'A' sort after 'a' as a tertiary (case/variant) difference
=              & v = w         <i>          <reset>v</reset><i>w</i>     Make 'w' sort identically to 'v'

Resets only need to be at the start of a sequence, to position the characters relative a character that is in the UCA (or has already occurred in the tailoring). For example: <reset>z</reset><p>a</p><p>b</p><p>c</p><p>d</p>.

Some additional elements are provided to save space with large tailorings. The addition of a 'c' to the element name indicates that each of the characters in the contents of that element are to be handled as if they were separate elements with the corresponding strength. In the basic syntax, these are expressed by adding a * to the operation.

Abbreviating Ordering Specifications
Basic Symbol   Basic Example    Equivalent                 XML Symbol   XML Example                        Equivalent
<*             & a <* bcd       & a < b < c < d            <pc>         <reset>a</reset><pc>bcd</pc>       <reset>a</reset><p>b</p><p>c</p><p>d</p>
<<*            & a <<* àáâã     & a << à << á << â << ã    <sc>         <reset>a</reset><sc>àáâã</sc>      <reset>a</reset><s>à</s><s>á</s><s>â</s><s>ã</s>
<<<*           & p <<<* PｐＰ     & p <<< P <<< ｐ <<< Ｐ      <tc>         <reset>p</reset><tc>PｐＰ</tc>       <reset>p</reset><t>P</t><t>ｐ</t><t>Ｐ</t>
=*             & v =* VwW       & v = V = w = W            <ic>         <reset>v</reset><ic>VwW</ic>       <reset>v</reset><i>V</i><i>w</i><i>W</i>
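The starred forms of the basic syntax can be expanded mechanically into the long forms. The following is a sketch, assuming that each character in the starred run is a single code point (no combining sequences, ranges, or quoting). The function name is hypothetical.

```python
import re

# A relation followed by '*' and a run of characters, e.g. "<<* àáâã".
STARRED = re.compile(r"(<<<|<<|<|=)\*\s*(\S+)")


def expand_starred(rule):
    """Rewrite '& a <* bcd' as '& a < b < c < d'."""
    def repl(match):
        op, chars = match.groups()
        # Each character gets its own relation of the same strength.
        return " ".join(f"{op} {ch}" for ch in chars)
    return STARRED.sub(repl, rule)


primary = expand_starred("& a <* bcd")
identical = expand_starred("& v =* VwW")
```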

3.6 Contractions

To sort a sequence as a single item (contraction), just use the sequence, for example,

Specifying Contractions
Basic Example   XML Example                  Description
& k < ch        <reset>k</reset><p>ch</p>    Make the sequence 'ch' sort after 'k', as a primary (base-character) difference

3.7 Expansions

<!ELEMENT x (context?, ( p | pc | s | sc | t | tc | i | ic )*, extend? ) >

There are two ways to handle expansions (where a character sorts as a sequence) with both the basic syntax and the XML syntax. The first method is to reset to the sequence of characters. This is called sequence expansion syntax. The second is to use the extension sequence. This is called normal expansion syntax. Both are equivalent in practice (unless the reset sequence happens to be a contraction).

Specifying Expansions
Basic           XML                                                    Description
& c << k / h    <reset>c</reset><x><s>k</s> <extend>h</extend></x>     normal expansion syntax: Make 'k' sort after the sequence 'ch'; thus 'k' will behave as if it expands to a character after 'c' followed by an 'h'.
& ch << k       <reset>ch</reset><s>k</s>                              sequence expansion syntax: Make 'k' sort after the sequence 'ch'; thus 'k' will behave as if it expands to a character after 'c' followed by an 'h' (unless 'ch' is defined beforehand as a contraction).

If an <extend> element is necessary, it requires the rule to be embedded in an <x> element.

The sequence expansion syntax can be quite tricky, so it should be avoided where possible. In particular:

Each extension replaces the one before it; it does not append to it. So

& ab << c
& cd << e

is equivalent to:

& a << c / b << e / d

and produces the following weights (where p(x) is the primary weight of x and s(x) is the secondary weight of x):

Character Weights
c p(a), p(b); s(a)+1, s(b); ...
e p(a), p(d); s(a)+2, s(d); ...

When expressing rules as atomic rules, the sequences must first be transformed into normal expansion syntax:

Expansion Sequence        Normal Expansion            Equivalent Atomic Rules
& ab << q <<< Q           & a << q / b <<< Q / b      & a << q / b
                                                      & q <<< Q / b
& ad <<< AD < x <<< X     & a <<< AD / d < x <<< X    & a <<< AD / d
                                                      & AD < x
                                                      & x <<< X

3.8 Context Before

The context before a character can affect how it is ordered, such as in Japanese. This could be expressed with a combination of contractions and expansions, but is faster using a context. (The actual weights produced are different, but the resulting string comparisons are the same.) If a context element occurs, it must be the first item in the rule, and requires an <x> element.

For example, suppose that "-" is sorted like the previous vowel. Then one could have rules that take "a-", "e-", and so on. However, that means that every time a very common character (a, e, ...) is encountered, a system will slow down as it looks for possible contractions. An alternative is to indicate that when "-" is encountered, and it comes after an 'a', it sorts like an 'a', and so on.

Specifying Previous Context
Basic             XML
& a <<< a | -     <reset>a</reset><x><context>a</context><t>-</t></x>
& e <<< e | -     <reset>e</reset><x><context>e</context><t>-</t></x>
...               ...

Both the context and extend elements can occur in an <x> element. For example, the following are allowed:

3.9 Placing Characters Before Others

There are certain circumstances where characters need to be placed before a given character, rather than after. This is the case with Pinyin, for example, where certain accented letters are positioned before the base letter. That is accomplished with the following syntax.

Placing Characters Before Others
Item     Options                        Basic Example          XML Example
before   primary, secondary, tertiary   & [before 2] a << à    <reset before="secondary">a</reset><s>à</s>

It is an error if the strength of the before relation is not identical to the relation after the reset. Thus the following are errors, since the value of the before attribute does not agree with the relation <s>.

Basic Example         XML Example
& [before 2] a < à    <reset before="primary">a</reset><s>à</s>   Error
& [before 2] a <<< à  <reset before="tertiary">a</reset><s>à</s>  Error
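The consistency requirement can be checked mechanically. The following Python sketch is one possible check on the XML form, assuming the rules are wrapped in a single parent element and use only the simple relation elements p, s, and t; the function name before_mismatches is hypothetical.

```python
import xml.etree.ElementTree as ET

# Relation element name -> strength name used by the before attribute.
STRENGTH_OF = {"p": "primary", "s": "secondary", "t": "tertiary"}

def before_mismatches(rules_xml):
    """Return (before, relation) pairs where a reset's before attribute
    disagrees with the relation element that follows it."""
    children = list(ET.fromstring(rules_xml))
    errors = []
    for i, elem in enumerate(children[:-1]):
        before = elem.get("before") if elem.tag == "reset" else None
        if before is None:
            continue
        relation = children[i + 1].tag
        if STRENGTH_OF.get(relation) != before:
            errors.append((before, relation))
    return errors

good = '<rules><reset before="secondary">a</reset><s>à</s></rules>'
bad = '<rules><reset before="primary">a</reset><s>à</s></rules>'
```

Here before_mismatches(good) is empty, while before_mismatches(bad) reports the pair ("primary", "s").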

3.10 Logical Reset Positions

<!ELEMENT reset ( ... | first_variable| last_variable | first_tertiary_ignorable | last_tertiary_ignorable | first_secondary_ignorable | last_secondary_ignorable | first_primary_ignorable | last_primary_ignorable | first_non_ignorable | last_non_ignorable | first_trailing | last_trailing )* >

The CLDR table (based on UCA) has the following overall structure for weights, going from low to high.

Specifying Logical Positions
Name                                                    Description                                         UCA Examples
first tertiary ignorable ... last tertiary ignorable    p, s, t = ignore                                    Control Codes, Format Characters, Hebrew Points, Tibetan Signs, ...
first secondary ignorable ... last secondary ignorable  p, s = ignore                                       None in UCA
first primary ignorable ... last primary ignorable      p = ignore                                          Most combining marks
first variable ... last variable                        if alternate = non-ignorable: p != ignore;          Whitespace, Punctuation
                                                        if alternate = shifted: p, s, t = ignore
first non-ignorable ... last non-ignorable              p != ignore                                         General Symbols, Currency Symbols, Numbers, Latin, Greek, ...
implicits                                               p != ignore, assigned automatically                 CJK, CJK compatibility (those that are not decomposed), CJK Extension A, B, Unassigned
first trailing ... last trailing                        p != ignore, used for trailing syllable components  Jamo Trailing, Jamo Leading

Each of the above Names (except implicits) can be used with a reset to position characters relative to that logical position. That allows characters to be ordered before or after a logical position rather than a specific character.

Note: The reason for this is so that tailorings can be more stable. A future version of the UCA might add characters at any point in the above list. Suppose that you set character X to be after Y. It could be that you want X to come after Y, no matter what future characters are added; or it could be that you just want X to come after a given logical position, for example, after the last primary ignorable.

Here is an example of the syntax:

Sample Logical Position
Basic                              XML
& [first tertiary ignorable] << à  <reset><first_tertiary_ignorable/></reset><s>à</s>

For example, to make a character be a secondary ignorable, one can make it be immediately after (at a secondary level) a specific character (like a combining dieresis), or one can make it be immediately after the last secondary ignorable.

The last-variable element indicates the "highest" character that is treated as punctuation with alternate handling. Unlike the other logical positions, it can be reset as well as referenced. For example, it can be reset to be just above spaces if all visible punctuation are to be treated as having distinct primary values.

Specifying Last-Variable
Attribute    Options  Basic Example                     XML Example
variableTop  at       & x = [last variable]             <reset>x</reset><i><last_variable/></i>
variableTop  after    & x < [last variable]             <reset>x</reset><p><last_variable/></p>
variableTop  before   & [before 1] x < [last variable]  <reset before="primary">x</reset><p><last_variable/></p>

The default value for last-variable is the highest punctuation mark, thus below symbols. The value can be further changed by using the variable-top setting. This takes effect, however, after the rules have been built, and does not affect any characters that are reset relative to the last-variable value when the rules are being built. The variable-top setting might also be changed via a runtime parameter. That also does not affect the rules.

The <last_variable/> cannot occur inside an <x> element, nor can there be any element content. Thus there can be no <context> or <extend> or text data in the rule. For example, the following are all disallowed:

3.11 Special-Purpose Commands

<!ELEMENT import EMPTY >
<!ATTLIST import source CDATA #REQUIRED >
<!ATTLIST import type CDATA #IMPLIED >

The import command imports rules from another collation. This allows for better maintenance and smaller rule sizes. The source is the locale of the source, and the type is the type (if any). If the source is "locale" it is the same locale. The type is defaulted to "standard".

Example:

<import source="de" type="phonebook"/>

Special-Purpose Commands
Basic XML
[suppress contractions [Љ-ґ]] <suppress_contractions>[Љ-ґ]</suppress_contractions>
[optimize [Ά-ώ]] <optimize>[Ά-ώ]</optimize>

The suppress contractions tailoring command turns off any existing contractions that begin with those characters. It is typically used to turn off the Cyrillic contractions in the UCA, since they are not used in many languages and have a considerable performance penalty. The argument is a Unicode Set.

The optimize tailoring command is purely for performance. It indicates that those characters are sufficiently common in the target language for the tailoring that their performance should be enhanced.

The reason that these are not settings is so that their contents can be arbitrary characters.


Example:

The following is a simple example that combines portions of different tailorings for illustration. For more complete examples, see the actual locale data: Japanese, Chinese, Swedish, and German (type="phonebook") are particularly illustrative.

<collation>
  <settings caseLevel="on"/>
  <rules>
        <reset>Z</reset>
        <p>æ</p>
        <t>Æ</t>
        <p>å</p>
        <t>Å</t>
        <t>aa</t>
        <t>aA</t>
        <t>Aa</t>
        <t>AA</t>
        <p>ä</p>
        <t>Ä</t>
        <p>ö</p>
        <t>Ö</t>
        <s>ű</s>
        <t>Ű</t>
        <p>ő</p>
        <t>Ő</t>
        <s>ø</s>
        <t>Ø</t>
        <reset>V</reset>
        <tc>wW</tc>
        <reset>Y</reset>
        <tc>üÜ</tc>
        <reset><last_non_ignorable/></reset>
        <!-- following is equivalent to <p>亜</p><p>唖</p><p>娃</p>... -->
        <pc>亜唖娃阿哀愛挨姶逢葵茜穐悪握渥旭葦芦</pc>
        <pc>鯵梓圧斡扱</pc>
  </rules>
</collation>
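The comment in the example notes that a compact element such as <pc> is equivalent to a sequence of single-character relations. The following Python sketch shows one way to perform that expansion, assuming one relation per code point; the function name expand_compact is hypothetical, and the sc and ic entries are included on the assumption that they follow the same pattern as the pc and tc elements used above.

```python
import xml.etree.ElementTree as ET

# Compact element name -> the single-character relation it abbreviates.
COMPACT = {"pc": "p", "sc": "s", "tc": "t", "ic": "i"}

def expand_compact(rules_xml):
    """Return (tag, text) pairs with each compact element split into
    one relation per code point; other elements are passed through."""
    expanded = []
    for elem in ET.fromstring(rules_xml):
        text = elem.text or ""
        if elem.tag in COMPACT:
            expanded.extend((COMPACT[elem.tag], ch) for ch in text)
        else:
            expanded.append((elem.tag, text))
    return expanded

pairs = expand_compact("<rules><reset>V</reset><tc>wW</tc></rules>")
```

For the fragment shown, the result is a reset on "V" followed by two tertiary relations, for "w" and "W".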

3.12 Collation Reordering

Collation reordering allows scripts and certain other defined blocks of characters to be moved relative to each other parametrically, without changing the detailed rules for all the characters involved. This reordering is done on top of any specific ordering rules within the script or block currently in effect. Reordering can specify groups to be placed at the start and/or the end of the collation order. For example, to reorder Greek characters before Latin characters, and digits afterwards (but before other scripts), the following can be used:

Basic syntax XML Locale Identifier
[reorder Grek Latn digit] <settings reorder="Grek Latn digit"/> en-u-kr-grek-latn-digit

In each case, a sequence of reorder_codes is used, separated by spaces for Basic and XML syntax, and by hyphens for locale identifiers.

A reorder_code is any of the following special codes:

  1. space, punct, symbol, currency, digit - core groups of characters below 'a'
  2. any script code from the Recommended Table in UAX 31 except Katakana, Common, and Inherited.
    1. Katakana characters are always reordered with Hiragana.
    2. Characters in any script not in the Recommended Table are treated as being in the preceding Recommended script, in DUCET order. Thus Phoenician characters always reorder with Hebrew characters.
  3. others - where all codes not explicitly mentioned should be ordered. The script code Zzzz (Unknown Script) is a synonym for others.

It is an error if a code occurs multiple times.
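As a small illustration of the relationship between the syntaxes, the following Python sketch converts a space-separated reorder list into the locale identifier form and rejects repeated codes. The function name reorder_locale_id is hypothetical, and the sketch assumes the base locale has no existing -u- extension.

```python
def reorder_locale_id(base_locale, reorder_codes):
    """Build a locale identifier with a -u-kr- extension from a
    space-separated list of reorder codes."""
    # Zzzz is a synonym for others, so treat the two as the same code.
    codes = ["others" if code.lower() == "zzzz" else code.lower()
             for code in reorder_codes.split()]
    if len(set(codes)) != len(codes):
        raise ValueError("a reorder code occurs multiple times")
    return base_locale + "-u-kr-" + "-".join(codes)

locale_id = reorder_locale_id("en", "Grek Latn digit")
```

For the list "Grek Latn digit" and base locale "en", this produces the identifier shown in the table above.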

Interpretation of a reordering list

The reordering list is interpreted as if it were processed in the following way.

  1. If any core code is not present, then it is inserted at the front of the list in the order given above.
  2. If the others code is not present, then it is inserted at the end of the list.
  3. The others code is replaced by the list of all script codes not explicitly mentioned, in DUCET order.
  4. The reordering list is now complete, and used to reorder characters in collation accordingly.
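Steps 1 through 3 can be sketched in Python as follows. The function name complete_reordering is hypothetical, and the DUCET script order must be supplied by the caller; the four-script list used here is a shortened sample for illustration, not actual DUCET data.

```python
CORE_CODES = ["space", "punct", "symbol", "currency", "digit"]

def complete_reordering(codes, ducet_script_order):
    """Expand a reordering list following steps 1-3 above."""
    codes = ["others" if c == "Zzzz" else c for c in codes]
    if len(set(codes)) != len(codes):
        raise ValueError("a reorder code occurs multiple times")
    # Step 1: missing core codes go at the front, in their standard order.
    result = [c for c in CORE_CODES if c not in codes] + codes
    # Step 2: append "others" if it is absent.
    if "others" not in result:
        result.append("others")
    # Step 3: replace "others" by the unmentioned scripts, in DUCET order.
    rest = [s for s in ducet_script_order if s not in result]
    i = result.index("others")
    return result[:i] + rest + result[i + 1:]

sample_order = ["Latn", "Grek", "Cyrl", "Arab"]  # illustrative only
greek_first = complete_reordering(["Grek", "Latn", "digit"], sample_order)
digits_last = complete_reordering(["others", "digit"], sample_order)
```

With the sample order, greek_first places Greek and Latin after the four remaining core groups, then digits, then Cyrillic and Arabic; digits_last places digits after every script.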

The locale data may have a particular ordering. For example, the Czech locale data could put digits after all letters, with [reorder others digit]. Any reordering codes specified on top of that (such as with a BCP 47 locale identifier) completely replace what was there. To specify a version of collation that completely resets any existing reordering to the DUCET ordering, the single code others can be used, as below.

Examples:

Locale Identifier Effect
en-u-kr-latn-digit Reorder digits after Latin characters (but before other scripts like Cyrillic).
en-u-kr-others-digit Reorder digits after all other characters.
en-u-kr-arab-cyrl-others-symbol Reorder Arabic characters first, then Cyrillic, and put symbols at the end—after all other characters.
en-u-kr-others Remove any locale-specific reordering, and use DUCET order for reordering blocks.

The default reordering groups are defined by the FractionalUCA.txt file, based on the primary weights of associated collation elements. The [top_byte] table contains a mapping from the first (top) byte of primary weights to the associated reordering group. For example:

U+02D0 MODIFIER LETTER TRIANGULAR COLON has a fractional UCA collation weight of [0E 0B, 05, 05]. In the [top_byte] table, the line [top_byte 0E SYMBOL] indicates that 0E maps to SYMBOL.
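The lookup described above can be sketched in Python as follows, assuming each relevant line has exactly the single-group form quoted in the example. The function names and the regular expression are hypothetical; lines in other forms are simply skipped.

```python
import re

# Matches lines of the assumed form "[top_byte 0E SYMBOL]".
TOP_BYTE_LINE = re.compile(r"\[top_byte\s+([0-9A-Fa-f]{2})\s+([^\s\]]+)\s*\]")

def parse_top_bytes(lines):
    """Map the top byte of a primary weight to its reordering group name."""
    table = {}
    for line in lines:
        match = TOP_BYTE_LINE.search(line)
        if match:
            table[int(match.group(1), 16)] = match.group(2)
    return table

def reordering_group(primary_bytes, table):
    """Look up the group for a primary weight given as a sequence of byte values."""
    return table.get(primary_bytes[0])

table = parse_top_bytes(["[top_byte 0E SYMBOL]"])
group = reordering_group([0x0E, 0x0B], table)
```

For the primary weight 0E 0B of U+02D0, the lookup uses only the first byte and yields SYMBOL.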

There are some special cases:

The default reordering groups follow the allkeys_CLDR.txt ordering; they also may be tailored by implementations to different values. For more information on FractionalUCA.txt and allkeys_CLDR.txt, see Collation Auxiliary.

The DUCET ordering is slightly different from the allkeys_CLDR ordering. The reordering groups for the DUCET are not specified here. However, most reordering groups would start with the same characters as in FractionalUCA.txt.

3.13 Case Parameters

The case level is an optional intermediate level ("2.5") between Level 2 and Level 3 (or after Level 1, if there is no Level 2 due to strength settings). The case level is used to support two parametric features: ignoring non-case variants (Level 3 differences) except for case, and giving case differences a higher-level priority than other tertiary differences. Distinctions between small and large Kana characters are also included as case differences, to support Japanese collation.

The case first parameter controls whether to swap the order of upper and lowercase. It can be used with or without the case level.

Importantly, the case parameters have no effect in many instances. For example, they have no effect on the comparison of two non-ignorable characters with different primary weights, or with different secondary weights if the strength = secondary (or higher).

When either the case level or case first parameters are set, the following describes the derivation of the modified collation elements. It assumes the original levels for the code point are [p.s.t] (primary, secondary, tertiary). This derivation may change in future versions of LDML, to track the case characteristics more closely.

Untailored Characters

For untailored characters and strings, that is, for mappings in the root collation, the case value for each collation element is computed from the tertiary weight listed in allkeys_CLDR.txt. This is used to modify the collation element.

  1. If the character is U+FFFE (lowest-weight), set case value = LOWEST.
  2. Otherwise, look up a case value for the tertiary weight x of each collation element:
    1. UPPER if x ∈ {08-0C, 0E, 11, 12, 1D}
    2. UNCASED otherwise

Compute Modified Collation Elements

From a computed case value, set a weight c according to the following.

  1. If the value is LOWEST, set c = 1
  2. Otherwise if CaseFirst=UpperFirst, set c = UPPER ? 2 : MIXED ? 3 : 4
  3. Otherwise set c = UPPER ? 4 : MIXED ? 3 : 2
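The case value lookup for untailored characters and the weight rules above can be sketched in Python as follows. The function names are hypothetical, and the tertiary weight is passed in by the caller rather than read from allkeys_CLDR.txt.

```python
# Tertiary weights treated as uppercase: 08-0C, 0E, 11, 12, 1D (hexadecimal).
UPPER_TERTIARIES = set(range(0x08, 0x0D)) | {0x0E, 0x11, 0x12, 0x1D}

def case_value(char, tertiary):
    """Case value for one collation element of an untailored character."""
    if char == "\uFFFE":
        return "LOWEST"
    return "UPPER" if tertiary in UPPER_TERTIARIES else "UNCASED"

def case_weight(value, upper_first=False):
    """Weight c for a case value, following the three rules above."""
    if value == "LOWEST":
        return 1
    if upper_first:
        return 2 if value == "UPPER" else 3 if value == "MIXED" else 4
    return 4 if value == "UPPER" else 3 if value == "MIXED" else 2
```

Any value other than LOWEST, UPPER, or MIXED falls into the last branch of each rule, so UNCASED and LOWER receive the same weight.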

Compute a new collation element according to the following table. The notation xt means that the values are numerically combined into a single level, such that xt < yu whenever x < y. The fourth level (if it exists) is unaffected.

Case Level  Strength             Original CE  Modified CE  Comment
on          primary              0.s.t        0.0          ignore case level weights of primary-ignorable CEs
on          primary              p.s.t        p.c
on          secondary or higher  0.0.t        0.0.0.t      ignore case level weights of secondary-ignorable CEs
on          secondary or higher  0.s.t        0.s.c.t
on          secondary or higher  p.s.t        p.s.c.t
off         any                  0.0.0        0.0.00       ignore case level weights of tertiary-ignorable CEs
off         any                  0.0.t        0.0.4t
off         any                  0.s.t        0.s.ct
off         any                  p.s.t        p.s.ct

Note the special case weights when s = 0. They ensure the construction of well-formed case and tertiary weights. For details, see Section 3.7, Well-Formed Collation Element Tables in [UCA].
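The table can be read as a function from the original levels and the weight c to a new collation element. The following Python sketch is one way to express it; the function name modified_ce and its parameters are hypothetical, and a combined level such as "ct" is represented as the pair (c, t), relying on Python's lexicographic tuple comparison so that (x, t) < (y, u) whenever x < y.

```python
def modified_ce(p, s, t, c, case_level, strength_is_primary=False):
    """Return the modified collation element as a tuple of levels."""
    if case_level:
        if strength_is_primary:
            # 0.s.t -> 0.0 ; p.s.t -> p.c
            return (0, 0) if p == 0 else (p, c)
        if p == 0 and s == 0:
            # 0.0.t -> 0.0.0.t
            return (0, 0, 0, t)
        # 0.s.t -> 0.s.c.t ; p.s.t -> p.s.c.t
        return (p, s, c, t)
    if p == 0 and s == 0:
        # 0.0.0 -> 0.0.00 ; 0.0.t -> 0.0.4t
        return (0, 0, (0, 0)) if t == 0 else (0, 0, (4, t))
    # 0.s.t -> 0.s.ct ; p.s.t -> p.s.ct
    return (p, s, (c, t))
```

With the case level off, a case difference therefore outranks any other tertiary difference within the combined level: (7, 5, (2, 9)) compares less than (7, 5, (4, 1)).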

Tailored Strings

Characters and strings that are tailored (e.g., via LDML/XML collation syntax or basic collation syntax) have case values computed from their UCD properties. A known limitation of the tailoring is that where the source string is a contraction of cased characters, the case level does not reflect the difference in mixed cases, such as between "dZ" and "Dz".

  1. Form a set of case values by looking up a case value for each character x in the NFKD mapping of the source string, based on UCD properties:
    1. UNCASED if x ∈ UncasedExceptions
    2. LOWER if x ∈ Lowercase or x ∈ Changes_When_Uppercased or x ∈ LowerExceptions
    3. UPPER if x ∈ Uppercase or x ∈ Changes_When_Lowercased or x ∈ UpperExceptions
    4. MIXED if x ∈ gc=Lt or both (a) and (b)
    5. UNCASED otherwise
  2. Compute a single case value from this set, by first removing UNCASED, then setting:
    1. MIXED if not all elements are identical, otherwise
    2. UPPER if the set contains UPPER, otherwise
    3. LOWER
  3. Apply that case-value to the first collation element in the tailoring, according to "Compute Modified Collation Elements". The case values and weights in an expansion are unaffected.
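Step 2 can be sketched in Python as follows, taking the per-character case values from step 1 as input. The function name combined_case_value is hypothetical, and the sketch follows the wording of step 2 literally.

```python
def combined_case_value(values):
    """Reduce per-character case values to a single value (step 2 above)."""
    # Remove UNCASED first, as step 2 requires.
    cased = [v for v in values if v != "UNCASED"]
    if len(set(cased)) > 1:
        return "MIXED"   # not all elements are identical
    if "UPPER" in cased:
        return "UPPER"
    # Literal reading of step 2: anything else, including an empty list, is LOWER.
    return "LOWER"

dz_lower_upper = combined_case_value(["LOWER", "UPPER"])  # values for "dZ"
dz_upper_lower = combined_case_value(["UPPER", "LOWER"])  # values for "Dz"
```

Both "dZ" and "Dz" reduce to MIXED, which illustrates the limitation noted above for contractions of cased characters.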

UncasedExceptions: is the set of letter modifiers

LowerExceptions: is the set of small letters where script=Hiragana or Katakana, plus other characters lowercase in form. In Unicode 6.2, these are:

UpperExceptions: is the set of non-small letters where script=Hiragana or Katakana, minus the iteration mark, plus other characters uppercase in form. In Unicode 6.2, these are:

3.14 Visibility

<!ATTLIST collation visibility ( internal | external ) "external" >

Collators have external visibility by default, meaning that they can be displayed in a list of collation options for users to choose from. Collators marked as having internal visibility should not be shown in such a list. Collators are typically internal when they are partial sequences included in other collators.

3.15 Collation Indexes

Index Characters

The main data includes <exemplarCharacters> for collation indexes. See the main document, Section 5.6 Character Elements, for general information about exemplar characters.

The index characters are a set of characters for use as a UI "index", that is, a list of clickable characters (or character sequences) that allow the user to see a segment of a larger "target" list. Each character corresponds to a bucket in the target list. There can be different kinds of index lists: one that is relatively static, and another that produces roughly equally-sized buckets. While CLDR is mostly focused on the first, there is provision for supporting the second as well.

The index characters need to be used in conjunction with a collation for the locale, which will determine the order of the characters. It will also determine which index characters show up.

The static list would be presented as something like the following (either vertically or horizontally):

… A B C D E F G H CH I J K L M N O P Q R S T U V W X Y Z …

In the "A" bucket, you would find all items that are primary greater than or equal to "A" in collation order, and primary less than "B". The use of the list requires that the target list be sorted according to the locale that is used to create that list. Although we say "character" above, the index character could be a sequence, like "CH" above. The index exemplar characters must always be used with a collation appropriate for the locale. Any characters that do not have primary differences from others in the set should be removed.
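The bucket rule above (primary greater than or equal to the index character, and primary less than the next one) can be sketched in Python with a binary search. The function primary_key is a hypothetical stand-in (simple case folding) for the primary-level sort key of a real collator, and the sketch handles only the leading "…" bucket, not script boundaries or the trailing bucket.

```python
import bisect

def primary_key(text):
    """Stand-in for a collator's primary sort key (assumption: case folding)."""
    return text.casefold()

def bucket_for(name, index_chars, underflow="\u2026"):
    """Return the label of the bucket that name falls into."""
    labels = sorted(index_chars, key=primary_key)
    keys = [primary_key(label) for label in labels]
    # Count the index characters whose key is less than or equal to the name's key.
    position = bisect.bisect_right(keys, primary_key(name))
    return underflow if position == 0 else labels[position - 1]

index = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
```

With this index, "apple" falls into the A bucket and "123" into the leading "…" bucket.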

Details:

  1. The primary weight (according to the collation) is used to determine which bucket a string is in. There are special buckets for before the first character, between buckets of different scripts, and after the last bucket (and of a different script).
  2. Characters in the index characters do not need to have distinct primary weights. That is, the index characters are adapted to the underlying collation: normally Ё is in the Е bucket for Russian, but if someone used a variant of Russian collation that distinguished them on a primary level, then Ё would show up as its own bucket.
  3. If an index character string ends with a single "*" (U+002A), for example "Sch*" and "St*" in German, then there will be a separate bucket for the string minus the "*", for example "Sch" and "St", even if that string does not sort distinctly.
  4. An index character can have multiple primary weights, for example "Æ" and "Sch". Names that have the same initial primary weights sort into this index character’s bucket. This can be achieved by using an upper-boundary string that is the concatenation of the index character and U+FFFF, for example "Æ\uFFFF" and "Sch\uFFFF". Names that sort greater than this upper boundary but less than the next index character are redirected to the last preceding single-primary index character (A and S for the examples here).

For example, for index characters [A Æ B R S {Sch*} {St*} T] the following sample names are sorted into an index as shown.

The … items are special: each is a bucket for everything else, either less or greater. They are inserted at the start and end of the index list, and on script boundaries. These are really script boundaries, not reordering code boundaries. Each script has its own range, except where scripts sort primary-equal (e.g., Hira & Kana). All characters that sort in the low reordering groups (whitespace, punctuation, symbols, currency symbols, digits) are treated as a single script for this purpose. So if you had a collation that reordered Hebrew after Ethiopic, you would still get index boundaries between the following (and in that order):

  1. Ethiopic
  2. Hebrew
  3. Phoenician // included in the Hebrew reordering group
  4. Samaritan // included in the Hebrew reordering group
  5. Devanagari

If you tailor a Greek character into the Cyrillic script, that Greek character will be bucketed (and sorted) among the Cyrillic ones.

In the UI, an index character could also be omitted or grayed out if its bucket is empty. For example, if there is nothing in the bucket for Q, then Q could be omitted. That would be up to the implementation. Additional buckets could be added if other characters are present. For example, we might see something like the following:

Sample Greek Index
Contents
 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω
With only content beginning with Greek letters 
 … Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω …
With some content before or after
 … 9 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω …
With numbers, and nothing between 9 and Alpha
  … 9 A-Z Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω …
With numbers, some Latin

Here is a sample of the XML structure:

<exemplarCharacters type="index">[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]</exemplarCharacters>

The display of the index characters can be modified with the Index labels elements, discussed in the main document, Section 5.6.4 Index Labels.

CJK Index Markers

Special index markers have been added to the CJK collations for stroke, pinyin, zhuyin, and unihan. These markers allow for effective and robust use of indexes for these collations. For example, near the start of the pinyin tailoring there is the following:

<p>&#xFDD0;A</p><!-- INDEX A -->
<pc>阿呵𥥩锕𠼞𨉚</pc><!-- ā -->

<pc>翶</pc><!-- ao -->
<p>&#xFDD0;B</p><!-- INDEX B -->

These indicate the boundaries of "buckets" that can be used for indexing. They are always two characters starting with the noncharacter U+FDD0, and thus will not occur in normal text. For pinyin the second character is A-Z; for unihan it is one of the radicals; and for stroke it is a character after U+2800 indicating the number of strokes, such as ⠁. For zhuyin the second character is one of the standard Bopomofo characters in the range U+3105 through U+3129.

The corresponding bucket label strings are the boundary strings with the leading U+FDD0 removed. For example, the Pinyin boundary string "\uFDD0A" yields the label string "A".

However, for stroke order, the label string is the stroke count (second character minus U+2800) as a decimal-digit number followed by 劃 (U+5283). For example, the stroke order boundary string "\uFDD0\u2805" yields the label string "5劃".
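The label derivation described above can be sketched in Python as follows. The function name label_for_boundary and the collation_type parameter are hypothetical.

```python
def label_for_boundary(boundary, collation_type):
    """Derive the bucket label from a two-character CJK boundary string."""
    if len(boundary) != 2 or boundary[0] != "\uFDD0":
        raise ValueError("not a CJK index boundary string")
    second = boundary[1]
    if collation_type == "stroke":
        # Stroke count is the offset from U+2800, followed by U+5283.
        return str(ord(second) - 0x2800) + "\u5283"
    # pinyin, zhuyin, unihan: the label is the boundary minus U+FDD0.
    return second

pinyin_label = label_for_boundary("\uFDD0A", "pinyin")
stroke_label = label_for_boundary("\uFDD0\u2805", "stroke")
```

These two calls reproduce the examples in the text: "A" for the Pinyin boundary string and "5劃" for the stroke order boundary string.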