The file exemplar_collation_check.txt
in this directory is a comparison of the Exemplar characters with the tailored
Collation characters, generated for sanity checking.
Notes:
- IE doesn't see the file as UTF-8; you have to manually change the encoding. I
find it simpler to download the file and view it in Notepad or another program.
- These are not necessarily real failures; they are simply items to check.
- The test case doesn't distinguish locales that have identical collation
tables, so items like Japanese or Arabic will repeat multiple times; just ignore
the ones that don't apply.
- Collation rules can be tricky; for more information, see http://oss.software.ibm.com/cvs/icu/~checkout~/locale/data_formats.html#Collation
Examples:
Failure at Hungarian
Characters in Collation Set but not in Exemplar Set
0111 # Ll [1] (đ) U+0111 LATIN SMALL LETTER D WITH STROKE
0063,0073 # (cs) U+0063 LATIN SMALL LETTER C,U+0073 LATIN SMALL LETTER S
0067,0079 # (gy) U+0067 LATIN SMALL LETTER G,U+0079 LATIN SMALL LETTER Y
006C,0079 # (ly) U+006C LATIN SMALL LETTER L,U+0079 LATIN SMALL LETTER Y
0073,007A # (sz) U+0073 LATIN SMALL LETTER S,U+007A LATIN SMALL LETTER Z
007A,0073 # (zs) U+007A LATIN SMALL LETTER Z,U+0073 LATIN SMALL LETTER S
Here the đ is not an error, but the cs, etc. should be in the exemplar set
(bug already filed).
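The comparison behind a report like the one above is essentially a set
difference. The following is a minimal sketch of that idea, not the actual test
code: the exemplar and collation sets are hypothetical stand-ins, describe() is
an illustrative helper, and the output line format is simplified (it omits the
general category and count columns).

```python
import unicodedata

def describe(item):
    """Format one item roughly like a line of the report (simplified)."""
    codes = ",".join(f"{ord(ch):04X}" for ch in item)
    names = ",".join(f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in item)
    return f"{codes} # ({item}) {names}"

# Hypothetical stand-ins for the real data: a few exemplar letters and the
# strings tailored by the collation rules.
exemplar = {"a", "b", "c", "d", "g", "l", "s", "y", "z"}
collation = {"c", "d", "\u0111", "cs", "gy", "ly", "sz", "zs"}

# Items in the collation set but not in the exemplar set.
missing = sorted(collation - exemplar, key=lambda s: (len(s), s))

for item in missing:
    print(describe(item))
```

Each printed line is only an item to check by hand, as with the đ above.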
Failure at Hindi
Characters in Collation Set but not in Exemplar Set
0964..0965 # Po [2] (।..॥) U+0964 DEVANAGARI DANDA..U+0965 DEVANAGARI DOUBLE DANDA
0970 # Po [1] (॰) U+0970 DEVANAGARI ABBREVIATION SIGN
0915,0901 # (कँ) U+0915 DEVANAGARI LETTER KA,U+0901 DEVANAGARI SIGN CANDRABINDU
0915,0902 # (कं) U+0915 DEVANAGARI LETTER KA,U+0902 DEVANAGARI SIGN ANUSVARA
0915,0903 # (कः) U+0915 DEVANAGARI LETTER KA,U+0903 DEVANAGARI SIGN VISARGA
0915,093D # (कऽ) U+0915 DEVANAGARI LETTER KA,U+093D DEVANAGARI SIGN AVAGRAHA
0915,093E # (का) U+0915 DEVANAGARI LETTER KA,U+093E DEVANAGARI VOWEL SIGN AA
....
http://oss.software.ibm.com/cvs/icu/~checkout~/locale/collation_diff/hi_IN_collation.html
The Hindi collation rules list all the combinations of base + matra. This is
superfluous, since if all the matras have primary weights greater than the
bases, the right order will occur. So unless there are specific combinations of
characters that change order, rules should simply have the correct ordering of
base letters (including only those that *differ* from the UCA rules; see http://www.unicode.org/charts/collation/),
followed by the correct ordering of the matras (with primary order, since they
are secondary in UCA).
delete the following, AFTER checking that the UCA order is ok:
<ॐ
<।
<॥
<॰
<०
...
<ओ
<औ
<क
<क़=क़
<कँ
<कं
<कः
<क॑
...
<ह
retain the following:
<़
<ँ
<ं
<ः
...
<ॊ
<ो
<ौ
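To see why listing every base + matra combination is unnecessary, here is a
small sketch using a toy primary-weight table. The weights, the chosen letters,
and the sort_key() helper are assumptions made for illustration only; this is
not how ICU computes sort keys. It only shows that when every matra has a
primary weight greater than every base, comparing per-character primary weights
already puts the combinations in the expected order.

```python
# Toy data: three base letters (KA, KHA, GA) and four signs
# (CANDRABINDU, ANUSVARA, VISARGA, VOWEL SIGN AA), in their intended order.
BASES = ["\u0915", "\u0916", "\u0917"]
MATRAS = ["\u0901", "\u0902", "\u0903", "\u093E"]

# Assign increasing primary weights so that every matra outranks every base.
PRIMARY = {ch: weight for weight, ch in enumerate(BASES + MATRAS, start=1)}

def sort_key(text):
    """Return the list of toy primary weights for each character in text."""
    return [PRIMARY[ch] for ch in text]

words = [
    "\u0916",        # KHA
    "\u0915\u093E",  # KA + AA
    "\u0915",        # KA
    "\u0915\u0902",  # KA + ANUSVARA
    "\u0917\u0901",  # GA + CANDRABINDU
    "\u0915\u0901",  # KA + CANDRABINDU
    "\u0916\u093E",  # KHA + AA
]

ordered = sorted(words, key=sort_key)
print(" < ".join(ordered))
```

No rule mentions a KA + matra pair, yet all the KA combinations sort after KA
and before KHA, in matra order.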
Failure at Maltese (Malta)
Characters in Collation Set but not in Exemplar Set
0063 # Ll [1] (c) U+0063 LATIN SMALL LETTER C
http://oss.software.ibm.com/cvs/icu/~checkout~/locale/collation_diff/mt_MT_collation.html
Here the problem is that the rules are trying to sort some character sequences
*before* a base character, e.g.
& B
< ċ
<<<Ċ
< c
<<<C
This works, but is sub-optimal for two reasons.
1. it tailors c/C when they don't need to be tailored; any extra tailoring
generally makes for longer sort keys.
2. by tailoring c/C, it puts the other things that sort after b/B after c/C
instead. See http://www.unicode.org/charts/collation/
for examples.
The correct rules should be:
& [before 1] c < ċ <<< Ċ
This finds the character with the highest primary weight (that's what the 1 is
for) that is less than c, and uses that as the reset point.
For Maltese, the same technique needs to be used for ġ and ż.
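The difference between the two rule styles can be sketched with a toy model of
primary order. Everything here is an assumption for illustration: the four-item
root order (with ƀ standing in for anything that sorts between b and c), and
the helpers insert_after() and insert_before(), which only mimic the effect of
"& X < ..." and "& [before 1] X < ..." on a flat list. This is not ICU's
implementation.

```python
def insert_after(order, reset, items):
    """Mimic '& reset < i1 < i2 ...': move items to just after reset."""
    remaining = [x for x in order if x not in items]
    i = remaining.index(reset) + 1
    return remaining[:i] + items + remaining[i:]

def insert_before(order, anchor, items):
    """Mimic '& [before 1] anchor < i1 ...': place items just before anchor."""
    remaining = [x for x in order if x not in items]
    i = remaining.index(anchor)
    return remaining[:i] + items + remaining[i:]

# Hypothetical root order: U+0180 (b with stroke) sorts between b and c.
toy_root = ["b", "\u0180", "c", "d"]

# Current style: '& B < ċ < c' retailors c as well as ċ (U+010B).
old_style = insert_after(toy_root, "b", ["\u010B", "c"])

# Proposed style: '& [before 1] c < ċ' leaves c untouched.
new_style = insert_before(toy_root, "c", ["\u010B"])

print(old_style)
print(new_style)
```

In the first result the character that used to sit between b and c ends up
after c; in the second it stays where it was, and only ċ is tailored.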