Exemplar Vs. Collation Sanity Check

The file exemplar_collation_check.txt in this directory is a comparison of the Exemplar characters with the tailored Collation characters, generated for sanity checking.

Notes:

- IE doesn't see the file as UTF-8; you have to change the encoding manually. I find it simpler to download the file and view it in Notepad or another program.

- These are not necessarily real failures; they are simply items to check.

- The test case doesn't distinguish locales that have identical collation tables, so items like Japanese or Arabic will repeat multiple times; just ignore the ones that don't apply.

- Collation rules can be tricky; for more information, see http://oss.software.ibm.com/cvs/icu/~checkout~/locale/data_formats.html#Collation

Examples:

Failure at Hungarian
Characters in Collation Set but not in Exemplar Set
0111 # Ll [1] (đ) U+0111 LATIN SMALL LETTER D WITH STROKE
0063,0073 # (cs) U+0063 LATIN SMALL LETTER C,U+0073 LATIN SMALL LETTER S
0067,0079 # (gy) U+0067 LATIN SMALL LETTER G,U+0079 LATIN SMALL LETTER Y
006C,0079 # (ly) U+006C LATIN SMALL LETTER L,U+0079 LATIN SMALL LETTER Y
0073,007A # (sz) U+0073 LATIN SMALL LETTER S,U+007A LATIN SMALL LETTER Z
007A,0073 # (zs) U+007A LATIN SMALL LETTER Z,U+0073 LATIN SMALL LETTER S

Here the đ is not an error, but the cs, etc. should be in the exemplar set (bug already filed).
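The comparison behind a report like this can be sketched as a plain set difference. The following Python is only an illustration, not the tool that generated the file: the function names and the abbreviated Hungarian data are hypothetical, and unlike the real report it does not collapse adjacent code points into ranges.

```python
import unicodedata

def describe(item):
    """Format one string roughly in the style of the report lines above."""
    codes = ",".join(f"{ord(ch):04X}" for ch in item)
    names = ",".join(
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in item
    )
    # Single characters also show their general category and a count of 1.
    prefix = f"{unicodedata.category(item)} [1] " if len(item) == 1 else ""
    return f"{codes} # {prefix}({item}) {names}"

def collation_not_in_exemplar(collation_items, exemplar_items):
    """Report lines for strings tailored in collation but absent from the exemplars."""
    missing = set(collation_items) - set(exemplar_items)
    # Single characters first, then sequences, each group in code point order.
    return [describe(s) for s in sorted(missing, key=lambda s: (len(s), s))]

# Hypothetical, abbreviated data for illustration only.
exemplar = {"a", "b", "c", "d", "g", "l", "s", "y", "z"}
collation = {"c", "cs", "gy", "\u0111"}  # U+0111 is đ

report = collation_not_in_exemplar(collation, exemplar)
for line in report:
    print(line)
```

Every line it prints is a candidate to review, not a confirmed error, which matches how the report is meant to be read.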

Failure at Hindi
Characters in Collation Set but not in Exemplar Set
0964..0965 # Po [2] (।..॥) U+0964 DEVANAGARI DANDA..U+0965 DEVANAGARI DOUBLE DANDA
0970 # Po [1] (॰) U+0970 DEVANAGARI ABBREVIATION SIGN
0915,0901 # (कँ) U+0915 DEVANAGARI LETTER KA,U+0901 DEVANAGARI SIGN CANDRABINDU
0915,0902 # (कं) U+0915 DEVANAGARI LETTER KA,U+0902 DEVANAGARI SIGN ANUSVARA
0915,0903 # (कः) U+0915 DEVANAGARI LETTER KA,U+0903 DEVANAGARI SIGN VISARGA
0915,093D # (कऽ) U+0915 DEVANAGARI LETTER KA,U+093D DEVANAGARI SIGN AVAGRAHA
0915,093E # (का) U+0915 DEVANAGARI LETTER KA,U+093E DEVANAGARI VOWEL SIGN AA
....

http://oss.software.ibm.com/cvs/icu/~checkout~/locale/collation_diff/hi_IN_collation.html

The Hindi collation rules list all the combinations of base + matra. This is superfluous: if all the matras have primary weights greater than the bases, the right order will occur automatically. So unless there are specific combinations of characters that change order, the rules should simply give the correct ordering of the base letters (including only those that *differ* from the UCA rules; see http://www.unicode.org/charts/collation/), followed by the correct ordering of the matras (with primary order, since they are secondary in UCA).
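A toy model may make the argument about primary weights concrete. The weights below are invented for illustration and are not UCA or ICU values; the only property that matters is that every base letter weighs less than every matra or sign. Comparing strings by their sequence of primary weights then orders the base + matra combinations correctly without listing any of them.

```python
# Toy primary weights (assumed values, not real collation data):
# every base letter is lower than every matra or sign.
PRIMARY = {
    "\u0915": 10,   # KA (base)
    "\u0916": 11,   # KHA (base)
    "\u0901": 100,  # SIGN CANDRABINDU
    "\u0902": 101,  # SIGN ANUSVARA
    "\u093E": 110,  # VOWEL SIGN AA
}

def primary_key(text):
    """Sort key: the sequence of primary weights, compared left to right."""
    return tuple(PRIMARY[ch] for ch in text)

words = [
    "\u0916",        # KHA
    "\u0915\u093E",  # KA + AA
    "\u0915\u0902",  # KA + ANUSVARA
    "\u0915",        # KA
    "\u0915\u0901",  # KA + CANDRABINDU
    "\u0915\u0915",  # KA + KA
]

ordered = sorted(words, key=primary_key)
print(ordered)
```

In this model KA sorts first, then KA + KA, then KA followed by each sign in sign order, and KHA last, even though no combination was ever given a weight of its own.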

Delete the following, AFTER checking that the UCA order is OK:
<ॐ
<।
<॥
<॰
<०
...
<ओ
<औ
<क
<क़=क़
<कँ
<कं
<कः
<क॑
...
<ह

Retain the following:

<़
<ँ
<ं
<ः
...
<ॊ
<ो
<ौ

Failure at Maltese (Malta)
Characters in Collation Set but not in Exemplar Set
0063 # Ll [1] (c) U+0063 LATIN SMALL LETTER C

http://oss.software.ibm.com/cvs/icu/~checkout~/locale/collation_diff/mt_MT_collation.html

Here the problem is that the rules are trying to sort some character sequences *before* a base character, e.g.
& B
< ċ
<<<Ċ
< c
<<<C

This works, but is sub-optimal for two reasons:
1. It tailors c/C when it doesn't need to; any extra tailoring generally makes for longer sort keys.
2. By tailoring c/C, it moves c/C directly after b/B, so the other characters that normally sort after b/B end up after c/C instead. See http://www.unicode.org/charts/collation/ for examples.

The correct rules should be:

& [before 1] c < ċ <<< Ċ

This finds the highest primary (that's what the 1 is for) character less than c, and uses that as the reset point.
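The difference between the two approaches can be sketched with a toy model in which the collation order is just a Python list of primaries. This is not how ICU implements tailoring; the helper names are hypothetical, case (<<<) is ignored, and "B2" is a placeholder for any character that normally sorts between b and c.

```python
def reset_after(order, anchor, new_items):
    """Model '& anchor < x < y': place new_items directly after anchor."""
    rest = [ch for ch in order if ch not in new_items]
    i = rest.index(anchor) + 1
    return rest[:i] + list(new_items) + rest[i:]

def reset_before(order, anchor, new_items):
    """Model '& [before 1] anchor < x': place new_items directly after
    whatever immediately precedes anchor, leaving anchor itself untouched."""
    rest = [ch for ch in order if ch not in new_items]
    i = rest.index(anchor)
    return rest[:i] + list(new_items) + rest[i:]

# "B2" stands in for a character that sorts between b and c by default.
base = ["a", "b", "B2", "c", "d"]

# Original Maltese rules: reset at b, then tailor both the dotted c and c.
tailored_both = reset_after(base, "b", ["\u010B", "c"])

# Suggested rules: reset just before c and tailor only the dotted c.
tailored_min = reset_before(base, "c", ["\u010B"])

print(tailored_both)  # B2 now sorts after c
print(tailored_min)   # B2 keeps its place before the dotted c and c
```

In the first result the placeholder has been pushed after c, which is the side effect described in reason 2; in the second, only ċ moves.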

For Maltese, the same technique needs to be used for ġ and ż.