Public Review Issue #175: CLDR 1.9 Collation Changes

The CLDR committee is making Unicode locale-sensitive collation a major focus for the next release, CLDR 1.9, and welcomes feedback on the planned changes. To give feedback on any of these actions, please add a comment to the relevant ticket, or file a new ticket at http://unicode.org/cldr/trac/newticket. The list of CLDR tickets is at http://unicode.org/cldr/trac/report/30; more tickets may be added to this list over time. The planned changes include:
  1. Modifying the tailoring for many languages. These tailoring changes include:
  2. Basing pinyin and radical-stroke collations on Unihan data. Draft rules are in http://www.unicode.org/review/pr-175/ and may be updated during the public review period. They include collations for pinyin, stroke, and radical-stroke. For comparison, a pinyin transliteration is also included. Some data sources other than Unihan are also used.
  3. Removing “backwards secondaries” from default French collation. Users will still be able to set this option parametrically or via locale keywords (such as “fr-u-kb-true”) when using French (or other languages); the only change is that this option will no longer be the default for French.
  4. Scripts and certain other categories of characters (whitespace, currency symbols, punctuation, most numbers, other symbols) will be parametrically reorderable. For example, the rules for Greek would be able to specify that the sorting order is:
  5. Collation rules will also support an “import” statement, so that the default European Ordering Rules can be used as a basis for languages of the European Union.
  6. The code point U+FFFF will be tailored to have a weight higher than all other characters, and further tailoring of U+FFFF will be disallowed in other collation variants. This allows reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\uFFFF”.
  7. CLDR is planning to use a tailored UCA DUCET (Default Unicode Collation Element Table) in the root locale. This will be inherited by all other locales by default. However, there will also be a separate collation in root, with the keyword “ducet”. Using that keyword, the locale ID “und-u-co-ducet” will allow access to the original DUCET. The root locale ordering will be modified in the following ways: