Proposed Collation Changes for 6.0

L2/10-275R

Date: 2010-08-04

From: Mark Davis, CLDR-TC

CLDR is planning to use a tailored DUCET in the root locale. This will be inherited by all other locales by default. However, there will be a separate collation also in root, with the keyword “ducet”. Using that keyword, a locale ID such as “und-u-co-ducet” will allow access to the original DUCET table.
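As an illustration of how the “co” keyword is carried in a locale ID, the following is a minimal sketch that extracts the key/type pairs of the -u- extension. The function name is hypothetical, and the parsing is deliberately simplified (it ignores attributes and does not validate subtags); it is not the parser used by CLDR or ICU.

```python
def unicode_extension_keywords(locale_id):
    """Return the key/type pairs found in the -u- extension of a locale ID.

    Simplified sketch: two-character subtags after "u" are treated as keys,
    longer subtags as the type of the most recent key.
    """
    subtags = locale_id.lower().split("-")
    if "u" not in subtags:
        return {}
    keywords = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            break                      # another singleton starts a different extension
        if len(subtag) == 2:
            key = subtag               # e.g. "co" (collation type)
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)
    return {k: "-".join(v) for k, v in keywords.items()}


print(unicode_extension_keywords("und-u-co-ducet"))  # {'co': 'ducet'}
```

A locale ID without the extension, such as “und”, yields an empty result, which corresponds to inheriting the tailored root collation.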

The following lists the changes that are planned. We also propose that the first two types of changes be done in the UCA for 6.0.

Reference links

● For the current (5.2) DUCET ordering, see http://unicode.org/charts/collation/.

● For the 6.0 DUCET ordering (with script/GC), see http://macchiato.com/utc/uca/6.0.0/UCA_Rules_NoCE.txt.

● The ICU collation demo may also be useful: http://goo.gl/GQuI.

1. Currency/Punctuation

There are only a few cases where currency characters are not grouped together, or where punctuation marks do not sort below Latin. (This is important for collation reordering, as described below in #3.)

U+20A8 ( ₨ ) RUPEE SIGN

U+FDFC ( ﷼ ) RIAL SIGN

U+19DE ( ᧞ ) NEW TAI LUE SIGN LAE

U+19DF ( ᧟ ) NEW TAI LUE SIGN LAEV

(These currently sort with their respective scripts, rather than with similar symbols and punctuation.)

Proposed correction in 6.0.

Insert 20A8 and FDFC after

20B8 ; [.1598.0020.0002.20B8] # TENGE SIGN

(Note: either could go earlier in the currency signs; for example, Rupee could go after U+20A7 ( ₧ ) PESETA SIGN.)

Insert 19DE and 19DF after a variable punctuation mark, such as:

2016 ; [*0576.0020.0002.2016] # DOUBLE VERTICAL LINE

ICU format (has LDML equivalent):

& '₸' # TENGE SIGN

< '₨' # RUPEE SIGN

< '﷼' # RIAL SIGN

& '‖' # DOUBLE VERTICAL LINE (the last punctuation mark in Variable)

< '᧞' # NEW TAI LUE SIGN LAE

< '᧟' # NEW TAI LUE SIGN LAEV

2. Characters that should be ignorable

The following two characters have sort weights (unless alternate=shifted). They should always be ignorable (whatever the alternate handling is), and thus have weights starting with 0000, so that they do not affect collation or searching (except with the IDENTICAL strength).

Proposed correction in 6.0.

0640 ; [*020C.0020.0002.0640] # ARABIC TATWEEL

07FA ; [*020D.0020.0002.07FA] # NKO LAJANYALAN

→

0640 ; [.0000.0000.0000.0640] # ARABIC TATWEEL

07FA ; [.0000.0000.0000.07FA] # NKO LAJANYALAN

ICU format (has LDML equivalent):

& [last tertiary ignorable]

= 'ߺ' # NKO LAJANYALAN

= 'ـ' # ARABIC TATWEEL
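The effect of the change can be checked mechanically. The following is a rough sketch that parses lines in the table format quoted above and tests whether all three weight levels are zero. The helper names are hypothetical, and the parser handles only the simple line shape shown in this document.

```python
import re

# One collation element: "[" + "." or "*" (variable) + primary.secondary.tertiary ...
ELEMENT = re.compile(r"\[([.*])([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})")


def parse_entry(line):
    """Split a table line into its code points and collation elements."""
    data = line.split("#", 1)[0]                      # drop the trailing comment
    chars, _, elements_text = data.partition(";")
    code_points = tuple(int(cp, 16) for cp in chars.split())
    elements = [
        (flag == "*", int(p, 16), int(s, 16), int(t, 16))
        for flag, p, s, t in ELEMENT.findall(elements_text)
    ]
    return code_points, elements


def is_completely_ignorable(elements):
    """True if primary, secondary, and tertiary weights are all zero."""
    return all(p == 0 and s == 0 and t == 0 for _, p, s, t in elements)


_, before = parse_entry("0640 ; [*020C.0020.0002.0640] # ARABIC TATWEEL")
_, after = parse_entry("0640 ; [.0000.0000.0000.0640] # ARABIC TATWEEL")
print(is_completely_ignorable(before), is_completely_ignorable(after))  # False True
```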

3. Grouping Punctuation (informational)

The tailored DUCET will also group all punctuation together in Variable, below the symbols. This is for discussion in the UTC, but no action is requested in 6.0.

There are two main reasons for this tailoring:

A. Punctuation and symbols generally need to be treated differently in collation. Collation sequences are closely coordinated with searching, and while it is quite common to ignore spaces or (),;..., people don't normally want to identify “INY” with “I♥NY” in searching or sorting. Because punctuation and symbols are currently intermixed in the Variable category, treating them differently is not really feasible.

B. This is also important for collation reordering. Collation reordering lets people reorder scripts and other semantically important classes in collation parametrically, without changing tables. Besides scripts, the important classes are punctuation, decimal numbers, currency symbols (Sc), and other symbols. Reordering lets people put their own script first in an index, for example Cyrillic before Latin. As another example, DIN 5007 and other standards specify that numbers should sort after letters. These categories do not have to be completely "pure", but each should contain essentially all the characters in that class, with perhaps some intermixed characters that behave similarly. Keeping each class contiguous allows reordering to treat it as a single reorderable chunk.
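The idea of parametric reordering can be sketched with a toy model. The group names, the tiny character table, and the weights below are invented for illustration and are not taken from the DUCET. The point is only that when each class is contiguous, changing the order of the groups reorders the results without touching the per-character table.

```python
# Hypothetical default order of reorderable groups, lowest first.
DEFAULT_GROUP_ORDER = ["punctuation", "symbol", "currency", "digit", "Latin", "Cyrillic"]

# Hypothetical table: character -> (group, weight within that group).
TOY_TABLE = {
    "-": ("punctuation", 1), "♥": ("symbol", 1), "$": ("currency", 1),
    "1": ("digit", 1), "2": ("digit", 2),
    "a": ("Latin", 1), "b": ("Latin", 2),
    "б": ("Cyrillic", 1), "д": ("Cyrillic", 2),
}


def sort_key(text, group_order=DEFAULT_GROUP_ORDER):
    """Build a key from (group rank, weight in group); the table itself never changes."""
    rank = {group: i for i, group in enumerate(group_order)}
    return [(rank[TOY_TABLE[ch][0]], TOY_TABLE[ch][1]) for ch in text]


words = ["b", "д", "1", "a", "б", "2"]

cyrillic_first = ["punctuation", "symbol", "currency", "digit", "Cyrillic", "Latin"]
digits_last = ["punctuation", "symbol", "currency", "Latin", "Cyrillic", "digit"]

print(sorted(words, key=sort_key))                                # ['1', '2', 'a', 'b', 'б', 'д']
print(sorted(words, key=lambda w: sort_key(w, cyrillic_first)))   # ['1', '2', 'б', 'д', 'a', 'b']
print(sorted(words, key=lambda w: sort_key(w, digits_last)))      # ['a', 'b', 'б', 'д', '1', '2']
```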

The following is the punctuation tailoring for this. Note that the relative order of the punctuation within this list is left alone (that is, it matches the DUCET). Note also that this will have very little effect on an implementation that uses alternate=shifted, since the relative order of symbols and punctuation is completely swamped by any base-letter differences between strings. For alternate=non-ignorable, it only matters where a punctuation mark in one string is compared to a symbol in another, e.g., “I♥NY” vs. “I-NY”.

ICU format (has LDML equivalent):

&              ' '              # OGHAM SPACE MARK (the last spacing mark)
<*              '᳓‾﹉﹊﹋﹌_＿﹍﹎﹏︳︴‗-－﹣֊᐀᭠᠆᠇‐‑‒–︲—﹘︱―⁓⸗〜〰゠・･,，﹐︐՝،؍٫٬߸᠂᠈꓾꘍꛵、﹑､︑﹅﹆;'
<*              '；﹔︔؛⁏꛶:：﹕︓։؞܃܄܅܆܇܈࠰࠱࠲࠳࠴࠵࠶࠷࠸࠹࠺࠻࠼࠽࠾፡፣፤፥፦᠄᠅៖᭝꧇᛫᛬᛭꛴!！﹗︕‼⁉¡՜߹᥄?？﹖'
<*              '︖⁈⁇¿⸮՞؟܉፧᥅⳺⳻꘏꛷‽⸘.．․﹒‥︰…︙᠁۔܁܂።᠃᠉᙮᭜⳹⳾⸰꓿꘎꛳。｡︒·⸱।॥꣎꣏᰻᰼꡶꡷᜵᜶꤯၊။។៕᪨'
<*              '᪩᪪᪫᭞᭟꧈꧉꩝꩞꩟꯫᱾᱿܀߷჻፨᨞᨟᭚᭛꧁꧂꧃꧄꧅꧆꧊꧋꧌꧍꛲꥟⁕'
<*              '⁖⁘⁙⁚⁛⁜⁝⁞⸪⸫⸬⸭⳼⳿⸙'＇‘’‚‛‹›"＂“”„‟〝〞〟«»(（﹙⁽₍︵'
<*              ')）﹚⁾₎︶[［﹇]］﹈{｛﹛︷}｝﹜︸༺༻༼༽᚛᚜⁅⁆⧼⧽⦃⦄⦅｟⦆｠⦇⦈⦉⦊⦋⦌⦍⦎⦏⦐⦑⦒⦓⦔⦕⦖⦗⦘⟬⟭⟮⟯⸂⸃'
<*              '⸄⸅⸉⸊⸌⸍⸜⸝⸠⸡⸢⸣⸤⸥⸦⸧⸨⸩〈︿〉﹀《︽》︾「｢﹁」｣﹂『﹃』﹄【︻】︼〔﹝︹〕﹞︺〖︗〗︘〘〙〚〛﴾﴿⁋@＠﹫'
<*              '*＊﹡⁎⁑٭꙳/／\＼﹨&＆﹠⁊#＃﹟%％﹪٪‰؉‱؊†‡•‣‧⁃⁌⁍′″‴⁗‵‶‷〃〽‸※‿⁔⁀⁐⁁⁂⸀⸁⸆⸇⸈⸋⸎⸏'
<*              '⸐⸑⸒⸓⸔⸕⸖⸚⸛⸞⸟꙾՚՛՟־׀׃׆׳״܊܋܌܍᠀᠊॰꣸꣹꣺෴๚๛꫞꫟༄༅༆༇༈༉༊࿐࿑་༌།༎༏༐༑༒྅࿒࿓࿔᰽᰾᰿'
<*              '၌၍၎၏៘៙៚᪠᪡᪢᪣᪤᪥᪦᪬᪭᙭꡴꡵꤮꧞꧟꩜๏‖❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅'
<*              '⟆⟦⟧⟨⟩⟪⟫⧘⧙⧚⧛᧞᧟'

We may consider setting Variable Top to the end of this list. That would cause alternate=shifted to affect only controls, spaces, and punctuation, not symbols. This would be done by appending the following rule to the above:

ICU format (has LDML equivalent):

< [variable top]
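The interaction between the alternate setting and Variable Top can be illustrated with a toy primary-level comparison. The weights and constant names below are hypothetical; only their relative order (punctuation below symbols below letters) reflects the tailoring. The sketch models only the primary level and omits the fourth-level weights that a real shifted implementation would produce.

```python
# Hypothetical primary weights: punctuation < symbols < letters.
TOY_PRIMARY = {"-": 0x0200, "♥": 0x0300, "I": 0x1000, "N": 0x1100, "Y": 0x1200}

LAST_PUNCTUATION = 0x02FF   # Variable Top placed after the punctuation list
LAST_SYMBOL = 0x03FF        # Variable Top placed after the symbols


def primary_key(text, alternate="non-ignorable", variable_top=LAST_SYMBOL):
    """Return the primary weights of text under the given alternate setting."""
    key = []
    for ch in text:
        weight = TOY_PRIMARY[ch]
        if alternate == "shifted" and weight <= variable_top:
            continue        # variable characters are ignored at the primary level
        key.append(weight)
    return key


# non-ignorable: punctuation sorts below symbols, so "I-NY" comes before "I♥NY"
print(primary_key("I-NY") < primary_key("I♥NY"))                            # True

# shifted, Variable Top after symbols: "I♥NY" is identified with "INY"
print(primary_key("I♥NY", "shifted") == primary_key("INY", "shifted"))      # True

# shifted, Variable Top after punctuation: the heart is no longer ignored
print(primary_key("I♥NY", "shifted", LAST_PUNCTUATION) == primary_key("INY"))  # False
```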

4. Other Collation Changes in CLDR (informational)

We have made a significant pass through the CLDR data to clean up the locale tailorings for CLDR 1.9. The exact list is at: http://unicode.org/cldr/trac/report/30. Notable among these changes is the planned removal of “backwards secondaries” from the default French collation. Users will still be able to set this option parametrically or via locale keywords (“fr-u-kb-true”) when using French (or other languages); the only change is that it will no longer be the default.

The committee is planning a PRI (Public Review Issue) for the collation changes soon after the UTC meeting.