UTS#10 (collation) : request for a new "Separating" mode for variable weighting (3.2.2)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jul 31 2010 - 08:19:01 CDT

Next message: CE Whitehead: "RE: UTS#10 (collation) : French backwards level 2, and word-breakers."

Previous message: Andrew West: "Re: Indian new rupee sign"
Next in thread: Mark Davis ☕: "Re: UTS#10 (collation) : request for a new "Separating" mode for variable weighting (3.2.2)"
Reply: Mark Davis ☕: "Re: UTS#10 (collation) : request for a new "Separating" mode for variable weighting (3.2.2)"

UCA currently overlooks something: the special behavior introduced
with Variable Weighting (UTS#10 3.2.2 Variable Weighting) is too
simplistic and quite confusing.

First.

The term chosen for the "Blanked" option is really confusing: it
does not mean that the variable elements are all treated like
blanks (with minor differences), but that they are effectively
"Ignored" (only considered at the final implicit level for binary
ordering of scalar values, independently of the selected collation
level, and which is still needed for sort stability). Could we change
this term, without changing its definition?

Second.

The default ("Non-Ignorable") makes primary differences between all
the variable elements. For various reasons, there should exist a way
to ignore these differences at the primary level, even if these
variable characters are not ignored.

UCA proposes the alternative "Shifted" option, but its effect is to
make ALL these differences level-4 differences only, dropping all the
specially tuned sublevels defined in the DUCET and using only the
primary weight to create Level 4.

It also proposes the "Shift-Trimmed" option, which has basically the
same effect (variable elements have only level-4 differences, and how
they are otherwise sorted is completely forgotten), except that it
emulates a POSIX behavior for NON-VARIABLE elements. This behavior
is more a bug of old POSIX implementations, and it should not even
be accepted as standard behavior for a conforming UCA. UCA is
turning this bug into an acceptable behavior, but it remains a
hack added on top of the "Shifted" option, and should not be
encouraged.
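
For reference, here is a minimal Python sketch of what the existing
"Shifted" option does to a sequence of collation elements, as I read
UTS#10 3.2.2. The representation (a boolean variable flag plus three
weights) and the function name are assumptions made for illustration,
not part of any real library.

```python
from typing import List, Tuple

# Illustrative representation: (is_variable, primary, secondary, tertiary).
CE = Tuple[bool, int, int, int]


def shifted(ces: List[CE]) -> List[Tuple[int, int, int, int]]:
    """Return four-level weights for each element under the "Shifted" option."""
    out = []
    after_variable = False  # True while the last non-ignorable element was variable
    for variable, p, s, t in ces:
        if variable:
            # Levels 1-3 are zeroed; the old primary becomes the level-4 weight.
            out.append((0, 0, 0, p))
            after_variable = True
        elif p == 0 and s == 0 and t == 0:
            # Completely ignorable: stays all zero and does not change the state.
            out.append((0, 0, 0, 0))
        elif p == 0 and after_variable:
            # Ignorable (e.g. a combining mark) following a variable: zeroed too.
            out.append((0, 0, 0, 0))
        elif p == 0:
            # Ignorable not following a variable: keeps its weights, level 4 = FFFF.
            out.append((0, s, t, 0xFFFF))
        else:
            # Ordinary non-variable element: level 4 = FFFF.
            out.append((p, s, t, 0xFFFF))
            after_variable = False
    return out
```

The secondary and tertiary weights of a variable element are simply
discarded here, which is the loss of "specially tuned sublevels"
described above.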

Third (the longest point).

So I propose another option: "Separating"
- Its effect would be to keep ALL the primary, secondary and tertiary
weights of "Variable" elements and of all "Ignorable"
(primary-ignorable) elements, but shifting them by only one level up.
- A single primary weight would be inserted for variable elements. It
could be any one of the primary weights assigned by default to
variable elements, but the same for all of them: for example 0209,
assigned as the primary weight for SPACE, or 0201, assigned to
HORIZONTAL TABULATION, which is the first Variable element.
- The single primary weight inserted for ignorable characters (with
zero primary weight) would also be 0000.
- On non-variable elements (digits, letters, sinograms...) the fourth
level created is assigned weight FFFF (in a way similar to the
"Shifted" option).

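The rules of the proposed "Separating" mode can be sketched in Python
as follows. This is only an illustration under stated assumptions: the
function name and tuple layout are hypothetical, the inserted primary
is taken to be 0201 (one of the two candidates mentioned above), and
the placement of weights follows the worked examples given later in
this message.

```python
from typing import Tuple

# Assumed choice of the single inserted primary weight (HORIZONTAL TABULATION).
SEPARATOR_PRIMARY = 0x0201


def separating(variable: bool, p: int, s: int, t: int) -> Tuple[int, int, int, int]:
    """Map one DUCET element (flag + 3 weights) to the 4 proposed weights."""
    if variable:
        # Variable: insert the common primary, keep the old P.S.T one level further.
        return (SEPARATOR_PRIMARY, p, s, t)
    if p == 0:
        # Primary-ignorable (including fully ignorable): insert a zero primary.
        return (0, 0, s, t)
    # Non-variable, non-ignorable: unchanged weights, fourth level set to FFFF.
    return (p, s, t, 0xFFFF)
```

For instance, an element listed as [*020A.0020.0002] would become
[.0201.020A.0020.0002], so all variable elements tie at level 1 while
their original relative order survives at the following levels.
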
However, it has an effect on the semantics of Level 2 collation
(ignoring case), because the small secondary differences within
variable elements would gain a non-zero weight at level 3.
Note that the DUCET already assigns non-zero weights to variable
elements:

0009 ; [*0201.0020.0002.0009] # HORIZONTAL TABULATION (in 6429)
000A ; [*0202.0020.0002.000A] # LINE FEED (in 6429)
000B ; [*0203.0020.0002.000B] # VERTICAL TABULATION (in 6429)
000C ; [*0204.0020.0002.000C] # FORM FEED (in 6429)
000D ; [*0205.0020.0002.000D] # CARRIAGE RETURN (in 6429)
0020 ; [*020A.0020.0002.0020] # SPACE
...
0085 ; [*0206.0020.0002.0085] # NEXT LINE (in 6429)
00A0 ; [*020A.0020.001B.00A0] # NO-BREAK SPACE; QQK
00A1 ; [*026F.0020.0002.00A1] # INVERTED EXCLAMATION MARK
...
02FE ; [*041A.0020.0002.02FE] # MODIFIER LETTER OPEN SHELF
02FF ; [*041B.0020.0002.02FF] # MODIFIER LETTER LOW LEFT ARROW

0374 ; [*03E9.0020.0002.0374] # GREEK NUMERAL SIGN; QQC
0375 ; [*03EA.0020.0002.0375] # GREEK LOWER NUMERAL SIGN
037E ; [*0243.0020.0002.037E] # GREEK QUESTION MARK; QQC
0384 ; [*020F.0020.0002.0384] # GREEK TONOS; QQC
0385 ; [*0216.0020.0002.00A8][.0000.0032.0002.0301] # GREEK DIALYTIKA TONOS; QQCM
0387 ; [*0292.0020.0002.0387] # GREEK ANO TELEIA; QQC
03F6 ; [*054B.0020.0002.03F6] # GREEK REVERSED LUNATE EPSILON SYMBOL
0482 ; [*044C.0020.0002.0482] # CYRILLIC THOUSANDS SIGN

055A ; [*0392.0020.0002.055A] # ARMENIAN APOSTROPHE
055B ; [*0393.0020.0002.055B] # ARMENIAN EMPHASIS MARK
055C ; [*0270.0020.0002.055C] # ARMENIAN EXCLAMATION MARK
055D ; [*0235.0020.0002.055D] # ARMENIAN COMMA
055E ; [*0276.0020.0002.055E] # ARMENIAN QUESTION MARK
055F ; [*0394.0020.0002.055F] # ARMENIAN ABBREVIATION MARK
0589 ; [*0248.0020.0002.0589] # ARMENIAN FULL STOP
058A ; [*0224.0020.0002.058A] # ARMENIAN HYPHEN

05BE ; [*0395.0020.0002.05BE] # HEBREW PUNCTUATION MAQAF
05C0 ; [*0396.0020.0002.05C0] # HEBREW PUNCTUATION PASEQ
05C3 ; [*0397.0020.0002.05C3] # HEBREW PUNCTUATION SOF PASUQ

05C6 ; [*0398.0020.0002.05C6] # HEBREW PUNCTUATION NUN HAFUKHA
05F3 ; [*0399.0020.0002.05F3] # HEBREW PUNCTUATION GERESH
05F4 ; [*039A.0020.0002.05F4] # HEBREW PUNCTUATION GERSHAYIM

0606 ; [*0566.0020.0002.0606] # ARABIC-INDIC CUBE ROOT
...
060F ; [*044F.0020.0002.060F] # ARABIC SIGN MISRA

061B ; [*0244.0020.0002.061B] # ARABIC SEMICOLON
...
06D4 ; [*0283.0020.0002.06D4] # ARABIC FULL STOP

06E9 ; [*0450.0020.0002.06E9] # ARABIC PLACE OF SAJDAH

0700 ; [*02B5.0020.0002.0700] # SYRIAC END OF PARAGRAPH
...
070D ; [*039E.0020.0002.070D] # SYRIAC HARKLEAN ASTERISCUS

07F6 ; [*0452.0020.0002.07F6] # NKO SYMBOL OO DENNEN
...
07FA ; [*020D.0020.0002.07FA] # NKO LAJANYALAN
0830 ; [*0250.0020.0002.0830] # SAMARITAN PUNCTUATION NEQUDAA
...
083E ; [*025E.0020.0002.083E] # SAMARITAN PUNCTUATION ANNAAU

0964 ; [*0294.0020.0002.0964] # DEVANAGARI DANDA
...
0970 ; [*03A1.0020.0002.0970] # DEVANAGARI ABBREVIATION SIGN
09F4 ; [*1110.0020.0002.09F4] # BENGALI CURRENCY NUMERATOR ONE
...
09FA ; [*0453.0020.0002.09FA] # BENGALI ISSHAR
0B70 ; [*0454.0020.0002.0B70] # ORIYA ISSHAR
0BF0 ; [*111C.0020.0002.0BF0] # TAMIL NUMBER TEN
...
0BFA ; [*045B.0020.0002.0BFA] # TAMIL NUMBER SIGN
0C7F ; [*045C.0020.0002.0C7F] # TELUGU SIGN TUUMU
0CF1 ; [*045D.0020.0002.0CF1] # KANNADA SIGN JIHVAMULIYA
0CF2 ; [*045E.0020.0002.0CF2] # KANNADA SIGN UPADHMANIYA
0D70 ; [*111F.0020.0002.0D70] # MALAYALAM NUMBER TEN
...
0D79 ; [*045F.0020.0002.0D79] # MALAYALAM DATE MARK
0DF4 ; [*03A5.0020.0002.0DF4] # SINHALA PUNCTUATION KUNDDALIYA
0E4F ; [*0467.0020.0002.0E4F] # THAI CHARACTER FONGMAN
...
0E5B ; [*03A7.0020.0002.0E5B] # THAI CHARACTER KHOMUT
0F01 ; [*0468.0020.0002.0F01] # TIBETAN MARK GTER YIG MGO TRUNCATED A
...
0FD4 ; [*03BD.0020.0002.0FD4] # TIBETAN MARK CLOSING BRDA RNYING YIG MGO SGAB MA
0FD5 ; [*048A.0020.0002.0FD5] # RIGHT-FACING SVASTI SIGN
...
0FD8 ; [*048D.0020.0002.0FD8] # LEFT-FACING SVASTI SIGN WITH DOTS
104A ; [*029F.0020.0002.104A] # MYANMAR SIGN LITTLE SECTION
...
109F ; [*03C7.0020.0002.109F] # MYANMAR SYMBOL SHAN EXCLAMATION
10FB ; [*02B7.0020.0002.10FB] # GEORGIAN PARAGRAPH SEPARATOR
1360 ; [*02B8.0020.0002.1360] # ETHIOPIC SECTION MARK
...
1399 ; [*0425.0020.0002.1399] # ETHIOPIC TONAL MARK KURT
1400 ; [*0225.0020.0002.1400] # CANADIAN SYLLABICS HYPHEN
...
166E ; [*0289.0020.0002.166E] # CANADIAN SYLLABICS FULL STOP
1680 ; [*020B.0020.0002.1680] # OGHAM SPACE MARK
...
169C ; [*030A.0020.0002.169C] # OGHAM REVERSED FEATHER MARK
16EB ; [*026A.0020.0002.16EB] # RUNIC SINGLE PUNCTUATION
...
16ED ; [*026C.0020.0002.16ED] # RUNIC CROSS PUNCTUATION
1735 ; [*029C.0020.0002.1735] # PHILIPPINE SINGLE PUNCTUATION
1736 ; [*029D.0020.0002.1736] # PHILIPPINE DOUBLE PUNCTUATION

17D4 ; [*02A1.0020.0002.17D4] # KHMER SIGN KHAN
...
17DA ; [*03CE.0020.0002.17DA] # KHMER SIGN KOOMUUT
1800 ; [*039F.0020.0002.1800] # MONGOLIAN BIRGA
...
180E ; [*0207.0020.0002.180E] # MONGOLIAN VOWEL SEPARATOR
1940 ; [*03C1.0020.0002.1940] # LIMBU SIGN LOO
...
1945 ; [*027A.0020.0002.1945] # LIMBU QUESTION MARK
19E0 ; [*048E.0020.0002.19E0] # KHMER SYMBOL PATHAMASAT
...
19FF ; [*04AD.0020.0002.19FF] # KHMER SYMBOL DAP-PRAM ROC
1A1E ; [*02BA.0020.0002.1A1E] # BUGINESE PALLAWA
1A1F ; [*02BB.0020.0002.1A1F] # BUGINESE END OF SECTION

1AA0 ; [*03CF.0020.0002.1AA0] # TAI THAM SIGN WIANG
...
1AAD ; [*03D7.0020.0002.1AAD] # TAI THAM SIGN CAANG
1B5A ; [*02BC.0020.0002.1B5A] # BALINESE PANTI
...
1B6A ; [*04B7.0020.0002.1B6A] # BALINESE MUSICAL SYMBOL DANG GEDE

1B74 ; [*04B8.0020.0002.1B74] # BALINESE MUSICAL SYMBOL RIGHT-HAND OPEN DUG
...
1B7C ; [*04C0.0020.0002.1B7C] # BALINESE MUSICAL SYMBOL LEFT-HAND OPEN PING
1C3B ; [*0298.0020.0002.1C3B] # LEPCHA PUNCTUATION TA-ROL
...
1C3F ; [*03C0.0020.0002.1C3F] # LEPCHA PUNCTUATION TSHOOK
1C7E ; [*02B3.0020.0002.1C7E] # OL CHIKI PUNCTUATION MUCAAD
1C7F ; [*02B4.0020.0002.1C7F] # OL CHIKI PUNCTUATION DOUBLE MUCAAD

1FBD ; [*0219.0020.0002.1FBD] # GREEK KORONIS; QQC
...
1FFE ; [*021A.0020.0002.1FFE] # GREEK DASIA
2000 ; [*020A.0020.0004.2000] # EN QUAD; QQK
...
200A ; [*020A.0020.0004.200A] # HAIR SPACE; QQK
2010 ; [*0229.0020.0002.2010] # HYPHEN
...
2017 ; [*021E.0020.0002.2017] # DOUBLE LOW LINE
2018 ; [*02EF.0020.0002.2018] # LEFT SINGLE QUOTATION MARK
...
201F ; [*02F9.0020.0002.201F] # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
2020 ; [*036A.0020.0002.2020] # DAGGER
...
2027 ; [*036E.0020.0002.2027] # HYPHENATION POINT
2028 ; [*0208.0020.0002.2028] # LINE SEPARATOR
2029 ; [*0209.0020.0002.2029] # PARAGRAPH SEPARATOR
202F ; [*020A.0020.001B.202F] # NARROW NO-BREAK SPACE; QQK
2030 ; [*0365.0020.0002.2030] # PER MILLE SIGN
2031 ; [*0367.0020.0002.2031] # PER TEN THOUSAND SIGN
2032 ; [*0372.0020.0002.2032] # PRIME
...
2039 ; [*02F3.0020.0002.2039] # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
...
2043 ; [*036F.0020.0002.2043] # HYPHEN BULLET
2044 ; [*035D.0020.0002.2044] # FRACTION SLASH
2045 ; [*030B.0020.0002.2045] # LEFT SQUARE BRACKET WITH QUILL
...
205E ; [*02DB.0020.0002.205E] # VERTICAL FOUR DOTS
205F ; [*020A.0020.0004.205F] # MEDIUM MATHEMATICAL SPACE; QQK

Note that I have abbreviated the list, just to exhibit that there are
a LOT of variable elements in the DUCET.

(Unfortunately, the DUCET file itself uses a "strange" mixed order: it
should be ordered either in the default ("non-ignorable") order, or in
code point scalar value order to avoid possible duplicates. As it is,
analyzing its content is really tricky. In fact, all collation
graphemes (single characters, or the few contractions introduced for
canonical equivalence with non-starter composites or for Thai/Lao
exceptions to the logical order) that are not fully ignorable are
effectively sorted in binary scalar value order, after all the fully
ignorables.)
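
To make such an analysis less tricky, the entries can be parsed
mechanically. The following Python sketch reads one line in the format
shown in the list above (code points, then one or more bracketed
collation elements, then a comment). The names `Entry` and
`parse_entry` are hypothetical, and the sketch handles only the line
shape visible in this message, not every directive of the real file.

```python
import re
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    code_points: Tuple[int, ...]
    # One (is_variable, weights) pair per collation element in the entry.
    elements: List[Tuple[bool, Tuple[int, ...]]]
    comment: str


# "[*0201.0020.0002.0009]" -> marker "*" (variable) or "." (non-variable), then weights.
CE_PATTERN = re.compile(r"\[([.*])([0-9A-Fa-f.]+)\]")


def parse_entry(line: str) -> Entry:
    """Parse a line such as '0020 ; [*020A.0020.0002.0020] # SPACE'."""
    body, _, comment = line.partition("#")
    chars, _, ces = body.partition(";")
    code_points = tuple(int(cp, 16) for cp in chars.split())
    elements = [
        (marker == "*", tuple(int(w, 16) for w in weights.split(".")))
        for marker, weights in CE_PATTERN.findall(ces)
    ]
    return Entry(code_points, elements, comment.strip())
```

With entries parsed this way, counting the variable elements or
re-sorting the table by primary weight becomes a one-line filter or
sort over the resulting list.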

I'll take some examples of how my proposed new "Separating" mode would
apply to variable elements:

0009 ; [.0201.0201.0020.0002.0009] # HORIZONTAL TABULATION (in 6429)
000A ; [.0201.0202.0020.0002.000A] # LINE FEED (in 6429)
000B ; [.0201.0203.0020.0002.000B] # VERTICAL TABULATION (in 6429)
000C ; [.0201.0204.0020.0002.000C] # FORM FEED (in 6429)
000D ; [.0201.0205.0020.0002.000D] # CARRIAGE RETURN (in 6429)
0020 ; [.0201.020A.0020.0002.0020] # SPACE

00A1 ; [.0201.026F.0020.0002.00A1] # INVERTED EXCLAMATION MARK
...
02FE ; [.0201.041A.0020.0002.02FE] # MODIFIER LETTER OPEN SHELF
02FF ; [.0201.041B.0020.0002.02FF] # MODIFIER LETTER LOW LEFT ARROW

0374 ; [.0201.03E9.0020.0002.0374] # GREEK NUMERAL SIGN; QQC
0375 ; [.0201.03EA.0020.0002.0375] # GREEK LOWER NUMERAL SIGN
037E ; [.0201.0243.0020.0002.037E] # GREEK QUESTION MARK; QQC
0384 ; [.0201.020F.0020.0002.0384] # GREEK TONOS; QQC
0385 ; [.0201.0216.0020.0002.00A8][.0000.0000.0032.0002.0301] # GREEK DIALYTIKA TONOS; QQCM

And other examples on the non-variable elements which are ignorable AT
LEAST at the primary level:

0000 ; [.0000.0000.0000.0000.0000] # [0000] NULL (in 6429)
0001 ; [.0000.0000.0000.0000.0000] # [0001] START OF HEADING (in 6429)
...
E01EF ; [.0000.0000.0000.0000.0000] # [E01EF] VARIATION SELECTOR-256

0332 ; [.0000.0000.0021.0002.0332] # COMBINING LOW LINE
0313 ; [.0000.0000.0022.0002.0313] # COMBINING COMMA ABOVE
0343 ; [.0000.0000.0022.0002.0343] # COMBINING GREEK KORONIS; QQC
0486 ; [.0000.0000.0022.0002.0486] # COMBINING CYRILLIC PSILI PNEUMATA; QQC
2CF1 ; [.0000.0000.0022.0002.2CF1] # COPTIC COMBINING SPIRITUS LENIS; QQC
0314 ; [.0000.0000.002A.0002.0314] # COMBINING REVERSED COMMA ABOVE
0485 ; [.0000.0000.002A.0002.0485] # COMBINING CYRILLIC DASIA PNEUMATA; QQC
2CF0 ; [.0000.0000.002A.0002.2CF0] # COPTIC COMBINING SPIRITUS ASPER; QQC

And other examples on other non-variable elements that are not
ignorable, or including those with expansions:

1D00 ; [.1213.0020.0002.FFFF.1D00] # LATIN LETTER SMALL CAPITAL A
...
1F00 ; [.1545.0020.0002.FFFF.03B1][.0000.0000.0022.0002.0313] # GREEK SMALL LETTER ALPHA WITH PSILI; QQCM
1F08 ; [.1545.0020.0008.FFFF.0391][.0000.0000.0022.0002.0313] # GREEK CAPITAL LETTER ALPHA WITH PSILI; QQCM
1F04 ; [.1545.0020.0002.FFFF.03B1][.0000.0000.0022.0002.0313][.0000.0000.0032.0002.0301] # GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA; QQCM

Why do I propose this mode ?

Because the existing default "Shifted" mode (or the derived
"Shift-Trimmed" mode for POSIX) is completely impossible to tweak
correctly in tailorings, without lots of corrections.

What I propose is a safer and much simpler way to manage the
tailorings of variable characters that are considered as "word
separators", as they will collate all together at the primary level,
as if they were all SPACEs at this level, without sacrificing
completely their finely-tuned relative order, and without the risk of
forgetting some of them (notably when new versions of Unicode add many
new variable elements, such as symbols like pictograms and emojis).

With this mode, it will be MUCH simpler to tailor just a very few
variable elements so that they are effectively ignored at the
primary level. For example, in French we just want apostrophes and
parentheses to be treated as ignorable; all the other variable
elements would be treated as separators that keep their
relative order, at least in the lower levels, without having to tailor
them specifically. No change will be needed in the DUCET format.

Note for example how the French rules are documented in the French
Wiktionary. For now it does not use UCA, but assigns primary weights
to all variable elements by treating most of them as SPACE, with only
a few of them treated as ignorable by dropping them from the given
sort key. It took many months of discussion (with experiments and
bug reports to resolve) to get the correct ordering, at least at the
primary level. The secondary level for diacritics is NOT supported
but partially emulated, without support for backward ordering and
with some caveats, as it uses binary ordering. Only the tertiary
level for case differences is almost fully supported, by sorting all
lowercase before all uppercase/titlecase/final variants.

http://fr.wiktionary.org/wiki/Modèle:clé_de_tri

As MediaWiki is about to introduce UCA-based collation for sorting
categories, this should be done without sacrificing the solutions to
the various issues that have been solved at the primary level. That
level has been stable since 2007 and causes problems to nobody, except
when handling languages other than French, because there is no way to
support the Spanish 'll' contractions, Catalan 'l·' contractions, or
Swedish 'ch' contractions with a single sort key for multilingual
articles. Note also that French order needs absolutely no contraction.

My proposal addresses a general need at the primary level ONLY. Other
levels won't be affected in the relative ordering of texts that
sort as equal at this "new" primary level, so we keep all the
advantages of the DUCET. Some needs that it addresses at the primary
level:

- It sorts "aujourd’hui" with "aujourdhui" by tailoring only the
apostrophe in this mode so that it becomes ignorable (primary weight
0000 instead of 0201 in the new mode).
- It sorts "fleur(s)" with "fleurs" by tailoring only the parentheses
(primary weight 0000 instead of 0201 in the new mode).
- It sorts "km/h" with "km h" without tailoring (primary weight 0201
in the new mode).
- It sorts all symbols with " " without tailoring (primary weight 0201
in the new mode), while keeping all their relative order.
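
These four cases can be checked with a small Python sketch of a
primary-level key under the proposed mode. Everything here is a
simplified assumption for illustration: `isalnum()` stands in for
"non-variable", the letter weights are placeholders rather than real
DUCET primaries, and the set of French-tailored characters contains
only the apostrophes and parentheses mentioned above.

```python
SEPARATOR_PRIMARY = 0x0201  # assumed common primary for all variable elements

# Hypothetical French tailoring: these become primary-ignorable (weight 0000).
FRENCH_IGNORED = {"'", "\u2019", "(", ")"}


def primary_key(text: str) -> tuple:
    """Primary-level sort key in the proposed mode (placeholder letter weights)."""
    key = []
    for ch in text:
        if ch in FRENCH_IGNORED:
            continue  # primary weight 0000 contributes nothing at level 1
        if ch.isalnum():
            # Placeholder primary, NOT a DUCET value; case is folded because
            # case differences are tertiary. Accents are not handled here.
            key.append(0x1000 + ord(ch.lower()))
        else:
            # Every other (variable) character collates as one separator.
            key.append(SEPARATOR_PRIMARY)
    return tuple(key)


# The four cases listed above, compared at the primary level only.
same_apostrophe = primary_key("aujourd\u2019hui") == primary_key("aujourdhui")
same_parentheses = primary_key("fleur(s)") == primary_key("fleurs")
same_separator = primary_key("km/h") == primary_key("km h")
```

Strings that tie on this key would then be ordered by the lower levels,
where the original DUCET weights of the variable elements still apply.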

Philippe.

