Re: UTS#10 (collation) : request for a new "Separating" mode for variable weighting (3.2.2)

From: Mark Davis ☕ (mark@macchiato.com)
Date: Sat Jul 31 2010 - 15:50:55 CDT

    I can see why you want another option, but it is unclear to me whether
    enough people would find it useful to warrant its inclusion. There is
    also nothing preventing implementations from supporting it even if it
    isn't in the UCA standard.

    In any event, before making such a significant addition, I suspect the
    committee would want to see a functioning implementation for comparison.

    You've only looked at punctuation. It is also worth considering the impact
    on the thousands of symbols that have variable weighting; see those listed
    under Variable in http://unicode.org/charts/collation/. That is, look at
    cases like {I♥NJ, I♥♪, I♥♘,...}, especially with the approaching
    inclusion of Emoji.

    For comparison, I always look at the intermediate generated files for ICU
    rather than the UCA files directly: http://macchiato.com/utc/uca/. For
    example, the UCA_Rules_NoCE.txt files for releases can be compared
    effectively, because the weights are not included.
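
    A minimal sketch of such a release-to-release comparison, using only
    Python's standard difflib. The function name diff_rules and the file
    paths in the usage note are hypothetical; nothing here assumes a
    particular layout for the UCA_Rules_NoCE.txt lines.

```python
import difflib
from typing import List


def diff_rules(old_lines: List[str], new_lines: List[str],
               old_label: str = "old", new_label: str = "new") -> List[str]:
    """Return a unified diff between two rule files given as lists of lines."""
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=old_label, tofile=new_label,
        lineterm="",  # lines are assumed to carry no trailing newline
    ))
```

    Usage, with placeholder paths: read each release's file with
    open(path, encoding="utf-8").read().splitlines() and pass the two
    lists to diff_rules; an empty result means the rule order is unchanged.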

    Mark

    *— Il meglio è l’inimico del bene —*

    2010/7/31 Philippe Verdy <verdy_p@wanadoo.fr>

    > Something is not considered for now in UCA: the special behavior
    > introduced with Variable Weighting (UTS#10 3.2.2 Variable Weighting)
    > is too simplistic and quite confusing.
    >
    > First.
    >
    > The term chosen for the "Blanked" option is really confusing, because
    > it does not mean that these variable elements are all treated like
    > blanks (with minor differences) but that they are effectively
    > "Ignored" (only considered at the final implicit level for binary
    > ordering of scalar values, independently of the selected collation
    > level, which is still needed for sort stability). Could we change
    > this term without changing its definition?
    >
    > Second.
    >
    > The default ("Non-Ignorable") makes primary differences between all
    > the variable elements. For various reasons, there should exist a way
    > to ignore these differences at the primary level, even if these
    > variable characters are not ignored.
    >
    > UCA proposes the alternative "Shifted" option, but its effect is to
    > make ALL these differences become level 4 differences only, dropping
    > all the specially tuned sublevels defined in the DUCET, by only
    > considering the primary level to create Level 4.
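
    To make the complaint concrete, here is a minimal sketch of the
    "Shifted" mapping as I read the UTS #10 rules; it is an illustration
    under that reading, not a reference implementation. The CE type and
    the function name shifted are assumptions of the sketch; the weights
    in the usage note are taken from the DUCET lines quoted in this mail.

```python
from typing import List, NamedTuple, Tuple


class CE(NamedTuple):
    """One DUCET collation element (levels 1 to 3 only)."""
    primary: int
    secondary: int
    tertiary: int
    variable: bool = False  # True for entries written with '*' in the DUCET


def shifted(ces: List[CE]) -> List[Tuple[int, int, int, int]]:
    """Map collation elements to (L1, L2, L3, L4) under the Shifted option."""
    out = []
    after_variable = False
    for ce in ces:
        if ce.variable:
            # Levels 1-3 are zeroed; only the old primary survives, at level 4.
            out.append((0, 0, 0, ce.primary))
            after_variable = True
        elif ce.primary == 0:
            completely_ignorable = ce.secondary == 0 and ce.tertiary == 0
            if after_variable or completely_ignorable:
                out.append((0, 0, 0, 0))
            else:
                out.append((0, ce.secondary, ce.tertiary, 0xFFFF))
        else:
            # Ordinary element: weights kept, level 4 set to FFFF.
            out.append((ce.primary, ce.secondary, ce.tertiary, 0xFFFF))
            after_variable = False
    return out
```

    With the quoted DUCET values, SPACE [*020A.0020.0002] and NO-BREAK
    SPACE [*020A.0020.001B] both map to (0000, 0000, 0000, 020A): the
    tertiary difference between them is lost, which is the effect
    described above.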
    >
    > It also proposes the "Shift-Trimmed" option, which has basically the
    > same effect (variable elements keep only level-4 differences, and how
    > they are otherwise sorted is forgotten completely), except that it
    > exists just to emulate a POSIX behavior for NON-VARIABLE elements.
    > This behavior is more a bug of old POSIX implementations and should
    > not even be accepted as standard behavior for conforming UCA. UCA is
    > thus turning this bug into an acceptable behavior; it remains a hack
    > added on top of the "Shifted" option, which should not be
    > encouraged.
    >
    > Third (the longest point).
    >
    > So I propose another option: "Separating"
    > - Its effect would be to keep ALL the primary, secondary and tertiary
    > weights of "Variable" elements, and of all "Ignorable" elements
    > (primary-ignorable), but shifting them by only one level up.
    > - A single primary weight would be inserted, and it would be any one
    > of the primary weights assigned by default to variable elements (but
    > the same for all variable elements; for example it could be 020A,
    > the primary weight assigned to SPACE, or 0201, assigned to
    > HORIZONTAL TABULATION, which is the first Variable element).
    > - The single primary weight inserted for ignorable characters (with
    > zero primary weight) would also be 0000.
    > - On non-variable elements (digits, letters, sinograms...) the fourth
    > level created is assigned weight FFFF (in a way similar to the
    > "Shifted" option)
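
    The four rules above can be sketched as a small transformation on
    DUCET collation elements. This is only an illustration of the proposal
    as worded in this mail; the names CE, separating and fmt are
    hypothetical, and 0201 is used as the inserted primary because it is
    one of the two candidates the proposal mentions.

```python
from typing import NamedTuple, Tuple

SEPARATOR_PRIMARY = 0x0201  # assumption: primary of HORIZONTAL TABULATION


class CE(NamedTuple):
    """One DUCET collation element with its four listed weights."""
    primary: int
    secondary: int
    tertiary: int
    fourth: int
    variable: bool = False  # True for entries written with '*' in the DUCET


def separating(ce: CE) -> Tuple[int, int, int, int, int]:
    """Return the five weights of `ce` under the proposed Separating mode."""
    if ce.variable:
        # Insert the shared primary; the old weights move one level along.
        return (SEPARATOR_PRIMARY, ce.primary, ce.secondary, ce.tertiary, ce.fourth)
    if ce.primary == 0:
        # Primary-ignorable: the inserted primary is 0000.
        return (0, 0, ce.secondary, ce.tertiary, ce.fourth)
    # Non-variable, non-ignorable: the newly created fourth level is FFFF.
    return (ce.primary, ce.secondary, ce.tertiary, 0xFFFF, ce.fourth)


def fmt(weights: Tuple[int, ...]) -> str:
    """Format weights in DUCET-like notation, e.g. [.0201.0201.0020.0002.0009]."""
    return "[." + ".".join(f"{w:04X}" for w in weights) + "]"
```

    For example, fmt(separating(CE(0x0201, 0x0020, 0x0002, 0x0009, True)))
    gives [.0201.0201.0020.0002.0009] for HORIZONTAL TABULATION.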
    >
    > However, it has an effect on the semantics of Level 2 collation
    > (ignoring case) because there would be small secondary differences
    > within variable elements that would gain a non-zero weight at level 3.
    > Note that the DUCET already assigns non-zero weights to variable
    > elements:
    >
    > 0009 ; [*0201.0020.0002.0009] # HORIZONTAL TABULATION (in 6429)
    > 000A ; [*0202.0020.0002.000A] # LINE FEED (in 6429)
    > 000B ; [*0203.0020.0002.000B] # VERTICAL TABULATION (in 6429)
    > 000C ; [*0204.0020.0002.000C] # FORM FEED (in 6429)
    > 000D ; [*0205.0020.0002.000D] # CARRIAGE RETURN (in 6429)
    > 0020 ; [*020A.0020.0002.0020] # SPACE
    > ...
    > 0085 ; [*0206.0020.0002.0085] # NEXT LINE (in 6429)
    > 00A0 ; [*020A.0020.001B.00A0] # NO-BREAK SPACE; QQK
    > 00A1 ; [*026F.0020.0002.00A1] # INVERTED EXCLAMATION MARK
    > ...
    > 02FE ; [*041A.0020.0002.02FE] # MODIFIER LETTER OPEN SHELF
    > 02FF ; [*041B.0020.0002.02FF] # MODIFIER LETTER LOW LEFT ARROW
    >
    > 0374 ; [*03E9.0020.0002.0374] # GREEK NUMERAL SIGN; QQC
    > 0375 ; [*03EA.0020.0002.0375] # GREEK LOWER NUMERAL SIGN
    > 037E ; [*0243.0020.0002.037E] # GREEK QUESTION MARK; QQC
    > 0384 ; [*020F.0020.0002.0384] # GREEK TONOS; QQC
    > 0385 ; [*0216.0020.0002.00A8][.0000.0032.0002.0301] # GREEK DIALYTIKA TONOS; QQCM
    > 0387 ; [*0292.0020.0002.0387] # GREEK ANO TELEIA; QQC
    > 03F6 ; [*054B.0020.0002.03F6] # GREEK REVERSED LUNATE EPSILON SYMBOL
    > 0482 ; [*044C.0020.0002.0482] # CYRILLIC THOUSANDS SIGN
    >
    > 055A ; [*0392.0020.0002.055A] # ARMENIAN APOSTROPHE
    > 055B ; [*0393.0020.0002.055B] # ARMENIAN EMPHASIS MARK
    > 055C ; [*0270.0020.0002.055C] # ARMENIAN EXCLAMATION MARK
    > 055D ; [*0235.0020.0002.055D] # ARMENIAN COMMA
    > 055E ; [*0276.0020.0002.055E] # ARMENIAN QUESTION MARK
    > 055F ; [*0394.0020.0002.055F] # ARMENIAN ABBREVIATION MARK
    > 0589 ; [*0248.0020.0002.0589] # ARMENIAN FULL STOP
    > 058A ; [*0224.0020.0002.058A] # ARMENIAN HYPHEN
    >
    > 05BE ; [*0395.0020.0002.05BE] # HEBREW PUNCTUATION MAQAF
    > 05C0 ; [*0396.0020.0002.05C0] # HEBREW PUNCTUATION PASEQ
    > 05C3 ; [*0397.0020.0002.05C3] # HEBREW PUNCTUATION SOF PASUQ
    >
    > 05C6 ; [*0398.0020.0002.05C6] # HEBREW PUNCTUATION NUN HAFUKHA
    > 05F3 ; [*0399.0020.0002.05F3] # HEBREW PUNCTUATION GERESH
    > 05F4 ; [*039A.0020.0002.05F4] # HEBREW PUNCTUATION GERSHAYIM
    >
    > 0606 ; [*0566.0020.0002.0606] # ARABIC-INDIC CUBE ROOT
    > ...
    > 060F ; [*044F.0020.0002.060F] # ARABIC SIGN MISRA
    >
    > 061B ; [*0244.0020.0002.061B] # ARABIC SEMICOLON
    > ...
    > 06D4 ; [*0283.0020.0002.06D4] # ARABIC FULL STOP
    >
    > 06E9 ; [*0450.0020.0002.06E9] # ARABIC PLACE OF SAJDAH
    >
    > 0700 ; [*02B5.0020.0002.0700] # SYRIAC END OF PARAGRAPH
    > ...
    > 070D ; [*039E.0020.0002.070D] # SYRIAC HARKLEAN ASTERISCUS
    >
    > 07F6 ; [*0452.0020.0002.07F6] # NKO SYMBOL OO DENNEN
    > ...
    > 07FA ; [*020D.0020.0002.07FA] # NKO LAJANYALAN
    > 0830 ; [*0250.0020.0002.0830] # SAMARITAN PUNCTUATION NEQUDAA
    > ...
    > 083E ; [*025E.0020.0002.083E] # SAMARITAN PUNCTUATION ANNAAU
    >
    > 0964 ; [*0294.0020.0002.0964] # DEVANAGARI DANDA
    > ...
    > 0970 ; [*03A1.0020.0002.0970] # DEVANAGARI ABBREVIATION SIGN
    > 09F4 ; [*1110.0020.0002.09F4] # BENGALI CURRENCY NUMERATOR ONE
    > ...
    > 09FA ; [*0453.0020.0002.09FA] # BENGALI ISSHAR
    > 0B70 ; [*0454.0020.0002.0B70] # ORIYA ISSHAR
    > 0BF0 ; [*111C.0020.0002.0BF0] # TAMIL NUMBER TEN
    > ...
    > 0BFA ; [*045B.0020.0002.0BFA] # TAMIL NUMBER SIGN
    > 0C7F ; [*045C.0020.0002.0C7F] # TELUGU SIGN TUUMU
    > 0CF1 ; [*045D.0020.0002.0CF1] # KANNADA SIGN JIHVAMULIYA
    > 0CF2 ; [*045E.0020.0002.0CF2] # KANNADA SIGN UPADHMANIYA
    > 0D70 ; [*111F.0020.0002.0D70] # MALAYALAM NUMBER TEN
    > ...
    > 0D79 ; [*045F.0020.0002.0D79] # MALAYALAM DATE MARK
    > 0DF4 ; [*03A5.0020.0002.0DF4] # SINHALA PUNCTUATION KUNDDALIYA
    > 0E4F ; [*0467.0020.0002.0E4F] # THAI CHARACTER FONGMAN
    > ...
    > 0E5B ; [*03A7.0020.0002.0E5B] # THAI CHARACTER KHOMUT
    > 0F01 ; [*0468.0020.0002.0F01] # TIBETAN MARK GTER YIG MGO TRUNCATED A
    > ...
    > 0FD4 ; [*03BD.0020.0002.0FD4] # TIBETAN MARK CLOSING BRDA RNYING YIG MGO SGAB MA
    > 0FD5 ; [*048A.0020.0002.0FD5] # RIGHT-FACING SVASTI SIGN
    > ...
    > 0FD8 ; [*048D.0020.0002.0FD8] # LEFT-FACING SVASTI SIGN WITH DOTS
    > 104A ; [*029F.0020.0002.104A] # MYANMAR SIGN LITTLE SECTION
    > ...
    > 109F ; [*03C7.0020.0002.109F] # MYANMAR SYMBOL SHAN EXCLAMATION
    > 10FB ; [*02B7.0020.0002.10FB] # GEORGIAN PARAGRAPH SEPARATOR
    > 1360 ; [*02B8.0020.0002.1360] # ETHIOPIC SECTION MARK
    > ...
    > 1399 ; [*0425.0020.0002.1399] # ETHIOPIC TONAL MARK KURT
    > 1400 ; [*0225.0020.0002.1400] # CANADIAN SYLLABICS HYPHEN
    > ...
    > 166E ; [*0289.0020.0002.166E] # CANADIAN SYLLABICS FULL STOP
    > 1680 ; [*020B.0020.0002.1680] # OGHAM SPACE MARK
    > ...
    > 169C ; [*030A.0020.0002.169C] # OGHAM REVERSED FEATHER MARK
    > 16EB ; [*026A.0020.0002.16EB] # RUNIC SINGLE PUNCTUATION
    > ...
    > 16ED ; [*026C.0020.0002.16ED] # RUNIC CROSS PUNCTUATION
    > 1735 ; [*029C.0020.0002.1735] # PHILIPPINE SINGLE PUNCTUATION
    > 1736 ; [*029D.0020.0002.1736] # PHILIPPINE DOUBLE PUNCTUATION
    >
    > 17D4 ; [*02A1.0020.0002.17D4] # KHMER SIGN KHAN
    > ...
    > 17DA ; [*03CE.0020.0002.17DA] # KHMER SIGN KOOMUUT
    > 1800 ; [*039F.0020.0002.1800] # MONGOLIAN BIRGA
    > ...
    > 180E ; [*0207.0020.0002.180E] # MONGOLIAN VOWEL SEPARATOR
    > 1940 ; [*03C1.0020.0002.1940] # LIMBU SIGN LOO
    > ...
    > 1945 ; [*027A.0020.0002.1945] # LIMBU QUESTION MARK
    > 19E0 ; [*048E.0020.0002.19E0] # KHMER SYMBOL PATHAMASAT
    > ...
    > 19FF ; [*04AD.0020.0002.19FF] # KHMER SYMBOL DAP-PRAM ROC
    > 1A1E ; [*02BA.0020.0002.1A1E] # BUGINESE PALLAWA
    > 1A1F ; [*02BB.0020.0002.1A1F] # BUGINESE END OF SECTION
    >
    > 1AA0 ; [*03CF.0020.0002.1AA0] # TAI THAM SIGN WIANG
    > ...
    > 1AAD ; [*03D7.0020.0002.1AAD] # TAI THAM SIGN CAANG
    > 1B5A ; [*02BC.0020.0002.1B5A] # BALINESE PANTI
    > ...
    > 1B6A ; [*04B7.0020.0002.1B6A] # BALINESE MUSICAL SYMBOL DANG GEDE
    >
    > 1B74 ; [*04B8.0020.0002.1B74] # BALINESE MUSICAL SYMBOL RIGHT-HAND OPEN DUG
    > ...
    > 1B7C ; [*04C0.0020.0002.1B7C] # BALINESE MUSICAL SYMBOL LEFT-HAND OPEN PING
    > 1C3B ; [*0298.0020.0002.1C3B] # LEPCHA PUNCTUATION TA-ROL
    > ...
    > 1C3F ; [*03C0.0020.0002.1C3F] # LEPCHA PUNCTUATION TSHOOK
    > 1C7E ; [*02B3.0020.0002.1C7E] # OL CHIKI PUNCTUATION MUCAAD
    > 1C7F ; [*02B4.0020.0002.1C7F] # OL CHIKI PUNCTUATION DOUBLE MUCAAD
    >
    > 1FBD ; [*0219.0020.0002.1FBD] # GREEK KORONIS; QQC
    > ...
    > 1FFE ; [*021A.0020.0002.1FFE] # GREEK DASIA
    > 2000 ; [*020A.0020.0004.2000] # EN QUAD; QQK
    > ...
    > 200A ; [*020A.0020.0004.200A] # HAIR SPACE; QQK
    > 2010 ; [*0229.0020.0002.2010] # HYPHEN
    > ...
    > 2017 ; [*021E.0020.0002.2017] # DOUBLE LOW LINE
    > 2018 ; [*02EF.0020.0002.2018] # LEFT SINGLE QUOTATION MARK
    > ...
    > 201F ; [*02F9.0020.0002.201F] # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
    > 2020 ; [*036A.0020.0002.2020] # DAGGER
    > ...
    > 2027 ; [*036E.0020.0002.2027] # HYPHENATION POINT
    > 2028 ; [*0208.0020.0002.2028] # LINE SEPARATOR
    > 2029 ; [*0209.0020.0002.2029] # PARAGRAPH SEPARATOR
    > 202F ; [*020A.0020.001B.202F] # NARROW NO-BREAK SPACE; QQK
    > 2030 ; [*0365.0020.0002.2030] # PER MILLE SIGN
    > 2031 ; [*0367.0020.0002.2031] # PER TEN THOUSAND SIGN
    > 2032 ; [*0372.0020.0002.2032] # PRIME
    > ...
    > 2039 ; [*02F3.0020.0002.2039] # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    > ...
    > 2043 ; [*036F.0020.0002.2043] # HYPHEN BULLET
    > 2044 ; [*035D.0020.0002.2044] # FRACTION SLASH
    > 2045 ; [*030B.0020.0002.2045] # LEFT SQUARE BRACKET WITH QUILL
    > ...
    > 205E ; [*02DB.0020.0002.205E] # VERTICAL FOUR DOTS
    > 205F ; [*020A.0020.0004.205F] # MEDIUM MATHEMATICAL SPACE; QQK
    >
    > Note that I have abbreviated the list, just to exhibit that there are
    > a LOT of variable elements in the DUCET.
    >
    > (Unfortunately, the DUCET itself uses a "strange" mixed order: it
    > should be ordered in the default ("non-ignorable") order, or in
    > the code point scalar value order to avoid possible duplicates; as it
    > stands, analyzing its content is really tricky. In fact, all collation
    > graphemes (single characters or the few contractions introduced for
    > canonical equivalence with non-starter composites or for Thai/Lao
    > exceptions to the logical order) that are not fully ignorable are
    > effectively sorted in binary scalar value order, after all the fully
    > ignorables.)
    >
    > I'll take some examples of how my proposed new "Separating" mode would
    > apply to variable elements:
    >
    > 0009 ; [.0201.0201.0020.0002.0009] # HORIZONTAL TABULATION (in 6429)
    > 000A ; [.0201.0202.0020.0002.000A] # LINE FEED (in 6429)
    > 000B ; [.0201.0203.0020.0002.000B] # VERTICAL TABULATION (in 6429)
    > 000C ; [.0201.0204.0020.0002.000C] # FORM FEED (in 6429)
    > 000D ; [.0201.0205.0020.0002.000D] # CARRIAGE RETURN (in 6429)
    > 0020 ; [.0201.020A.0020.0002.0020] # SPACE
    >
    > 00A1 ; [.0201.026F.0020.0002.00A1] # INVERTED EXCLAMATION MARK
    > ...
    > 02FE ; [.0201.041A.0020.0002.02FE] # MODIFIER LETTER OPEN SHELF
    > 02FF ; [.0201.041B.0020.0002.02FF] # MODIFIER LETTER LOW LEFT ARROW
    >
    > 0374 ; [.0201.03E9.0020.0002.0374] # GREEK NUMERAL SIGN; QQC
    > 0375 ; [.0201.03EA.0020.0002.0375] # GREEK LOWER NUMERAL SIGN
    > 037E ; [.0201.0243.0020.0002.037E] # GREEK QUESTION MARK; QQC
    > 0384 ; [.0201.020F.0020.0002.0384] # GREEK TONOS; QQC
    > 0385 ; [.0201.0216.0020.0002.00A8][.0000.0000.0032.0002.0301] # GREEK DIALYTIKA TONOS; QQCM
    >
    > And other examples on the non-variable elements which are ignorable AT
    > LEAST at the primary level:
    >
    > 0000 ; [.0000.0000.0000.0000.0000] # [0000] NULL (in 6429)
    > 0001 ; [.0000.0000.0000.0000.0000] # [0001] START OF HEADING (in 6429)
    > ...
    > E01EF ; [.0000.0000.0000.0000.0000] # [E01EF] VARIATION SELECTOR-256
    >
    > 0332 ; [.0000.0000.0021.0002.0332] # COMBINING LOW LINE
    > 0313 ; [.0000.0000.0022.0002.0313] # COMBINING COMMA ABOVE
    > 0343 ; [.0000.0000.0022.0002.0343] # COMBINING GREEK KORONIS; QQC
    > 0486 ; [.0000.0000.0022.0002.0486] # COMBINING CYRILLIC PSILI PNEUMATA; QQC
    > 2CF1 ; [.0000.0000.0022.0002.2CF1] # COPTIC COMBINING SPIRITUS LENIS; QQC
    > 0314 ; [.0000.0000.002A.0002.0314] # COMBINING REVERSED COMMA ABOVE
    > 0485 ; [.0000.0000.002A.0002.0485] # COMBINING CYRILLIC DASIA PNEUMATA; QQC
    > 2CF0 ; [.0000.0000.002A.0002.2CF0] # COPTIC COMBINING SPIRITUS ASPER; QQC
    >
    > And other examples on other non-variable elements that are not
    > ignorable, or including those with expansions:
    >
    > 1D00 ; [.1213.0020.0002.FFFF.1D00] # LATIN LETTER SMALL CAPITAL A
    > ...
    > 1F00 ; [.1545.0020.0002.FFFF.03B1][.0000.0000.0022.0002.0313] # GREEK SMALL LETTER ALPHA WITH PSILI; QQCM
    > 1F08 ; [.1545.0020.0008.FFFF.0391][.0000.0000.0022.0002.0313] # GREEK CAPITAL LETTER ALPHA WITH PSILI; QQCM
    > 1F04 ; [.1545.0020.0002.FFFF.03B1][.0000.0000.0022.0002.0313][.0000.0000.0032.0002.0301] # GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA; QQCM
    >
    > Why do I propose this mode ?
    >
    > Because the existing default "Shifted" mode (or the derived
    > "Shift-Trimmed" mode for POSIX) is completely impossible to tweak
    > correctly in tailorings, without lots of corrections.
    >
    > What I propose is a safer and much simpler way to manage the
    > tailorings of variable characters that are considered as "word
    > separators": they will all collate together at the primary level,
    > as if they were all SPACEs at this level, without completely
    > sacrificing their finely tuned relative order, and without the risk
    > of forgetting some of them (notably when new versions of Unicode add
    > many new variable elements, such as pictograms and emoji symbols).
    >
    > With this mode, it will be MUCH simpler to tailor specially only very
    > few variable elements so that they will be effectively ignored at the
    > primary level. For example, in French we just want apostrophes and
    > parentheses to be treated as ignorable, all the other variable
    > elements would be treated as separators, that would keep their
    > relative order, at least in the lower levels, without having to tailor
    > them specifically. No change will be needed in the DUCET format.
    >
    > Note for example how the French rules are documented in the French
    > Wiktionary (for now it does not use UCA, but assigns primary weights
    > to all variable elements by treating most of them as SPACE, with only
    > a few of them treated as ignorable, by dropping them from the given
    > sort key): there were many months of discussions (and
    > experiments or bug reports to solve) to get the correct ordering,
    > at least at the primary level (the secondary level for diacritics is
    > NOT supported but partially emulated without support for backward
    > ordering and with some caveats as it uses the binary ordering, and
    > only the tertiary level for case differences is almost fully
    > supported, by sorting all lowercase before all
    > uppercase/titlecase/final variants).
    >
    > http://fr.wiktionary.org/wiki/Modèle:clé_de_tri
    >
    > As MediaWiki is about to introduce UCA-based collation for sorting
    > categories, this should be done without sacrificing the solutions to
    > the various issues that have been solved at the primary level (which
    > has been stable since 2007 and causes problems to nobody, except when
    > handling languages other than French, because there's no possibility
    > to support the Spanish 'll' contractions or Catalan 'l·' contractions,
    > or Swedish 'ch' contractions, with a single sort key for multilingual
    > articles; note also that French order needs absolutely no contraction).
    >
    > My proposal addresses a general need for the primary level ONLY, other
    > levels won't be affected within the relative ordering of texts that
    > sort as equal in this "new" primary level, so we keep all the
    > advantages of the DUCET. Some needs that it addresses at the primary
    > level:
    >
    > - It sorts "aujourd’hui" with "aujourdhui" by only tailoring the
    > apostrophe in this mode so that it becomes ignorable (primary weight
    > 0000 instead of 0201 in the new mode)
    > - It sorts "fleur(s)" with "fleurs" by only tailoring the parentheses
    > (primary weight 0000 instead of 0201 in the new mode)
    > - It sorts "km/h" with "km h" without tailoring (primary weight 0201
    > in the new mode)
    > - It sorts all symbols with " " without tailoring (primary weight 0201
    > in the new mode), while keeping all their relative order.
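
    The primary-level claims in this list can be checked with a toy sort
    key. This is a sketch under loud assumptions: the VARIABLE set is a
    tiny stand-in for the DUCET variable elements, the letter weights are
    made-up placeholders (not DUCET values), and primary_key and
    FRENCH_IGNORED are hypothetical names. Only the treatment of variable
    and tailored-ignorable characters follows the proposal.

```python
from typing import FrozenSet, Tuple

SEPARATOR = 0x0201  # shared primary weight given to every variable element
VARIABLE = frozenset(" /()'\u2019")  # assumption: small sample of variable characters


def primary_key(text: str, ignored: FrozenSet[str] = frozenset()) -> Tuple[int, ...]:
    """Build a primary-level key in the spirit of the Separating mode."""
    key = []
    for ch in text:
        if ch in ignored:
            # Tailored to primary weight 0000: contributes nothing at this level.
            continue
        if ch in VARIABLE:
            key.append(SEPARATOR)
        else:
            # Placeholder primary weight, NOT a DUCET value; case is folded
            # because case is not a primary difference.
            key.append(0x1000 + ord(ch.lower()))
    return tuple(key)


# Hypothetical French tailoring: apostrophes and parentheses become ignorable.
FRENCH_IGNORED = frozenset("'\u2019()")
```

    Under these assumptions, primary_key("aujourd’hui", FRENCH_IGNORED)
    equals primary_key("aujourdhui", FRENCH_IGNORED), and "km/h" and
    "km h" get the same key with no tailoring at all.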
    >
    > Philippe.
    >


