UTS#10 (collation) : request for a new "Separating" mode for variable weighting (3.2.2)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jul 31 2010 - 08:19:01 CDT


    Something is not considered for now in UCA: the special behavior
    introduced with Variable Weighting (UTS#10 3.2.2 Variable Weighting)
    is too simplistic and quite confusing.

    First.

    The term chosen for the "Blanked" option is really confusing, as it
    does not mean that these variable elements are all treated like
    blanks (with minor differences) but that they are effectively
    "Ignored" (only considered at the final implicit level for binary
    ordering of scalar values, independently of the selected collation
    level, and which is still needed for sort stability). Could we
    change this term, without changing its definition?

    Second.

    The default ("Non-Ignorable") makes primary differences between all
    the variable elements. For various reasons, there should exist a way
    to ignore these differences at the primary level, even if these
    variable characters are not ignored.

    UCA proposes the alternative "Shifted" option, but its effect is to
    make ALL these differences become level 4 differences only, dropping
    all the specially tuned sublevels defined in the DUCET, by only
    considering the primary level to create Level 4.
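
    To make the loss concrete, here is a minimal Python sketch of a
    simplified "Shifted" transform. The CE type and the shifted() function
    are hypothetical names of mine, and the sketch deliberately omits the
    UTS#10 rule for ignorables that follow a variable element. The weights
    for SPACE and NO-BREAK SPACE are the ones quoted from the DUCET in
    this message.

```python
from typing import NamedTuple


class CE(NamedTuple):
    """One collation element; `variable` is True for entries marked '*'."""
    variable: bool
    primary: int
    secondary: int
    tertiary: int


def shifted(ce):
    """Return (L1, L2, L3, L4) under a simplified "Shifted" option.

    Assumption: the special case for ignorables that follow a variable
    element is left out to keep the sketch short.
    """
    if ce.variable:
        # Levels 1-3 are zeroed; only the old primary survives, at level 4.
        return (0, 0, 0, ce.primary)
    if ce.primary == 0 and ce.secondary == 0 and ce.tertiary == 0:
        # Completely ignorable elements stay completely ignorable.
        return (0, 0, 0, 0)
    return (ce.primary, ce.secondary, ce.tertiary, 0xFFFF)


space = CE(True, 0x020A, 0x0020, 0x0002)  # 0020 SPACE
nbsp = CE(True, 0x020A, 0x0020, 0x001B)   # 00A0 NO-BREAK SPACE

# The tertiary difference between the two is gone after shifting.
print(shifted(space) == shifted(nbsp))  # True
```

    The two elements differ only at the tertiary level in the DUCET, and
    that difference no longer exists once they are shifted.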

    It also proposes the "Shift-Trimmed" option, which has basically the
    same effect (variable elements get only level-4 differences, and how
    they were sorted is completely forgotten), except that it exists just
    to emulate a POSIX behavior for NON-VARIABLE elements. This behavior
    is more a bug of old POSIX implementations, and it should not even
    be accepted as standard behavior for a conforming UCA. Instead, UCA
    turns this bug into an acceptable behavior; it remains a hack added
    on top of the "Shifted" option, and should not be encouraged.

    Third (the longest point).

    So I propose another option: "Separating"
    - Its effect would be to keep ALL the primary, secondary and tertiary
    weights of "Variable" elements and of all "Ignorable" elements
    (primary-ignorable), but shifting each of them by only one level
    (primary to level 2, secondary to level 3, tertiary to level 4).
    - A single primary weight would be inserted, and it would be any one
    of the primary weights assigned by default to variable elements (but
    the same for all variable elements; for example it could be 020A,
    assigned as the primary weight for SPACE, or 0201, assigned to
    HORIZONTAL TABULATION, which is the first Variable element).
    - The single primary weight inserted for ignorable characters (with
    zero primary weight) would also be 0000.
    - On non-variable elements (digits, letters, sinograms...) the fourth
    level created is assigned weight FFFF (in a way similar to the
    "Shifted" option).

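    The rules listed above could be sketched in Python as follows. This
    is only an illustration of the proposal, not an implementation of
    anything in UTS#10: the CE type, the separating() function and the
    SEPARATOR_PRIMARY constant are hypothetical names, and 0201 is just
    the inserted weight used in the examples later in this message.

```python
from typing import NamedTuple

# Assumption: the single primary weight inserted for every variable element.
SEPARATOR_PRIMARY = 0x0201


class CE(NamedTuple):
    """One collation element; `variable` is True for entries marked '*'."""
    variable: bool
    primary: int
    secondary: int
    tertiary: int


def separating(ce):
    """Return (L1, L2, L3, L4) under the proposed "Separating" mode."""
    if ce.variable:
        # Insert the common primary, push the DUCET weights one level on.
        return (SEPARATOR_PRIMARY, ce.primary, ce.secondary, ce.tertiary)
    if ce.primary == 0:
        # Primary-ignorable: the inserted primary is 0000, the rest shifts.
        return (0, 0, ce.secondary, ce.tertiary)
    # Other elements keep their three levels and gain FFFF at level 4.
    return (ce.primary, ce.secondary, ce.tertiary, 0xFFFF)


tab = CE(True, 0x0201, 0x0020, 0x0002)        # 0009 HORIZONTAL TABULATION
low_line = CE(False, 0x0000, 0x0021, 0x0002)  # 0332 COMBINING LOW LINE
small_a = CE(False, 0x1213, 0x0020, 0x0002)   # 1D00 LATIN LETTER SMALL CAPITAL A

for ce in (tab, low_line, small_a):
    print(".".join(f"{w:04X}" for w in separating(ce)))
```

    The three results correspond to the first four weights of the entries
    shown for 0009, 0332 and 1D00 in the examples further down.
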
    However, it has an effect on the semantics of Level 2 collation
    (ignoring case) because there would be small secondary differences
    within variable elements that would gain a non-zero weight at level 3.
    Note that the DUCET already assigns non-zero weights to variable
    elements:

    0009 ; [*0201.0020.0002.0009] # HORIZONTAL TABULATION (in 6429)
    000A ; [*0202.0020.0002.000A] # LINE FEED (in 6429)
    000B ; [*0203.0020.0002.000B] # VERTICAL TABULATION (in 6429)
    000C ; [*0204.0020.0002.000C] # FORM FEED (in 6429)
    000D ; [*0205.0020.0002.000D] # CARRIAGE RETURN (in 6429)
    0020 ; [*020A.0020.0002.0020] # SPACE
    ...
    0085 ; [*0206.0020.0002.0085] # NEXT LINE (in 6429)
    00A0 ; [*020A.0020.001B.00A0] # NO-BREAK SPACE; QQK
    00A1 ; [*026F.0020.0002.00A1] # INVERTED EXCLAMATION MARK
    ...
    02FE ; [*041A.0020.0002.02FE] # MODIFIER LETTER OPEN SHELF
    02FF ; [*041B.0020.0002.02FF] # MODIFIER LETTER LOW LEFT ARROW

    0374 ; [*03E9.0020.0002.0374] # GREEK NUMERAL SIGN; QQC
    0375 ; [*03EA.0020.0002.0375] # GREEK LOWER NUMERAL SIGN
    037E ; [*0243.0020.0002.037E] # GREEK QUESTION MARK; QQC
    0384 ; [*020F.0020.0002.0384] # GREEK TONOS; QQC
    0385 ; [*0216.0020.0002.00A8][.0000.0032.0002.0301] # GREEK DIALYTIKA TONOS; QQCM
    0387 ; [*0292.0020.0002.0387] # GREEK ANO TELEIA; QQC
    03F6 ; [*054B.0020.0002.03F6] # GREEK REVERSED LUNATE EPSILON SYMBOL
    0482 ; [*044C.0020.0002.0482] # CYRILLIC THOUSANDS SIGN

    055A ; [*0392.0020.0002.055A] # ARMENIAN APOSTROPHE
    055B ; [*0393.0020.0002.055B] # ARMENIAN EMPHASIS MARK
    055C ; [*0270.0020.0002.055C] # ARMENIAN EXCLAMATION MARK
    055D ; [*0235.0020.0002.055D] # ARMENIAN COMMA
    055E ; [*0276.0020.0002.055E] # ARMENIAN QUESTION MARK
    055F ; [*0394.0020.0002.055F] # ARMENIAN ABBREVIATION MARK
    0589 ; [*0248.0020.0002.0589] # ARMENIAN FULL STOP
    058A ; [*0224.0020.0002.058A] # ARMENIAN HYPHEN

    05BE ; [*0395.0020.0002.05BE] # HEBREW PUNCTUATION MAQAF
    05C0 ; [*0396.0020.0002.05C0] # HEBREW PUNCTUATION PASEQ
    05C3 ; [*0397.0020.0002.05C3] # HEBREW PUNCTUATION SOF PASUQ

    05C6 ; [*0398.0020.0002.05C6] # HEBREW PUNCTUATION NUN HAFUKHA
    05F3 ; [*0399.0020.0002.05F3] # HEBREW PUNCTUATION GERESH
    05F4 ; [*039A.0020.0002.05F4] # HEBREW PUNCTUATION GERSHAYIM

    0606 ; [*0566.0020.0002.0606] # ARABIC-INDIC CUBE ROOT
    ...
    060F ; [*044F.0020.0002.060F] # ARABIC SIGN MISRA

    061B ; [*0244.0020.0002.061B] # ARABIC SEMICOLON
    ...
    06D4 ; [*0283.0020.0002.06D4] # ARABIC FULL STOP

    06E9 ; [*0450.0020.0002.06E9] # ARABIC PLACE OF SAJDAH

    0700 ; [*02B5.0020.0002.0700] # SYRIAC END OF PARAGRAPH
    ...
    070D ; [*039E.0020.0002.070D] # SYRIAC HARKLEAN ASTERISCUS

    07F6 ; [*0452.0020.0002.07F6] # NKO SYMBOL OO DENNEN
    ...
    07FA ; [*020D.0020.0002.07FA] # NKO LAJANYALAN
    0830 ; [*0250.0020.0002.0830] # SAMARITAN PUNCTUATION NEQUDAA
    ...
    083E ; [*025E.0020.0002.083E] # SAMARITAN PUNCTUATION ANNAAU

    0964 ; [*0294.0020.0002.0964] # DEVANAGARI DANDA
    ...
    0970 ; [*03A1.0020.0002.0970] # DEVANAGARI ABBREVIATION SIGN
    09F4 ; [*1110.0020.0002.09F4] # BENGALI CURRENCY NUMERATOR ONE
    ...
    09FA ; [*0453.0020.0002.09FA] # BENGALI ISSHAR
    0B70 ; [*0454.0020.0002.0B70] # ORIYA ISSHAR
    0BF0 ; [*111C.0020.0002.0BF0] # TAMIL NUMBER TEN
    ...
    0BFA ; [*045B.0020.0002.0BFA] # TAMIL NUMBER SIGN
    0C7F ; [*045C.0020.0002.0C7F] # TELUGU SIGN TUUMU
    0CF1 ; [*045D.0020.0002.0CF1] # KANNADA SIGN JIHVAMULIYA
    0CF2 ; [*045E.0020.0002.0CF2] # KANNADA SIGN UPADHMANIYA
    0D70 ; [*111F.0020.0002.0D70] # MALAYALAM NUMBER TEN
    ...
    0D79 ; [*045F.0020.0002.0D79] # MALAYALAM DATE MARK
    0DF4 ; [*03A5.0020.0002.0DF4] # SINHALA PUNCTUATION KUNDDALIYA
    0E4F ; [*0467.0020.0002.0E4F] # THAI CHARACTER FONGMAN
    ...
    0E5B ; [*03A7.0020.0002.0E5B] # THAI CHARACTER KHOMUT
    0F01 ; [*0468.0020.0002.0F01] # TIBETAN MARK GTER YIG MGO TRUNCATED A
    ...
    0FD4 ; [*03BD.0020.0002.0FD4] # TIBETAN MARK CLOSING BRDA RNYING YIG MGO SGAB MA
    0FD5 ; [*048A.0020.0002.0FD5] # RIGHT-FACING SVASTI SIGN
    ...
    0FD8 ; [*048D.0020.0002.0FD8] # LEFT-FACING SVASTI SIGN WITH DOTS
    104A ; [*029F.0020.0002.104A] # MYANMAR SIGN LITTLE SECTION
    ...
    109F ; [*03C7.0020.0002.109F] # MYANMAR SYMBOL SHAN EXCLAMATION
    10FB ; [*02B7.0020.0002.10FB] # GEORGIAN PARAGRAPH SEPARATOR
    1360 ; [*02B8.0020.0002.1360] # ETHIOPIC SECTION MARK
    ...
    1399 ; [*0425.0020.0002.1399] # ETHIOPIC TONAL MARK KURT
    1400 ; [*0225.0020.0002.1400] # CANADIAN SYLLABICS HYPHEN
    ...
    166E ; [*0289.0020.0002.166E] # CANADIAN SYLLABICS FULL STOP
    1680 ; [*020B.0020.0002.1680] # OGHAM SPACE MARK
    ...
    169C ; [*030A.0020.0002.169C] # OGHAM REVERSED FEATHER MARK
    16EB ; [*026A.0020.0002.16EB] # RUNIC SINGLE PUNCTUATION
    ...
    16ED ; [*026C.0020.0002.16ED] # RUNIC CROSS PUNCTUATION
    1735 ; [*029C.0020.0002.1735] # PHILIPPINE SINGLE PUNCTUATION
    1736 ; [*029D.0020.0002.1736] # PHILIPPINE DOUBLE PUNCTUATION

    17D4 ; [*02A1.0020.0002.17D4] # KHMER SIGN KHAN
    ...
    17DA ; [*03CE.0020.0002.17DA] # KHMER SIGN KOOMUUT
    1800 ; [*039F.0020.0002.1800] # MONGOLIAN BIRGA
    ...
    180E ; [*0207.0020.0002.180E] # MONGOLIAN VOWEL SEPARATOR
    1940 ; [*03C1.0020.0002.1940] # LIMBU SIGN LOO
    ...
    1945 ; [*027A.0020.0002.1945] # LIMBU QUESTION MARK
    19E0 ; [*048E.0020.0002.19E0] # KHMER SYMBOL PATHAMASAT
    ...
    19FF ; [*04AD.0020.0002.19FF] # KHMER SYMBOL DAP-PRAM ROC
    1A1E ; [*02BA.0020.0002.1A1E] # BUGINESE PALLAWA
    1A1F ; [*02BB.0020.0002.1A1F] # BUGINESE END OF SECTION

    1AA0 ; [*03CF.0020.0002.1AA0] # TAI THAM SIGN WIANG
    ...
    1AAD ; [*03D7.0020.0002.1AAD] # TAI THAM SIGN CAANG
    1B5A ; [*02BC.0020.0002.1B5A] # BALINESE PANTI
    ...
    1B6A ; [*04B7.0020.0002.1B6A] # BALINESE MUSICAL SYMBOL DANG GEDE

    1B74 ; [*04B8.0020.0002.1B74] # BALINESE MUSICAL SYMBOL RIGHT-HAND OPEN DUG
    ...
    1B7C ; [*04C0.0020.0002.1B7C] # BALINESE MUSICAL SYMBOL LEFT-HAND OPEN PING
    1C3B ; [*0298.0020.0002.1C3B] # LEPCHA PUNCTUATION TA-ROL
    ...
    1C3F ; [*03C0.0020.0002.1C3F] # LEPCHA PUNCTUATION TSHOOK
    1C7E ; [*02B3.0020.0002.1C7E] # OL CHIKI PUNCTUATION MUCAAD
    1C7F ; [*02B4.0020.0002.1C7F] # OL CHIKI PUNCTUATION DOUBLE MUCAAD

    1FBD ; [*0219.0020.0002.1FBD] # GREEK KORONIS; QQC
    ...
    1FFE ; [*021A.0020.0002.1FFE] # GREEK DASIA
    2000 ; [*020A.0020.0004.2000] # EN QUAD; QQK
    ...
    200A ; [*020A.0020.0004.200A] # HAIR SPACE; QQK
    2010 ; [*0229.0020.0002.2010] # HYPHEN
    ...
    2017 ; [*021E.0020.0002.2017] # DOUBLE LOW LINE
    2018 ; [*02EF.0020.0002.2018] # LEFT SINGLE QUOTATION MARK
    ...
    201F ; [*02F9.0020.0002.201F] # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
    2020 ; [*036A.0020.0002.2020] # DAGGER
    ...
    2027 ; [*036E.0020.0002.2027] # HYPHENATION POINT
    2028 ; [*0208.0020.0002.2028] # LINE SEPARATOR
    2029 ; [*0209.0020.0002.2029] # PARAGRAPH SEPARATOR
    202F ; [*020A.0020.001B.202F] # NARROW NO-BREAK SPACE; QQK
    2030 ; [*0365.0020.0002.2030] # PER MILLE SIGN
    2031 ; [*0367.0020.0002.2031] # PER TEN THOUSAND SIGN
    2032 ; [*0372.0020.0002.2032] # PRIME
    ...
    2039 ; [*02F3.0020.0002.2039] # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    ...
    2043 ; [*036F.0020.0002.2043] # HYPHEN BULLET
    2044 ; [*035D.0020.0002.2044] # FRACTION SLASH
    2045 ; [*030B.0020.0002.2045] # LEFT SQUARE BRACKET WITH QUILL
    ...
    205E ; [*02DB.0020.0002.205E] # VERTICAL FOUR DOTS
    205F ; [*020A.0020.0004.205F] # MEDIUM MATHEMATICAL SPACE; QQK

    Note that I have abbreviated the list, just to exhibit that there are
    a LOT of variable elements in the DUCET.
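
    For anyone who wants to count them instead of reading the list, here
    is a small Python sketch that parses lines in the format quoted above
    and tells variable elements (marked with '*') from the others. The
    function names parse_line() and is_variable() are mine, and the
    regular expression only assumes the four-field layout visible in
    these excerpts.

```python
import re

# One collation element: '[', marker ('.' or '*'), then four hex fields.
CE_PATTERN = re.compile(
    r"\[([.*])([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4,6})\]"
)


def parse_line(line):
    """Split 'CODES ; [..][..] # NAME' into (code points, elements, name).

    Each element is returned as (variable, primary, secondary, tertiary);
    the fourth field of the entry is not kept.
    """
    data, _, comment = line.partition("#")
    codes, _, elements = data.partition(";")
    code_points = [int(c, 16) for c in codes.split()]
    ces = [
        (marker == "*", int(p, 16), int(s, 16), int(t, 16))
        for marker, p, s, t, _ in CE_PATTERN.findall(elements)
    ]
    return code_points, ces, comment.strip()


def is_variable(line):
    """True when at least one collation element on the line is variable."""
    return any(ce[0] for ce in parse_line(line)[1])


sample = "0385 ; [*0216.0020.0002.00A8][.0000.0032.0002.0301] # GREEK DIALYTIKA TONOS; QQCM"
print(parse_line(sample))
print(is_variable(sample))  # True
```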

    (unfortunately, the DUCET uses in itself a "strange" mixed order: it
    should be ordered itself in the default ("non-ignorable") order, or in
    the code point scalar value order to avoid possible duplicates, but
    analyzing its content is really tricky. In fact, all collation
    graphemes (single characters or the few contractions introduced for
    canonical equivalence with non-starter composites or for Thai/Lao
    exceptions to the logical order) that are not fully ignorable are
    effectively sorted in binary scalar value order, after all the fully
    ignorables.)

    I'll take some examples of how my proposed new "Separating" mode would
    apply to variable elements:

    0009 ; [.0201.0201.0020.0002.0009] # HORIZONTAL TABULATION (in 6429)
    000A ; [.0201.0202.0020.0002.000A] # LINE FEED (in 6429)
    000B ; [.0201.0203.0020.0002.000B] # VERTICAL TABULATION (in 6429)
    000C ; [.0201.0204.0020.0002.000C] # FORM FEED (in 6429)
    000D ; [.0201.0205.0020.0002.000D] # CARRIAGE RETURN (in 6429)
    0020 ; [.0201.020A.0020.0002.0020] # SPACE

    00A1 ; [.0201.026F.0020.0002.00A1] # INVERTED EXCLAMATION MARK
    ...
    02FE ; [.0201.041A.0020.0002.02FE] # MODIFIER LETTER OPEN SHELF
    02FF ; [.0201.041B.0020.0002.02FF] # MODIFIER LETTER LOW LEFT ARROW

    0374 ; [.0201.03E9.0020.0002.0374] # GREEK NUMERAL SIGN; QQC
    0375 ; [.0201.03EA.0020.0002.0375] # GREEK LOWER NUMERAL SIGN
    037E ; [.0201.0243.0020.0002.037E] # GREEK QUESTION MARK; QQC
    0384 ; [.0201.020F.0020.0002.0384] # GREEK TONOS; QQC
    0385 ; [.0201.0216.0020.0002.00A8][.0000.0000.0032.0002.0301] # GREEK DIALYTIKA TONOS; QQCM

    And other examples on the non-variable elements which are ignorable AT
    LEAST at the primary level:

    0000 ; [.0000.0000.0000.0000.0000] # [0000] NULL (in 6429)
    0001 ; [.0000.0000.0000.0000.0000] # [0001] START OF HEADING (in 6429)
    ...
    E01EF ; [.0000.0000.0000.0000.0000] # [E01EF] VARIATION SELECTOR-256

    0332 ; [.0000.0000.0021.0002.0332] # COMBINING LOW LINE
    0313 ; [.0000.0000.0022.0002.0313] # COMBINING COMMA ABOVE
    0343 ; [.0000.0000.0022.0002.0343] # COMBINING GREEK KORONIS; QQC
    0486 ; [.0000.0000.0022.0002.0486] # COMBINING CYRILLIC PSILI PNEUMATA; QQC
    2CF1 ; [.0000.0000.0022.0002.2CF1] # COPTIC COMBINING SPIRITUS LENIS; QQC
    0314 ; [.0000.0000.002A.0002.0314] # COMBINING REVERSED COMMA ABOVE
    0485 ; [.0000.0000.002A.0002.0485] # COMBINING CYRILLIC DASIA PNEUMATA; QQC
    2CF0 ; [.0000.0000.002A.0002.2CF0] # COPTIC COMBINING SPIRITUS ASPER; QQC

    And other examples on other non-variable elements that are not
    ignorable, or including those with expansions:

    1D00 ; [.1213.0020.0002.FFFF.1D00] # LATIN LETTER SMALL CAPITAL A
    ...
    1F00 ; [.1545.0020.0002.FFFF.03B1][.0000.0000.0022.0002.0313] # GREEK SMALL LETTER ALPHA WITH PSILI; QQCM
    1F08 ; [.1545.0020.0008.FFFF.0391][.0000.0000.0022.0002.0313] # GREEK CAPITAL LETTER ALPHA WITH PSILI; QQCM
    1F04 ; [.1545.0020.0002.FFFF.03B1][.0000.0000.0022.0002.0313][.0000.0000.0032.0002.0301] # GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA; QQCM

    Why do I propose this mode ?

    Because the existing default "Shifted" mode (or the derived
    "Shift-Trimmed" mode for POSIX) is completely impossible to tweak
    correctly in tailorings, without lots of corrections.

    What I propose is a safer and much simpler way to manage the
    tailorings of variable characters that are considered as "word
    separators", as they will collate all together at the primary level,
    as if they were all SPACEs at this level, without sacrificing
    completely their finely-tuned relative order, and without the risk of
    forgetting some of them (notably when new versions of Unicode add many
    new variable elements, such as pictograms and emoji).

    With this mode, it will be MUCH simpler to tailor specially only very
    few variable elements so that they will be effectively ignored at the
    primary level. For example, in French we just want apostrophes and
    parentheses to be treated as ignorable, all the other variable
    elements would be treated as separators, that would keep their
    relative order, at least in the lower levels, without having to tailor
    them specifically. No change will be needed in the DUCET format.

    Note for example how the French rules are documented in the French
    Wiktionary. For now it does not use UCA, but assigns primary weights
    for all variable elements by treating most of them as SPACE, with only
    a few of them treated as ignorable, by dropping them from the given
    sort key. It took many months of discussions (and experiments or bug
    reports to solve) to get the correct ordering, at least at the
    primary level. The secondary level for diacritics is NOT supported
    but partially emulated, without support for backward ordering and
    with some caveats as it uses the binary ordering; only the tertiary
    level for case differences is almost fully supported, by sorting all
    lowercase before all uppercase/titlecase/final variants.

    http://fr.wiktionary.org/wiki/Modèle:clé_de_tri

    As MediaWiki is about to introduce UCA-based collation for sorting
    categories, this should be done without sacrificing the fixes for the
    various issues that have been solved at the primary level (which has
    been stable since 2007 and causes problems to nobody, except when
    handling languages other than French, because there's no possibility
    to support the Spanish 'll' contractions, Catalan 'l·l' contractions
    or Swedish 'ch' contractions with a single sort key for multilingual
    articles; note also that French order needs absolutely no
    contraction).

    My proposal addresses a general need for the primary level ONLY, other
    levels won't be affected within the relative ordering of texts that
    sort as equal in this "new" primary level, so we keep all the
    advantages of the DUCET. Some needs that it addresses at the primary
    level:

    - It sorts "aujourd'hui" with "aujourdhui" by only tailoring the
    apostrophe in this mode so that it becomes ignorable (primary weight
    0000 instead of 0201 in the new mode)
    - It sorts "fleur(s)" with "fleurs" by only tailoring the parentheses
    (primary weight 0000 instead of 0201 in the new mode)
    - It sorts "km/h" with "km h" without tailoring (primary weight 0201
    in the new mode)
    - It sorts all symbols with " " without tailoring (primary weight 0201
    in the new mode), while keeping all their relative order.
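
    These four cases can be checked with a toy Python model of the new
    primary level. Everything in it is an assumption made for the
    illustration: primary_key() is a hypothetical helper, letters and
    digits get a made-up primary weight derived from their code point
    (not their DUCET weight), and every other character is treated as a
    variable element.

```python
SEPARATOR = 0x0201  # assumption: common primary weight of variable elements


def primary_key(text, ignorable=frozenset()):
    """Build a toy primary-level key under the "Separating" mode.

    `ignorable` holds the characters tailored to primary weight 0000.
    """
    key = []
    for ch in text:
        if ch in ignorable:
            continue  # primary weight 0000 contributes nothing to the key
        if ch.isalnum():
            # Stand-in for a real DUCET primary weight (case-insensitive).
            key.append(0x1000 + ord(ch.lower()))
        else:
            key.append(SEPARATOR)  # all other characters act as separators
    return key


french = frozenset("'()")  # the few characters tailored as ignorable

print(primary_key("aujourd'hui", french) == primary_key("aujourdhui", french))  # True
print(primary_key("fleur(s)", french) == primary_key("fleurs", french))         # True
print(primary_key("km/h") == primary_key("km h"))                               # True
```

    Without the tailoring, "fleur(s)" and "fleurs" get different primary
    keys, since each parenthesis then counts as a separator.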

    Philippe.


