Re: Changing UCA primary weights (bad idea)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Jul 12 2004 - 11:18:54 CDT


    > John [Cowan]'s list is not "a few characters".

    Let's take Latin, for starters. There are 1870 entries in the UCA for Latin.
    If you subtract from John's list the ones that are already interleaved -- as
    I did in my email -- then you get 78 values, or about 4%.

    I'll repeat that list below, since it seems to have gone unnoticed.
    Now, one could argue that the letters without uppercase pairs are only used
    technically (e.g. in IPA), and thus should be excluded. If so, that leaves
    us with 52 (26 upper+lower), or about 3%.

    If we really wanted to minimize the number of changes, then we could exclude
    the ones that are for languages that rarely occur in data. I did a quick
    check on http://www.eki.ee/letter/, and put what I found below. This is
    *not* a complete analysis, and would need to be extended to the other
    scripts, but we would then be talking about 10 letters (5 upper+lower) or
    0.5% with a very restrictive list, about double that if we included a few
    more.
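
    As a quick check on the percentages above, here is a minimal Python
    sketch. The counts (1870, 78, 52, 10) are the ones quoted in this
    message; the names `LATIN_UCA_ENTRIES`, `candidates` and `share` are
    illustrative only and do not come from any UCA tool.

```python
# Counts quoted in the message; all names here are illustrative only.
LATIN_UCA_ENTRIES = 1870

candidates = {
    "not already interleaved": 78,
    "only letters with uppercase pairs": 52,
    "very restrictive list": 10,
}


def share(count, total=LATIN_UCA_ENTRIES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


for label, count in candidates.items():
    print(f"{label}: {count} of {LATIN_UCA_ENTRIES} = {share(count):.1f}%")
```

    The results are a little under or over the rounded figures given above,
    which is consistent with "about 4%", "about 3%" and "0.5%".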

    So, yes, I do think it will probably end up being a pretty small list.

    Mark

    =======
    Capitals by language on http://www.eki.ee/letter/

    da [Danish]; fo [Faroese]; kl [Greenlandic]; no [Norwegian];

     00D8; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE
     01FE; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE AND ACUTE (no information, but included for consistency with O WITH STROKE)

    bs [Bosnian]; hr [Croatian]; sami1 [Inari Sámi]; sami2 [North Sámi]; sami4
    [Skolt Sámi]; sl [Slovenian]; vi [Vietnamese];

     0110; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH STROKE

    mt [Maltese];

     0126; 0048; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER H WITH STROKE

    pl [Polish]; sorb1 [Lower Sorbian]; sorb2 [Upper Sorbian]; sla [Kashubian];

     0141; 004C; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER L WITH STROKE

    sami2 [North Sámi];

     0166; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH STROKE
     01E4; 0047; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER G WITH STROKE

    ha [Hausa]; ff [Fula]; or bm [Bambara];

     0181; 0042; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER B WITH HOOK
     018A; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH HOOK
     0198; 004B; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER K WITH HOOK
     01B3; 0059; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Y WITH HOOK
     019D; 004E; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER N WITH LEFT HOOK

    No Information

     0187; 0043; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER C WITH HOOK
     0191; 0046; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER F WITH HOOK
     0193; 0047; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER G WITH HOOK
     01A4; 0050; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER P WITH HOOK
     01AC; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH HOOK
     01B2; 0056; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER V WITH HOOK
     0224; 005A; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Z WITH HOOK
     0197; 0049; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER I WITH STROKE
     01B5; 005A; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Z WITH STROKE
     0182; 0042; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER B WITH TOPBAR
     018B; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH TOPBAR

     0220; 004E; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
     019F; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH MIDDLE TILDE
     01AE; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH RETROFLEX HOOK

    ==============
    List of items from John's list that are not already interleaved.

     0181; 0042; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER B WITH HOOK
     0182; 0042; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER B WITH TOPBAR
     0187; 0043; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER C WITH HOOK
     0110; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH STROKE
     018A; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH HOOK
     018B; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH TOPBAR
     0191; 0046; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER F WITH HOOK
     0193; 0047; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER G WITH HOOK
     01E4; 0047; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER G WITH STROKE
     0126; 0048; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER H WITH STROKE
     0197; 0049; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER I WITH STROKE
     0198; 004B; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER K WITH HOOK
     0141; 004C; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER L WITH STROKE
     019D; 004E; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER N WITH LEFT HOOK
     0220; 004E; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
     00D8; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE
     019F; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH MIDDLE TILDE
     01FE; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
     01A4; 0050; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER P WITH HOOK
     0166; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH STROKE
     01AC; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH HOOK
     01AE; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
     01B2; 0056; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER V WITH HOOK
     01B3; 0059; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Y WITH HOOK
     01B5; 005A; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Z WITH STROKE
     0224; 005A; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Z WITH HOOK

     1E9A; 0061; !nfd+remove_marks; !uca #LATIN SMALL LETTER A WITH RIGHT HALF RING
     0180; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH STROKE
     0183; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH TOPBAR
     0253; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH HOOK
     0188; 0063; !nfd+remove_marks; !uca #LATIN SMALL LETTER C WITH HOOK
     0255; 0063; !nfd+remove_marks; !uca #LATIN SMALL LETTER C WITH CURL
     0111; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH STROKE
     018C; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH TOPBAR
     0221; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH CURL
     0256; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH TAIL
     0257; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH HOOK
     0192; 0066; !nfd+remove_marks; !uca #LATIN SMALL LETTER F WITH HOOK
     01E5; 0067; !nfd+remove_marks; !uca #LATIN SMALL LETTER G WITH STROKE
     0260; 0067; !nfd+remove_marks; !uca #LATIN SMALL LETTER G WITH HOOK
     0127; 0068; !nfd+remove_marks; !uca #LATIN SMALL LETTER H WITH STROKE
     0266; 0068; !nfd+remove_marks; !uca #LATIN SMALL LETTER H WITH HOOK
     0268; 0069; !nfd+remove_marks; !uca #LATIN SMALL LETTER I WITH STROKE
     029D; 006A; !nfd+remove_marks; !uca #LATIN SMALL LETTER J WITH CROSSED-TAIL
     0199; 006B; !nfd+remove_marks; !uca #LATIN SMALL LETTER K WITH HOOK
     0140; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH MIDDLE DOT
     0142; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH STROKE
     019A; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH BAR
     0234; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH CURL
     026B; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH MIDDLE TILDE
     026C; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH BELT
     026D; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH RETROFLEX HOOK
     0271; 006D; !nfd+remove_marks; !uca #LATIN SMALL LETTER M WITH HOOK
     019E; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH LONG RIGHT LEG
     0235; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH CURL
     0272; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH LEFT HOOK
     0273; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH RETROFLEX HOOK
     00F8; 006F; !nfd+remove_marks; !uca #LATIN SMALL LETTER O WITH STROKE
     01FF; 006F; !nfd+remove_marks; !uca #LATIN SMALL LETTER O WITH STROKE AND ACUTE
     01A5; 0070; !nfd+remove_marks; !uca #LATIN SMALL LETTER P WITH HOOK
     02A0; 0071; !nfd+remove_marks; !uca #LATIN SMALL LETTER Q WITH HOOK
     027C; 0072; !nfd+remove_marks; !uca #LATIN SMALL LETTER R WITH LONG LEG
     027D; 0072; !nfd+remove_marks; !uca #LATIN SMALL LETTER R WITH TAIL
     0282; 0073; !nfd+remove_marks; !uca #LATIN SMALL LETTER S WITH HOOK
     0167; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH STROKE
     01AB; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH PALATAL HOOK
     01AD; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH HOOK
     0236; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH CURL
     0288; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH RETROFLEX HOOK
     028B; 0076; !nfd+remove_marks; !uca #LATIN SMALL LETTER V WITH HOOK
     01B4; 0079; !nfd+remove_marks; !uca #LATIN SMALL LETTER Y WITH HOOK
     01B6; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH STROKE
     0225; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH HOOK
     0290; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH RETROFLEX HOOK
     0291; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH CURL
     025A; 0259; !nfd+remove_marks; !uca #LATIN SMALL LETTER SCHWA WITH HOOK
     0286; 0283; !nfd+remove_marks; !uca #LATIN SMALL LETTER ESH WITH CURL
     01BA; 0292; !nfd+remove_marks; !uca #LATIN SMALL LETTER EZH WITH TAIL
     0293; 0292; !nfd+remove_marks; !uca #LATIN SMALL LETTER EZH WITH CURL
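
    The entries above can be read mechanically. What follows is a minimal
    Python sketch under one stated assumption: that `!nfd+remove_marks`
    means "this character does not reach the listed base letter by NFD
    normalization followed by removal of combining marks". That reading is
    an interpretation of the notation, not something the list itself
    defines, and the `!uca` flag is not checked here because that would
    need the UCA table itself. The helper names `parse_line`, `strip_marks`
    and `check` are hypothetical.

```python
import unicodedata

# A few entries copied from the list above (wrapped names rejoined).
SAMPLE = """\
00D8; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE
0110; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH STROKE
0141; 004C; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER L WITH STROKE
01FE; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
"""


def parse_line(line):
    """Split one entry into (character, proposed base letter, name)."""
    fields, _, name = line.partition("#")
    code, base = [f.strip() for f in fields.split(";")[:2]]
    return chr(int(code, 16)), chr(int(base, 16)), name.strip()


def strip_marks(text):
    """NFD-normalize, then drop combining marks (general category M*)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )


def check(sample=SAMPLE):
    """Return (char, base, NFD-minus-marks) triples for each sample entry."""
    rows = []
    for line in sample.splitlines():
        char, base, _name = parse_line(line)
        rows.append((char, base, strip_marks(char)))
    return rows


for char, base, stripped in check():
    print(
        f"U+{ord(char):04X}: base {base!r}, NFD minus marks {stripped!r}, "
        f"reaches base: {stripped == base}"
    )
```

    Under that assumption, every sampled entry should report that the base
    letter is not reached: the stroke is part of the letter rather than a
    combining mark, so decomposition alone never yields plain O, D or L.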

    Mark

    ----- Original Message -----
    From: "Michael Everson" <everson@evertype.com>
    To: <unicode@unicode.org>
    Sent: Saturday, July 10, 2004 04:20
    Subject: Re: Changing UCA primary weights (bad idea)

    > At 17:34 -0700 2004-07-09, Mark Davis wrote:
    >
    > >What I think we should be examining is which of the items that are not
    > >interfiled (to use your phrasing) should be, if any. I don't think
    > >everything should be. In particular, I think John's list is the list we
    > >should be focusing on.
    >
    > I think most of what is in John [Cowan]'s list
    > are letters which are quite properly not
    > interfiled with "base" letters. The African hook
    > letters (which I have mentioned many times, and
    > which you have ignored in favour of the Danish
    > letters you are more familiar with) are there.
    >
    > >> John's list?
    > >
    > >That was in my original mail, which you were commenting on when you changed
    > >the subject line, but which you apparently didn't bother to actually read.
    >
    > Sweet of you to say.
    >
    > >> My point is made here. It is really only in
    > >> initial position where this is likely to be
    > >> noticed.
    > >
    > >This is incorrect. It will make a difference in other positions. Sorting
    > >"Søren" after "Sozar" in a long list, if someone isn't expecting it, will
    > >cause problems. They look for it after "Soret", don't see it on the page,
    > >and assume it isn't there; fooled by the fact that it is on a completely
    > >different page.
    >
    > No way! Do you expect your default tailorable
    > template to suddenly and magically relieve the
    > user of the problems of long lists and multi-page
    > typesetting? Sheesh. No matter how much you
    > jiggle either the template or a tailoring for
    > people who only know the letters A-Z, there will
    > be edge cases which will fail this kind of test.
    >
    > >Remember that the collation sequence is also used for language-sensitive
    > >matching as well as sorting.
    >
    > I remember.
    >
    > >> What I want is the status quo, however.
    > >> Leave the template and its principles alone.
    > >
    > >Stability is important, and we want to consider that very carefully before
    > >making any change. However, I believe that the current way we handle a few
    > >characters in UCA is distinctly suboptimal, and worth considering.
    >
    > John [Cowan]'s list is not "a few characters".
    > --
    > Michael Everson * * Everson Typography * * http://www.evertype.com
    >
    >
    >


