Re: Changing UCA primary weights (bad idea)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Jul 12 2004 - 11:18:35 CDT


    >I am positive that all of my
    tailorings for Sybase will be *affected*, for example. I don't think
    they will be *substantially* affected, in the sense of any complete
    redefinition of how the tailoring itself is defined. But the very
    fact of the proposed rearrangements will hit some internals, and
    it *will* result in changed key values, which effectively version
    the tailorings, as far as I am concerned.

    We know that if the UCA is changed in any way, then any tailoring is
    affected in that it will produce a different ordering for some characters
    (any that it does not explicitly override). So any implementation's
    versioning scheme must take account of this. This will always be the case,
    unless we completely freeze the UCA, disallowing fixes for, say, Indic
    characters. But the UTC clearly has not agreed to do this in the past; while
    stability is very important, we have left ourselves the ability to make
    changes in the UCA when warranted. So the key here is to assess whether any
    proposed changes are warranted.
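
    A minimal sketch of the inheritance point, using invented weights rather
    than anything from allkeys.txt: a tailoring overrides some entries and
    inherits the rest, so re-weighting an inherited character changes its keys
    and should change the version tag. The function names, the ö rule, and the
    fingerprint scheme are all assumptions made for illustration.

```python
import hashlib

def effective_table(base, tailoring):
    """Tailoring entries override the base; everything else is inherited."""
    return {**base, **tailoring}

def sort_key(word, table):
    """Toy key: the sequence of (primary, secondary) weights of each character."""
    return tuple(table[ch] for ch in word)

def table_version(table):
    """Fingerprint of the effective weights, usable as a collation version tag."""
    data = repr(sorted(table.items())).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:12]

# Invented weights, not values from allkeys.txt.
BASE_OLD = {"n": (10, 0), "o": (20, 0), "ø": (21, 0), "ö": (20, 1)}
BASE_NEW = {**BASE_OLD, "ø": (20, 2)}   # hypothetical re-weighting of ø in the base
TAILORING = {"ö": (22, 0)}              # hypothetical rule: ö as a letter of its own

old = effective_table(BASE_OLD, TAILORING)
new = effective_table(BASE_NEW, TAILORING)

# The explicitly tailored character keeps its key; the inherited one does not.
tailored_unchanged = sort_key("nö", old) == sort_key("nö", new)
inherited_changed = sort_key("nø", old) != sort_key("nø", new)
version_changed = table_version(old) != table_version(new)
```

    In this toy setup the tailoring rule itself is untouched, yet stored keys
    for strings containing ø are stale after the base change, which is why the
    version tag is derived from the effective table and not from the rules.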

    So the question is whether Sybase tailorings, such as German, will be
    affected positively or negatively, and to what degree. If a German customer
    is accessing a database full of European names, and expects to find Ę with
    E, and Ą with A and Ż with Z and Ł with L, then he will be right *except*
    for the last one.

    And this has consequences; it is not just an academic exercise. If s/he
    expects that a database SELECT of all client names starting with "L" will
    include the "Ł" names also, then s/he will get the wrong answer in a
    financial report -- and probably not realizing it is wrong. If s/he looks
    for a client name Słownik* within a page of Sl... and doesn't think to look
    3 pages down after Sz, then s/he will get the wrong answer. If s/he searches
    for a name within a body of text using a weak language-sensitive match, and
    doesn't find it, then s/he will get the wrong answer.
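
    The SELECT case can be sketched as a primary-strength prefix match. The
    weights below are invented (not taken from allkeys.txt), the names are
    made up, and `starts_with` is a hypothetical helper, not any database's
    actual implementation.

```python
import string

# Invented primary weights for a-z, spaced so that extra letters fit in between.
BASE = {ch: 10 * i for i, ch in enumerate(string.ascii_lowercase, start=1)}

CURRENT = {**BASE, "ł": BASE["l"] + 5}   # ł has its own primary weight, just after l
PROPOSED = {**BASE, "ł": BASE["l"]}      # ł shares the primary weight of l

def primary_key(name, table):
    """Primary weights only; case is ignored by lowercasing first."""
    return tuple(table[ch] for ch in name.lower())

def starts_with(name, prefix, table):
    """Primary-strength prefix match: compare primary weights only."""
    p = primary_key(prefix, table)
    return primary_key(name, table)[:len(p)] == p

NAMES = ["Lang", "Łukasz", "Mayer"]

current_hits = [n for n in NAMES if starts_with(n, "L", CURRENT)]
proposed_hits = [n for n in NAMES if starts_with(n, "L", PROPOSED)]
# current_hits  -> ["Lang"]
# proposed_hits -> ["Lang", "Łukasz"]
```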

    Mark

    * Made-up name; someone might give better example.

    ----- Original Message -----
    From: "Kenneth Whistler" <kenw@sybase.com>
    To: <mark.davis@jtcsv.com>
    Cc: <unicode@unicode.org>
    Sent: Friday, July 09, 2004 14:11
    Subject: Re: Changing UCA primary weights (bad idea)

    > Subject: Re: Changing UCA primarly weights (bad idea)
                                ^^^^^^^^

    Correcting the subject, just because it bugs me...

    > You are certainly right that this is not a slam-dunk; there are reasons
    > for and against it. And it may well be that the committee decides against
    > it.

    Yes.

    > However, you overstate the situation with tailorings. The only tailorings
    > that would be affected are ones where the tailoring depends on inheriting
    > the order from the UCA for the affected characters.

    > So the number of tailorings that in practice would be affected I suspect
    > to be very small. However, if you have actual evidence of tailorings that
    > would be adversely affected by John Cowan's list, I would love to see it.

    The European Ordering Rules are the obvious first instance impacted
    heavily by this.

    And I think you may underestimate the number of tailorings that
    in practice would be affected. It depends, in part, on how the
    tailorings of UCA are implemented. I am positive that all of my
    tailorings for Sybase will be *affected*, for example. I don't think
    they will be *substantially* affected, in the sense of any complete
    redefinition of how the tailoring itself is defined. But the very
    fact of the proposed rearrangements will hit some internals, and
    it *will* result in changed key values, which effectively version
    the tailorings, as far as I am concerned.

    > > 2) it proposes to reverse the *explicit* design principles that went
    > > into the default tailorable template in the *first* place. Similar
    > > letters are near -- but not interfiled with -- similar letters. This
    > > is MORE than enough to give any casual user the functionality he
    > > needs, because only in initial position is there likely to be any
    > > confusion in real-life sorted word lists, and even then, hooked-b
    > > follows bz, which is hardly burdensome for the end user.
    >
    > This also completely overstates the case. What we actually did was to put
    > similar letters near other letters, *and if their decompositions were the
    > same* we interfiled them. There is, however, little principled difference
    > between [[Editing down the list a bit: O-slash, O-with-horn,
    > and O-with-circumflex]]" that would cause a user to think that the
                                                ^^^^^^
                                                recte: a naive user unfamiliar
                                                       with IPA
    > some should be interfiled and some should not.

    Actually, there is: o-slash *is* treated as a separate letter in
    the pronunciation guides of all IPA-based dictionaries, which constitute
    the majority of the world's usage, currently. The same is not true
    for various o's with miscellaneous scattered accents sprinkled on.

    > In some languages these would
    > be seen as "separate letters" (e.g. with different primary weights) and in
    > others not; but that does not line up in any particular way with what is
    > in the UCA. (see also comment below).

    The more important correlation here that was attempted in the UCA was
    to keep the treatment more or less in synch with the formal decompositions
    provided in the UnicodeData.txt file. The more those diverge, the more
    difficult maintaining the tables becomes (technically) and the more
    difficult it becomes to hold the line (politically) against special
    claims from one constituency or another that "their" letter be weighted
    "correctly" in the default table for one reason or another.

    > > 3) in discussions elsewhere, Mark has talked about what "most users"
    > > "expect" and I found his suggestion to be anglocentric and
    > > unsubstantiated.
    >
    > And I will refrain from saying what I think of your reasoning ability in
    > general, although circularity seems to be a particular specialty. I
    > suggest that we stick to the facts instead of ad hominem attacks.

    Oh, very clever. An ad hominem attack which consists of claiming
    that you will refrain from saying something ad hominem.

    > For user expectations, check out how foreign words with unusual accents
    > are sorted in a variety of languages. I have seen no reason to believe
    > that Germans or French or others behave much differently when faced with
    > a letter like ø that is not one that they use.
    > The key is whether they would expect
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Actually, the key is whether they have any expectations based on their
    own language orthography's use of it *and* whether they have any
    expectations based on their own language's lexicographical practice
    in using it. I claim that outside the context of the United States,
    which has its own archaic and impoverished pronunciation guide
    practices, the majority of dictionary users in the world have some
    exposure to and familiarity with basic IPA letters. How that impacts
    their expectations regarding *interleaving* of IPA letters with
    other Latin letters is unclear, however, and probably is not something
    that could really be pinned down in the absence of ingrained, taught
    practice regarding what the rules *should* be in one culture or
    another. And what answer you get will be influenced by how you
    ask it. If you start out with "Do you think that all letters
    that are like o's should sort together by default...?" then you've
    already begged your answer.

    > to see:
    >
    > a) Interleaved:

    > b) Separate but near:

    > c) Like a particular language

    > People I've talked to, from various different backgrounds, have expected
    > behavior (a) for both letters ø and ö, or occasionally (b) for them.
    > *Nobody* expected the UCA-type inconsistency: behavior (c) for ø, but
    > behavior (a) for ö.

    Nobody -- not even me -- since the UCA behavior is (b) for O-slash (not
    (c)), and (a) for O-diaeresis.

    > Moreover, this is also inconsistent with any generative use of characters
    > like stroke, since they are always interfiled in UCA.

    Which *is* consistent with the encoding decisions that have been
    taken recently by the UTC to treat such overlaid diacritic characters
    as *separately* encoded atomic characters, not decomposed into
    sequences of base letters plus combining overlays.

    > More accurately, you believe that the correct behavior occurs. (Sadly,
    > using BOLDFACE doesn't make it more true.)

    Nor will prefacing your remarks with "Sadly" make them more convincing,
    sympathetic (or lachrymose).

    > But you offer no evidence. Ã. is seen as
    > a separate letter in the languages that use it, but UCA "interfiles" it.
    > Ł is also seen as a separate letter, and UCA doesn't. Let's hear some
    > evidence from your side, like people's reactions to the above cases.

    O.k. You got my reaction. *I* think that Michael is correct, and
    that the table treating the characters with overlaid and excrescent
    diacritics as atomic characters that sort *near* letters they
    are diacritic derivatives of, is correct behavior for a default.

    > > 5) if Mark wants to make a tailoring to interfile all these letters
    > > (which can only result in what I describe as "visual seasickness" to
    > > any poor users who have to actually read such wordlists.
    >
    > Again, no evidence. Let's look at a particular example, letters based on
    > "O". UCA *already* interleaves the list below (UCA O List). Adding John's
    > list to that would add only the two elements:
    >
    > 00F8; LATIN SMALL LETTER O WITH STROKE
    > 01FF; LATIN SMALL LETTER O WITH STROKE AND ACUTE
    >
    > I fail to see how your purported user would be swamped by the relative
    > magnitude of the change, which in the case of O would be adding about 1%
    > more interleaved O's. How is this addition going to cause "visual
    > seasickness", I wonder?
    >
    > UCA O List
    > ====================
    [Superfluous list of O's cited from allkeys.txt excised]

    This completely misses Michael's point; whether accidentally or
    deliberately for rhetorical effect is unclear.

    The "visual seasickness" has nothing whatsoever to do with what
    a *tailoring* table would look like, which is why your citation of 1%
    more interleaving and then citing the entire UCA O list is
    completely superfluous.

    What Michael is talking about is the following effect:

    ofofofo
    oføfofo
    øfofofo
    øfoføfø
    ofofofp

    in those instances where a user *does* consider ø to be
    a separate letter, in which case the current default behavior is
    better:

    ofofofo
    ofofofp
    oføfofo
    øfofofo
    øfoføfø

    The relative impact of this on a user and on the usability of
    the resulting sorted wordlists has everything to do with
    the frequency of the *letters* in question, and nothing
    whatsoever to do with the number of lines in a tailoring
    table.
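
    The two orderings above can be reproduced with a toy two-level comparison.
    This is only a sketch: the weights are invented, not taken from
    allkeys.txt, and `make_key` is a hypothetical helper that compares all
    primary weights first and uses secondary weights only to break ties.

```python
def make_key(weights):
    """Return a sort-key function for a {char: (primary, secondary)} table."""
    def key(word):
        primaries = tuple(weights[ch][0] for ch in word)
        secondaries = tuple(weights[ch][1] for ch in word)
        # Tuples compare left to right, so primaries decide before secondaries.
        return (primaries, secondaries)
    return key

# Interleaved: ø shares o's primary weight and differs only at the secondary level.
INTERLEAVED = {"f": (10, 0), "o": (20, 0), "ø": (20, 1), "p": (30, 0)}

# Separate but near: ø gets its own primary weight directly after o.
SEPARATE = {"f": (10, 0), "o": (20, 0), "ø": (21, 0), "p": (30, 0)}

WORDS = ["ofofofo", "oføfofo", "øfofofo", "øfoføfø", "ofofofp"]

interleaved_order = sorted(WORDS, key=make_key(INTERLEAVED))
separate_order = sorted(WORDS, key=make_key(SEPARATE))
# interleaved_order -> ofofofo, oføfofo, øfofofo, øfoføfø, ofofofp
# separate_order    -> ofofofo, ofofofp, oføfofo, øfofofo, øfoføfø
```

    The only difference between the two tables is one weight for ø, so how
    much a sorted list changes depends on how often ø occurs in the data, not
    on the size of the table.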

    > > 6) the Latin alphabet has a lot more than 26 letters in it. In this
    > > age of the Universal Character Set, "most users" would do better to
    > > get used to this than to be hobbled by older concepts.
    >
    > I agree with the general principle, but it has no bearing on the topic at
    > hand.

    Actually, it *does* have a significant bearing on the topic at
    hand. For many of us, æ *is* a separate letter of the Latin alphabet,
    and not a spelling equivalent of a+e. For many of us, ø *is* a
    separate letter of the Latin alphabet, and not some o with an
    unfamiliar diacritic on it. This is certainly the case for the
    most commonly used IPA letters in wide use in dictionary
    pronunciation guides and also in wide use in various language
    orthographies based on IPA use. It is much less clear for the
    various non-IPA letters with diacritic overlays, and in those
    instances I am more open to arguments that adjusting them to
    fit with particular, commonly used languages' ordering expectations
    might be useful for the default table.

    But any such changes would, I believe, be disruptive, as well,
    and the cost of the disruptions needs to be weighed against the
    (arguably) incrementally better behavior which would result in
    the default ordering.

    --Ken



    This archive was generated by hypermail 2.1.5 : Mon Jul 12 2004 - 11:19:55 CDT