Re: Changing UCA primary weights (bad idea)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Jul 09 2004 - 20:00:28 CDT


    You say:

    > Actually, there is: o-slash *is* treated as a separate letter in
    > the pronunciation guides of all IPA-based dictionaries, which constitute
    > the majority of the world's usage, currently.

    First, I don't know that UCA out of the box sorts IPA correctly -- nor do I
    have much of an idea what constitutes the "correct" IPA sorting. Does the
    IPA specification itself have any sorting requirements in it? If so, can you
    point to it?

    The fact that IPA treats these letters as *different* is completely
    beside the point. Everyone agrees that they are different characters: Å
    and A are different characters, but interleaved in UCA; Ł and L are
    different characters, but not interleaved in UCA.

    Secondly, the amount of sorted IPA data is going to be dwarfed by the amount
    of data sorted according to particular language conventions. A Swedish
    company with customers all over Europe is going to have names with all
    different accents in them, and the question is what a German (French,
    Italian,...) employee is going to expect for the ordering and/or matching
    when they sit down and look at a screen, after setting it to sort according
    to their language. Now of course, we could tailor each and every language
    that used these characters, but that's kinda defeating the point of the UCA.

    The number of people that are going to sit down and select an IPA sorting
    for such data -- do you really think that is significant in comparison?

    You also say:

    > And what answer you get will be influenced by how you
    > ask it. If you start out with "Do you think that all letters
    > that are like o's should sort together by default...?"

    All I did was give people examples like I presented in the document, and
    asked what their expectations were, without a leading question. And these
    were not all or even mostly Americans, despite your and Michael's
    assumptions.

    You then give the example:

    > What Michael is talking about is the following effect:

    (1a)
    ofofofo
    oføfofo
    øfofofo
    øfoføfø
    ofofofp

    > in those instances where a user *does* consider ø to be
    > a separate letter, in which case the current default behavior is
    > better:

    (1b)
    ofofofo
    ofofofp
    oføfofo
    øfofofo
    øfoføfø

    > The relative impact of this on a user and on the usability of
    > the resulting sorted wordlists has everything to do with
    > the frequency of the *letters* in question, and nothing
    > whatsoever to do with the number of lines in a tailoring
    > table.

    This just doesn't make any sense to me. You think that (2) is preferred to
    (1) for two cases of O's, but not for 102 others, like

    (2a)
    ofofofo
    ofơfofo
    ơfofofo
    ơfofơfơ
    ofofofp

    (2b)
    ofofofo
    ofofofp
    ofơfofo
    ơfofofo
    ơfofơfơ

    or

    (3a)
    ofofofo
    ofõfofo
    õfofofo
    õfofõfõ
    ofofofp

    (3b)
    ofofofo
    ofofofp
    ofõfofo
    õfofofo
    õfofõfõ

    Why do you find (1a) visually so disturbing, but don't find (2a) and (3a) or
    the other 51 cases visually disturbing? And of course visual disturbance of
    multiple characters in such artificial examples with multiple marks has
    little to do with sorting behavior.
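
    To make the two behaviors concrete, here is a minimal Python sketch of a
    two-level comparison over a toy alphabet (f < o < marked letter < p). The
    weights and the names make_keys, interleaved_key and separate_key are
    invented for illustration; they are not DUCET values or any library's API.
    With "ø" as the marked letter it reproduces lists (1a) and (1b):

```python
def make_keys(marked):
    """Return (interleaved_key, separate_key) for the toy alphabet f < o < marked < p."""
    # Hypothetical primary weights, chosen only to give the right relative order.
    primary = {"f": 1, "o": 2, marked: 3, "p": 4}

    def interleaved_key(word):
        # Level 1: the marked letter shares the primary weight of plain "o".
        level1 = tuple(primary["o"] if ch == marked else primary[ch] for ch in word)
        # Level 2: the mark only breaks primary ties, compared left to right.
        level2 = tuple(1 if ch == marked else 0 for ch in word)
        return (level1, level2)

    def separate_key(word):
        # The marked letter is its own letter, sorting after "o" and before "p".
        return tuple(primary[ch] for ch in word)

    return interleaved_key, separate_key


words = ["ofofofo", "oføfofo", "øfofofo", "øfoføfø", "ofofofp"]
interleaved_key, separate_key = make_keys("ø")

list_a = sorted(words, key=interleaved_key)  # the (1a) shape
list_b = sorted(words, key=separate_key)     # the (1b) shape

print("\n".join(list_a))
print()
print("\n".join(list_b))
```

    Passing "ơ" or "õ" (as a single precomposed code point) as the marked
    letter gives the (2a)/(2b) and (3a)/(3b) lists in the same way.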

    I suspect that your and Michael's intuitions are at this point not
    representative; you might try taking some neutral examples and actually
    trying them out -- without yourself asking a leading question. Even with
    non-Americans, I doubt that you will find that people are as inconsistent in
    their expectations as the UCA is.

    Mark

    ----- Original Message -----
    From: "Kenneth Whistler" <kenw@sybase.com>
    To: <mark.davis@jtcsv.com>
    Cc: <unicode@unicode.org>
    Sent: Friday, July 09, 2004 14:11
    Subject: Re: Changing UCA primary weights (bad idea)

    > Subject: Re: Changing UCA primarly weights (bad idea)
                                ^^^^^^^^

    Correcting the subject, just because it bugs me...

    > You are certainly right that this is not a slam-dunk; there are reasons for
    > and against it. And it may well be that the committee decides against it.

    Yes.

    > However, you overstate the situation with tailorings. The only tailorings
    > that would be affected are ones where the tailoring depends on inheriting
    > the order from the UCA for the affected characters.

    > So the number of tailorings that in practice would be affected I suspect to
    > be very small. However, if you have actual evidence of tailorings that would
    > be adversely affected by John Cowan's list, I would love to see it.

    The European Ordering Rules are the obvious first instance impacted
    heavily by this.

    And I think you may underestimate the number of tailorings that
    in practice would be affected. It depends, in part, on how the
    tailorings of UCA are implemented. I am positive that all of my
    tailorings for Sybase will be *affected*, for example. I don't think
    they will be *substantially* affected, in the sense of any complete
    redefinition of how the tailoring itself is defined. But the very
    fact of the proposed rearrangements will hit some internals, and
    it *will* result in changed key values, which effectively version
    the tailorings, as far as I am concerned.
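
    As a rough illustration of why changed key values amount to a new version,
    the sketch below builds byte-string sort keys from two hypothetical weight
    tables, one before and one after "ø" is given the same primary weight as
    "o". The tables, the key layout and the name sort_key are assumptions made
    up for this example, not Sybase's or anyone else's implementation:

```python
# Each entry is (primary, secondary). All values are invented for illustration.
OLD = {"f": (1, 0), "o": (2, 0), "ø": (3, 0), "p": (4, 0)}
NEW = {"f": (1, 0), "o": (2, 0), "ø": (2, 1), "p": (3, 0)}  # ø folded into o; p shifts down


def sort_key(word, table):
    """Toy sort key: primary weights, a zero separator, then secondary weights."""
    primaries = bytes(table[ch][0] for ch in word)
    secondaries = bytes(table[ch][1] for ch in word)
    return primaries + b"\x00" + secondaries


words = ["ofofofo", "ofofofp", "oføfofo"]

# Keys stored under the old table no longer match keys computed under the new one.
changed = [w for w in words if sort_key(w, OLD) != sort_key(w, NEW)]
print(changed)
```

    In this toy, the key for "ofofofp" changes even though it contains no ø,
    because closing the gap renumbers the letters after it. Any keys persisted
    under the old table would have to be regenerated before being compared
    with new ones.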

    > > 2) it proposes to reverse the *explicit* design principles that went
    > > into the default tailorable template in the *first* place. Similar
    > > letters are near -- but not interfiled with -- similar letters. This
    > > is MORE than enough to give any casual user the functionality he
    > > needs, because only in initial position is there likely to be any
    > > confusion in real-life sorted word lists, and even then, hooked-b
    > > follows bz, which is hardly burdensome for the end user.
    >
    > This also completely overstates the case. What we actually did was to put
    > similar letters near other letters, *and if their decompositions were the
    > same* we interfiled them. There is, however, little principled difference
    > between [[Editing down the list a bit: O-slash, O-with-horn,
    > and O-with-circumflex]]" that would cause a user to think that the
                                                ^^^^^^
                                                recte: a naive user unfamiliar
                                                       with IPA
    > some should be interfiled and some should not.

    Actually, there is: o-slash *is* treated as a separate letter in
    the pronunciation guides of all IPA-based dictionaries, which constitute
    the majority of the world's usage, currently. The same is not true
    for various o's with miscellaneous scattered accents sprinkled on.

    > In some languages these would
    > be seen as "separate letters" (e.g. with different primary weights) and in
    > others not; but that does not line up in any particular way with what is in
    > the UCA. (see also comment below).

    The more important correlation here that was attempted in the UCA was
    to keep the treatment more or less in synch with the formal decompositions
    provided in the UnicodeData.txt file. The more those diverge, the more
    difficult maintaining the tables becomes (technically) and the more
    difficult it becomes to hold the line (politically) against special
    claims from one constituency or another that "their" letter be weighted
    "correctly" in the default table for one reason or another.
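
    The decomposition data can be inspected directly. The sketch below assumes
    Python's standard unicodedata module, which reflects whatever Unicode
    version the interpreter ships with rather than the 2004 data; the helper
    name has_decomposition is made up for this example. It lists the
    decomposition mapping, if any, for several letters mentioned in this
    thread:

```python
import unicodedata

# Written as escapes so the code points are unambiguous: ö õ ơ å ø ł
letters = ["\u00f6", "\u00f5", "\u01a1", "\u00e5", "\u00f8", "\u0142"]


def has_decomposition(ch):
    """True if the character database gives this character a decomposition mapping."""
    return unicodedata.decomposition(ch) != ""


for ch in letters:
    mapping = unicodedata.decomposition(ch) or "(none)"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping}")
```

    The letters with diaeresis, tilde, horn and ring decompose into a base
    letter plus a combining mark, while o-with-stroke and l-with-stroke have
    no decomposition mapping.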

    > > 3) in discussions elsewhere, Mark has talked about what "most users"
    > > "expect" and I found his suggestion to be anglocentric and
    > > unsubstantiated.
    >
    > And I will refrain from saying what I think of your reasoning ability in
    > general, although circularity seems to be a particular specialty. I suggest
    > that we stick to the facts instead of ad hominem attacks.

    Oh, very clever. An ad hominem attack which consists of claiming
    that you will refrain from saying something ad hominem.

    > For user expectations, check out how foreign words with unusual accents are
    > sorted in a variety of languages. I have seen no reason to believe that
    > Germans or French or others behave much differently when faced with a letter
    > like ø that is not one that they use. The key is whether they would expect
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Actually, the key is whether they have any expectations based on their
    own language orthography's use of it *and* whether they have any
    expectations based on their own language's lexicographical practice
    in using it. I claim that outside the context of the United States,
    which has its own archaic and impoverished pronunciation guide
    practices, the majority of dictionary users in the world have some
    exposure to and familiarity with basic IPA letters. How that impacts
    their expectations regarding *interleaving* of IPA letters with
    other Latin letters is unclear, however, and probably is not something
    that could really be pinned down in the absence of ingrained, taught
    practice regarding what the rules *should* be in one culture or
    another. And what answer you get will be influenced by how you
    ask it. If you start out with "Do you think that all letters
    that are like o's should sort together by default...?" then you've
    already begged your answer.

    > to see:
    >
    > a) Interleaved:

    > b) Separate but near:

    > c) Like a particular language

    > People I've talked to, from various different backgrounds, have expected
    > behavior (a) for both letters ø and ö, or occasionally (b) for them.
    > *Nobody* expected the UCA-type inconsistency: behavior (c) for ø, but
    > behavior (a) for ö.

    Nobody -- not even me -- since the UCA behavior is (b) for O-slash (not
    (c)), and (a) for O-diaeresis.

    > Moreover, this is also inconsistent with any generative use of characters
    > like stroke, since they are always interfiled in UCA.

    Which *is* consistent with the encoding decisions that have been
    taken recently by the UTC to treat such overlaid diacritic characters
    as *separately* encoded atomic characters, not decomposed into
    sequences of base letters plus combining overlays.

    > More accurately, you believe that the correct behavior occurs. (Sadly, using
    > BOLDFACE doesn't make it more true.)

    Nor will prefacing your remarks with "Sadly" make them more convincing,
    sympathetic (or lachrymose).

    > But you offer no evidence. Å is seen as
    > a separate letter in the languages that use it, but UCA "interfiles" it. Ł
    > is also seen as a separate letter, and UCA doesn't. Let's hear some evidence
    > from your side, like people's reactions to the above cases.

    O.k. You got my reaction. *I* think that Michael is correct, and
    that the table treating the characters with overlaid and excrescent
    diacritics as atomic characters that sort *near* letters they
    are diacritic derivatives of, is correct behavior for a default.

    > > 5) if Mark wants to make a tailoring to interfile all these letters
    > > (which can only result in what I describe as "visual seasickness" to
    > > any poor users who have to actually read such wordlists.
    >
    > Again, no evidence. Let's look at a particular example, letters based on
    > "O". UCA *already* interleaves the list below (UCA O List). Adding John's
    > list to that would add only the two elements:
    >
    > 00F8; LATIN SMALL LETTER O WITH STROKE
    > 01FF; LATIN SMALL LETTER O WITH STROKE AND ACUTE
    >
    > I fail to see how your purported user would be swamped by the relative
    > magnitude of the change, which in the case of O would be adding about 1%
    > more interleaved O's. How is this addition going to cause "visual
    > seasickness", I wonder?
    >
    > UCA O List
    > ====================
    [Superfluous list of O's cited from allkeys.txt excised]

    This completely misses Michael's point, whether accidentally or
    deliberately for rhetorical effect is unclear.

    The "visual seasickness" has nothing whatsoever to do with what
    a *tailoring* table would look like, which is why your citation of 1%
    more interleaving and then citing the entire UCA O list is
    completely superfluous.

    What Michael is talking about is the following effect:

    ofofofo
    oføfofo
    øfofofo
    øfoføfø
    ofofofp

    in those instances where a user *does* consider ø to be
    a separate letter, in which case the current default behavior is
    better:

    ofofofo
    ofofofp
    oføfofo
    øfofofo
    øfoføfø

    The relative impact of this on a user and on the usability of
    the resulting sorted wordlists has everything to do with
    the frequency of the *letters* in question, and nothing
    whatsoever to do with the number of lines in a tailoring
    table.

    > > 6) the Latin alphabet has a lot more than 26 letters in it. In this
    > > age of the Universal Character Set, "most users" would do better to
    > > get used to this than to be hobbled by older concepts.
    >
    > I agree with the general principle, but it has no bearing on the topic at
    > hand.

    Actually, it *does* have a significant bearing on the topic at
    hand. For many of us, æ *is* a separate letter of the Latin alphabet,
    and not a spelling equivalent of a+e. For many of us, ø *is* a
    separate letter of the Latin alphabet, and not some o with an
    unfamiliar diacritic on it. This is certainly the case for the
    most commonly used IPA letters in wide use in dictionary
    pronunciation guides and also in wide use in various language
    orthographies based on IPA use. It is much less clear for the
    various non-IPA letters with diacritic overlays, and in those
    instances I am more open to arguments that adjusting them to
    fit with particular, commonly used language's ordering expectations
    might be useful for the default table.

    But any such changes would, I believe, be disruptive, as well,
    and the cost of the disruptions needs to be weighed against the
    (arguably) incrementally better behavior which would result in
    the default ordering.

    --Ken


