Re: Changing UCA primary weights (bad idea)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Jul 09 2004 - 20:00:28 CDT

Next message: Mike Ayers: "RE: Changing UCA primarly weights (bad idea)"

Previous message: Mark Davis: "Re: Changing UCA primarly weights (bad idea)"
In reply to: Kenneth Whistler: "Re: Changing UCA primary weights (bad idea)"
Next in thread: Mark Davis: "Re: Changing UCA primary weights (bad idea)"

You say:

> Actually, there is: o-slash *is* treated as a separate letter in
> the pronunciation guides of all IPA-based dictionaries, which constitute
> the majority of the world's usage, currently.

First, I don't know that UCA out of the box sorts IPA correctly -- nor do I
have much of an idea what constitutes the "correct" IPA sorting. Does the
IPA specification itself have any sorting requirements in it? If so, can you
point to it?

The fact that IPA uses these letters as being *different* is completely
beside the point. Everyone agrees that they are different characters: Å
and A are different characters, but interleaved in UCA; Ł and L are
different characters, but not interleaved in UCA.
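The difference between these two pairs lines up with the canonical decomposition data, which (as the quoted discussion further down notes) is what the default table follows when deciding what to interfile. A minimal sketch, assuming Python's standard `unicodedata` module as a stand-in for reading UnicodeData.txt directly; the helper name `base_letter` is hypothetical:

```python
import unicodedata

def base_letter(ch):
    """Return the first character of ch's canonical decomposition, or None.

    None means the character is atomic as far as the canonical
    decomposition data is concerned (for example, o with stroke).
    """
    decomp = unicodedata.decomposition(ch)
    # Compatibility decompositions start with a tag such as "<compat>";
    # only canonical decompositions are of interest here.
    if not decomp or decomp.startswith("<"):
        return None
    return chr(int(decomp.split()[0], 16))

# A-ring, L-stroke, o-diaeresis, o-stroke, o-horn, o-tilde
for ch in "\u00C5\u0141\u00F6\u00F8\u01A1\u00F5":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: base = {base_letter(ch)!r}")
```

Under this check Å, ö, ơ and õ each report a base letter, while Ł and ø report none. That matches which of them are interleaved, but it only inspects decompositions; it does not consult the actual collation weights.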

Secondly, the amount of sorted IPA data is going to be dwarfed by the amount
of data sorted according to particular language conventions. A Swedish
company with customers all over Europe is going to have names with all
different accents in them, and the question is what a German (French,
Italian,...) employee is going to expect for the ordering and/or matching
when they sit down and look at a screen, after setting it to sort according
to their language. Now of course, we could tailor each and every language
that used these characters, but that's kinda defeating the point of the UCA.

The number of people that are going to sit down and select an IPA sorting
for such data -- do you really think that is significant in comparison?

You also say:

> And what answer you get will be influenced by how you
> ask it. If you start out with "Do you think that all letters
> that are like o's should sort together by default...?"

All I did was give people examples like I presented in the document, and
asked what their expectations were, without a leading question. And these
were not all or even mostly Americans, despite your and Michael's
assumptions.

You then give the example:

> What Michael is talking about is the following effect:

(1a)
ofofofo
oføfofo
øfofofo
øfoføfø
ofofofp

> in those instances where a user *does* consider ø to be
> a separate letter, in which case the current default behavior is
> better:

(1b)
ofofofo
ofofofp
oføfofo
øfofofo
øfoføfø

> The relative impact of this on a user and on the usability of
> the resulting sorted wordlists has everything to do with
> the frequency of the *letters* in question, and nothing
> whatsoever to do with the number of lines in a tailoring
> table.

This just doesn't make any sense to me. You think that (1b) is preferred to
(1a) for two cases of O's, but not for 102 others, like

(2a)
ofofofo
ofơfofo
ơfofofo
ơfofơfơ
ofofofp

(2b)
ofofofo
ofofofp
ofơfofo
ơfofofo
ơfofơfơ

(3a)
ofofofo
ofõfofo
õfofofo
õfofõfõ
ofofofp

(3b)
ofofofo
ofofofp
ofõfofo
õfofofo
õfofõfõ

Why do you find (1a) visually so disturbing, but don't find (2a) and (3a) or
the other 51 cases visually disturbing? And of course visual disturbance of
multiple characters in such artificial examples with multiple marks has
little to do with sorting behavior.
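For readers who want to reproduce the (a) and (b) orderings, here is a toy sketch of the two behaviors. It is not the UCA algorithm and does not use real allkeys.txt weights: code points stand in for primary weights, and the `INTERLEAVE_BASE` table and both key functions are hypothetical names introduced only for illustration.

```python
# Hypothetical table: atomic letters that a table might treat as
# variants of a base letter. Only o with stroke is listed here.
INTERLEAVE_BASE = {"\u00F8": "o"}

def interleaved_key(word):
    """Behavior (a): o-stroke shares o's primary weight.

    The distinction is pushed down to a secondary level, so it only
    matters when two strings are otherwise identical at the primary level.
    """
    primary = []
    secondary = []
    for ch in word:
        base = INTERLEAVE_BASE.get(ch, ch)
        primary.append(ord(base))            # code point as stand-in weight
        secondary.append(0 if base == ch else 1)
    return (primary, secondary)

def separate_key(word):
    """Behavior (b): o-stroke has its own primary weight, just after o."""
    primary = []
    for ch in word:
        base = INTERLEAVE_BASE.get(ch, ch)
        # The second tuple element places the variant after its base
        # letter but before the next base letter (here, p).
        primary.append((ord(base), 0 if base == ch else 1))
    return (primary,)

words = ["ofofofp", "\u00F8fof\u00F8f\u00F8", "of\u00F8fofo",
         "\u00F8fofofo", "ofofofo"]

print(sorted(words, key=interleaved_key))  # the (1a) order
print(sorted(words, key=separate_key))     # the (1b) order
```

Adding o-horn or o-tilde to the table in the same way reproduces the (2) and (3) pairs, which is the point of the comparison: in this sketch nothing distinguishes the three cases except which letters are listed.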

I suspect that your and Michael's intuitions are at this point not
representative; you might try taking some neutral examples and actually
trying them out, without yourself asking a leading question. Even with
non-Americans, I doubt that you will find that people are as inconsistent in
their expectations as the UCA is.

Mark

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: <mark.davis@jtcsv.com>
Cc: <unicode@unicode.org>
Sent: Friday, July 09, 2004 14:11
Subject: Re: Changing UCA primary weights (bad idea)

> Subject: Re: Changing UCA primarly weights (bad idea)
                            ^^^^^^^^

Correcting the subject, just because it bugs me...

> You are certainly right that this is not a slam-dunk; there are reasons for
> and against it. And it may well be that the committee decides against it.

Yes.

> However, you overstate the situation with tailorings. The only tailorings
> that would be affected are ones where the tailoring depends on inheriting
> the order from the UCA for the affected characters.

> So the number of tailorings that in practice would be affected I suspect to
> be very small. However, if you have actual evidence of tailorings that would
> be adversely affected by John Cowan's list, I would love to see it.

The European Ordering Rules are the obvious first instance impacted
heavily by this.

And I think you may underestimate the number of tailorings that
in practice would be affected. It depends, in part, on how the
tailorings of UCA are implemented. I am positive that all of my
tailorings for Sybase will be *affected*, for example. I don't think
they will be *substantially* affected, in the sense of any complete
redefinition of how the tailoring itself is defined. But the very
fact of the proposed rearrangements will hit some internals, and
it *will* result in changed key values, which effectively version
the tailorings, as far as I am concerned.

> > 2) it proposes to reverse the *explicit* design principles that went
> > into the default tailorable template in the *first* place. Similar
> > letters are near -- but not interfiled with -- similar letters. This
> > is MORE than enough to give any casual user the functionality he
> > needs, because only in initial position is there likely to be any
> > confusion in real-life sorted word lists, and even then, hooked-b
> > follows bz, which is hardly burdensome for the end user.
>
> This also completely overstates the case. What we actually did was to put
> similar letters near other letters, *and if their decompositions were the
> same* we interfiled them. There is, however, little principled difference
> between [[Editing down the list a bit: O-slash, O-with-horn,
> and O-with-circumflex]]" that would cause a user to think that the
                                            ^^^^^^
                                            recte: a naive user unfamiliar
                                                   with IPA
> some should be interfiled and some should not.

Actually, there is: o-slash *is* treated as a separate letter in
the pronunciation guides of all IPA-based dictionaries, which constitute
the majority of the world's usage, currently. The same is not true
for various o's with miscellaneous scattered accents sprinkled on.

> In some languages these would
> be seen as "separate letters" (e.g. with different primary weights) and in
> others not; but that does not line up in any particular way with what is in
> the UCA. (see also comment below).

The more important correlation here that was attempted in the UCA was
to keep the treatment more or less in synch with the formal decompositions
provided in the UnicodeData.txt file. The more those diverge, the more
difficult maintaining the tables becomes (technically) and the more
difficult it becomes to hold the line (politically) against special
claims from one constituency or another that "their" letter be weighted
"correctly" in the default table for one reason or another.

> > 3) in discussions elsewhere, Mark has talked about what "most users"
> > "expect" and I found his suggestion to be anglocentric and
> > unsubstantiated.
>
> And I will refrain from saying what I think of your reasoning ability in
> general, although circularity seems to be a particular specialty. I suggest
> that we stick to the facts instead of ad hominem attacks.

Oh, very clever. An ad hominem attack which consists of claiming
that you will refrain from saying something ad hominem.

> For user expectations, check out how foreign words with unusual accents are
> sorted in a variety of languages. I have seen no reason to believe that
> Germans or French or others behave much differently when faced with a letter
> like ø that is not one that they use. The key is whether they would expect
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Actually, the key is whether they have any expectations based on their
own language orthography's use of it *and* whether they have any
expectations based on their own language's lexicographical practice
in using it. I claim that outside the context of the United States,
which has its own archaic and impoverished pronunciation guide
practices, the majority of dictionary users in the world have some
exposure to and familiarity with basic IPA letters. How that impacts
their expectations regarding *interleaving* of IPA letters with
other Latin letters is unclear, however, and probably is not something
that could really be pinned down in the absence of ingrained, taught
practice regarding what the rules *should* be in one culture or
another. And what answer you get will be influenced by how you
ask it. If you start out with "Do you think that all letters
that are like o's should sort together by default...?" then you've
already begged your answer.

> to see:
>
> a) Interleaved:

> b) Separate but near:

> c) Like a particular language

> People I've talked to, from various different backgrounds, have expected
> behavior (a) for both letters ø and ö, or occasionally (b) for them.
> *Nobody* expected the UCA-type inconsistency: behavior (c) for ø, but
> behavior (a) for ö.

Nobody -- not even me -- since the UCA behavior is (b) for O-slash (not
(c)), and (a) for O-diaeresis.

> Moreover, this is also inconsistent with any generative use of characters
> like stroke, since they are always interfiled in UCA.

Which *is* consistent with the encoding decisions that have been
taken recently by the UTC to treat such overlaid diacritic characters
as *separately* encoded atomic characters, not decomposed into
sequences of base letters plus combining overlays.

> More accurately, you believe that the correct behavior occurs. (Sadly, using
> BOLDFACE doesn't make it more true.)

Nor will prefacing your remarks with "Sadly" make them more convincing,
sympathetic (or lachrymose).

> But you offer no evidence. Å is seen as
> a separate letter in the languages that use it, but UCA "interfiles" it. Ł
> is also seen as a separate letter, and UCA doesn't. Let's hear some evidence
> from your side, like people's reactions to the above cases.

O.k. You got my reaction. *I* think that Michael is correct, and
that the table treating the characters with overlaid and excrescent
diacritics as atomic characters that sort *near* letters they
are diacritic derivatives of, is correct behavior for a default.

> > 5) if Mark wants to make a tailoring to interfile all these letters
> > (which can only result in what I describe as "visual seasickness" to
> > any poor users who have to actually read such wordlists.
>
> Again, no evidence. Let's look at a particular example, letters based on
> "O". UCA *already* interleaves the list below (UCA O List). Adding John's
> list to that would add only the two elements:
>
> 00F8; LATIN SMALL LETTER O WITH STROKE
> 01FF; LATIN SMALL LETTER O WITH STROKE AND ACUTE
>
> I fail to see how your purported user would be swamped by the relative
> magnitude of the change, which in the case of O would be adding about 1% more
> interleaved O's. How is this addition going to cause "visual seasickness", I
> wonder?
>
> UCA O List
> ====================
[Superfluous list of O's cited from allkeys.txt excised]

This completely misses Michael's point, whether accidentally or
deliberately for rhetorical effect is unclear.

The "visual seasickness" has nothing whatsoever to do with what
a *tailoring* table would look like, which is why your citation of 1%
more interleaving and then citing the entire UCA O list is
completely superfluous.

What Michael is talking about is the following effect:

ofofofo
oføfofo
øfofofo
øfoføfø
ofofofp

in those instances where a user *does* consider ø to be
a separate letter, in which case the current default behavior is
better:

ofofofo
ofofofp
oføfofo
øfofofo
øfoføfø

The relative impact of this on a user and on the usability of
the resulting sorted wordlists has everything to do with
the frequency of the *letters* in question, and nothing
whatsoever to do with the number of lines in a tailoring
table.

> > 6) the Latin alphabet has a lot more than 26 letters in it. In this
> > age of the Universal Character Set, "most users" would do better to
> > get used to this than to be hobbled by older concepts.
>
> I agree with the general principle, but it has no bearing on the topic at
> hand.

Actually, it *does* have a significant bearing on the topic at
hand. For many of us, æ *is* a separate letter of the Latin alphabet,
and not a spelling equivalent of a+e. For many of us, ø *is* a
separate letter of the Latin alphabet, and not some o with an
unfamiliar diacritic on it. This is certainly the case for the
most commonly used IPA letters in wide use in dictionary
pronunciation guides and also in wide use in various language
orthographies based on IPA use. It is much less clear for the
various non-IPA letters with diacritic overlays, and in those
instances I am more open to arguments that adjusting them to
fit with particular, commonly used languages' ordering expectations
might be useful for the default table.

But any such changes would, I believe, be disruptive, as well,
and the cost of the disruptions needs to be weighed against the
(arguably) incrementally better behavior which would result in
the default ordering.

--Ken


This archive was generated by hypermail 2.1.5 : Fri Jul 09 2004 - 20:01:11 CDT