Re: Changing UCA primary weights (bad idea)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jul 09 2004 - 16:11:31 CDT

Next message: E. Keown: "Re: Arabic written in Syriac? Arabic written in Tifinagh?"

Previous message: Peter Kirk: "Re: Looking for transcription or transliteration standards latin- >arabic"
Next in thread: Mark Davis: "Re: Changing UCA primary weights (bad idea)"
Reply: Mark Davis: "Re: Changing UCA primary weights (bad idea)"
Reply: Mark Davis: "Re: Changing UCA primary weights (bad idea)"

> Subject: Re: Changing UCA primarly weights (bad idea)
                            ^^^^^^^^

Correcting the subject, just because it bugs me...

> You are certainly right that this is not a slam-dunk; there are reasons for
> and against it. And it may well be that the committee decides against it.

Yes.

> However, you overstate the situation with tailorings. The only tailorings
> that would be affected are ones where the tailoring depends on inheriting
> the order from the UCA for the affected characters.

> So the number of tailorings that in practice would be affected I suspect to
> be very small. However, if you have actual evidence of tailorings that would
> be adversely affected by John Cowan's list, I would love to see it.

The European Ordering Rules are the obvious first instance impacted
heavily by this.

And I think you may underestimate the number of tailorings that
in practice would be affected. It depends, in part, on how the
tailorings of UCA are implemented. I am positive that all of my
tailorings for Sybase will be *affected*, for example. I don't think
they will be *substantially* affected, in the sense of any complete
redefinition of how the tailoring itself is defined. But the very
fact of the proposed rearrangements will hit some internals, and
it *will* result in changed key values, which effectively version
the tailorings, as far as I am concerned.
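
As a rough illustration of why rearranged primaries version a
tailoring, here is a minimal sketch with a toy weight table. The
helper names (build_primary_table, sort_key) and the dense numbering
are assumptions made for the example, not how allkeys.txt or any real
implementation assigns weights. It shows only that keys generated
under one arrangement stop matching keys generated under another, even
for strings that do not contain the moved letter.

```python
def build_primary_table(groups, start=1):
    """Assign one primary weight per group of characters, in list order.

    Toy model only: real tables also carry secondary and tertiary weights.
    """
    return {ch: start + i for i, group in enumerate(groups) for ch in group}


def sort_key(word, table):
    """Build a primary-only sort key as a tuple of weights."""
    return tuple(table[ch] for ch in word)


# Hypothetical "before" table: o-slash has its own primary, right after o.
V1 = build_primary_table([["f"], ["o"], ["\u00f8"], ["p"]])
# Hypothetical "after" table: o-slash shares the primary of o.
V2 = build_primary_table([["f"], ["o", "\u00f8"], ["p"]])

# Keys that were generated and stored under the old table.
stored = {w: sort_key(w, V1) for w in ["ofofofo", "ofofofp", "\u00f8fofofo"]}

# Words whose freshly generated key no longer matches the stored one.
changed = [w for w in stored if sort_key(w, V2) != stored[w]]
print(changed)
```

In this toy model the key for "ofofofp" changes although it contains
no o-slash, because the weight of p shifts when a primary slot is
removed. Any index built on stored keys would have to be regenerated.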

> > 2) it proposes to reverse the *explicit* design principles that went
> > into the default tailorable template in the *first* place. Similar
> > letters are near -- but not interfiled with -- similar letters. This
> > is MORE than enough to give any casual user the functionality he
> > needs, because only in initial position is there likely to be any
> > confusion in real-life sorted word lists, and even then, hooked-b
> > follows bz, which is hardly burdensome for the end user.
>
> This also completely overstates the case. What we actually did was to put
> similar letters near other letters, *and if their decompositions were the
> same* we interfiled them. There is, however, little principled difference
> between [[Editing down the list a bit: O-slash, O-with-horn,
> and O-with-circumflex]] that would cause a user to think that the
                                           ^^^^^^
                                           recte: a naive user unfamiliar
                                                  with IPA
> some should be interfiled and some should not.

Actually, there is: o-slash *is* treated as a separate letter in
the pronunciation guides of all IPA-based dictionaries, which constitute
the majority of the world's usage, currently. The same is not true
for various o's with miscellaneous scattered accents sprinkled on.

> In some languages these would
> be seen as "separate letters" (e.g. with different primary weights) and in
> others not; but that does not line up in any particular way with what is in
> the UCA. (see also comment below).

The more important correlation here that was attempted in the UCA was
to keep the treatment more or less in synch with the formal decompositions
provided in the UnicodeData.txt file. The more those diverge, the more
difficult maintaining the tables becomes (technically) and the more
difficult it becomes to hold the line (politically) against special
claims from one constituency or another that "their" letter be weighted
"correctly" in the default table for one reason or another.
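
The decomposition data in question can be inspected directly. The
following is a small sketch using Python's standard unicodedata
module, which exposes the decomposition field of UnicodeData.txt; the
helper name decomposition_mapping is made up for the example. It
shows that o-diaeresis and o-circumflex decompose to o plus a
combining mark, while o-slash has no decomposition at all.

```python
import unicodedata


def decomposition_mapping(ch):
    """Return the decomposition field for ch as a list of hex code points.

    Formatting tags such as <compat> are dropped; an empty list means the
    character has no decomposition in the Unicode Character Database.
    """
    field = unicodedata.decomposition(ch)
    return [part for part in field.split() if not part.startswith("<")]


# o-diaeresis, o-circumflex, o-slash, o-slash-acute
for ch in ["\u00f6", "\u00f4", "\u00f8", "\u01ff"]:
    mapping = decomposition_mapping(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping or 'none'}")
```

Note that U+01FF decomposes to U+00F8 plus a combining acute, so it
stays attached to o-slash rather than to plain o.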

> > 3) in discussions elsewhere, Mark has talked about what "most users"
> > "expect" and I found his suggestion to be anglocentric and
> > unsubstantiated.
>
> And I will refrain from saying what I think of your reasoning ability in
> general, although circularity seems to be a particular specialty. I suggest
> that we stick to the facts instead of ad hominem attacks.

Oh, very clever. An ad hominem attack which consists of claiming
that you will refrain from saying something ad hominem.

> For user expectations, check out how foreign words with unusual accents are
> sorted in a variety of languages. I have seen no reason to believe that
> Germans or French or others behave much differently when faced with a letter
> like ø that is not one that they use. The key is whether they would expect
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Actually, the key is whether they have any expectations based on their
own language orthography's use of it *and* whether they have any
expectations based on their own language's lexicographical practice
in using it. I claim that outside the context of the United States,
which has its own archaic and impoverished pronunciation guide
practices, the majority of dictionary users in the world have some
exposure to and familiarity with basic IPA letters. How that impacts
their expectations regarding *interleaving* of IPA letters with
other Latin letters is unclear, however, and probably is not something
that could really be pinned down in the absence of ingrained, taught
practice regarding what the rules *should* be in one culture or
another. And what answer you get will be influenced by how you
ask it. If you start out with "Do you think that all letters
that are like o's should sort together by default...?" then you've
already begged your answer.

> to see:
>
> a) Interleaved:

> b) Separate but near:

> c) Like a particular language

> People I've talked to, from various different backgrounds, have expected
> behavior (a) for both letters ø and ö, or occasionally (b) for them.
> *Nobody* expected the UCA-type inconsistency: behavior (c) for ø, but
> behavior (a) for ö.

Nobody -- not even me -- since the UCA behavior is (b) for O-slash (not
(c)), and (a) for O-diaeresis.

> Moreover, this is also inconsistent with any generative use of characters
> like stroke, since they are always interfiled in UCA.

Which *is* consistent with the encoding decisions that have been
taken recently by the UTC to treat such overlaid diacritic characters
as *separately* encoded atomic characters, not decomposed into
sequences of base letters plus combining overlays.

> More accurately, you believe that the correct behavior occurs. (Sadly, using
> BOLDFACE doesn't make it more true.)

Nor will prefacing your remarks with "Sadly" make them more convincing,
sympathetic (or lachrymose).

> But you offer no evidence. Å is seen as
> a separate letter in the languages that use it, but UCA "interfiles" it. Å
> is also seen as a separate letter, and UCA doesn't. Let's hear some evidence
> from your side, like people's reactions to the above cases.

O.k. You got my reaction. *I* think that Michael is correct, and
that the table treating the characters with overlaid and excrescent
diacritics as atomic characters that sort *near* letters they
are diacritic derivatives of, is correct behavior for a default.

> > 5) if Mark wants to make a tailoring to interfile all these letters
> > (which can only result in what I describe as "visual seasickness" to
> > any poor users who have to actually read such wordlists.
>
> Again, no evidence. Let's look at a particular example, letters based on
> "O". UCA *already* interleaves the list below (UCA O List). Adding John's
> list to that would add only the two elements:
>
> 00F8; LATIN SMALL LETTER O WITH STROKE
> 01FF; LATIN SMALL LETTER O WITH STROKE AND ACUTE
>
> I fail to see how your purported user would be swamped by the relative magnitude of
> the change, which in the case of O would be adding about 1% more interleaved
> O's. How is this addition going to cause "visual seasickness", I wonder?
>
> UCA O List
> ====================
[Superfluous list of O's cited from allkeys.txt excised]

This completely misses Michael's point, whether accidentally or
deliberately for rhetorical effect is unclear.

The "visual seasickness" has nothing whatsoever to do with what
a *tailoring* table would look like, which is why your citation of 1%
more interleaving and then citing the entire UCA O list is
completely superfluous.

What Michael is talking about is the following effect:

ofofofo
oføfofo
øfofofo
øfoføfø
ofofofp

in those instances where a user *does* consider ø to be
a separate letter, in which case the current default behavior is
better:

ofofofo
ofofofp
oføfofo
øfofofo
øfoføfø

The relative impact of this on a user and on the usability of
the resulting sorted wordlists has everything to do with
the frequency of the *letters* in question, and nothing
whatsoever to do with the number of lines in a tailoring
table.
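
The two orderings above can be reproduced with a minimal sketch. The
key functions below are toy stand-ins written for this example, not
the UCA algorithm: interfiled_key gives ø the same primary as o and
separates them only at a secondary level, while separate_key gives ø
its own primary placed immediately after o.

```python
WORDS = ["ofofofo", "oføfofo", "øfofofo", "øfoføfø", "ofofofp"]
O_SLASH = "\u00f8"


def interfiled_key(word):
    """ø shares the primary weight of o; the difference is only secondary."""
    primary = word.replace(O_SLASH, "o")
    secondary = tuple(1 if ch == O_SLASH else 0 for ch in word)
    return (primary, secondary)


def separate_key(word):
    """ø is a distinct primary that sorts right after o and before p."""
    return tuple(
        (ord("o"), 1) if ch == O_SLASH else (ord(ch), 0) for ch in word
    )


interfiled = sorted(WORDS, key=interfiled_key)
separate = sorted(WORDS, key=separate_key)

print("interfiled:", interfiled)
print("separate:  ", separate)
```

With the interfiled key, "ofofofp" lands after every word containing
ø; with the separate key, it lands directly after "ofofofo". The more
frequent the letter, the more often a reader meets the first pattern.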

> > 6) the Latin alphabet has a lot more than 26 letters in it. In this
> > age of the Universal Character Set, "most users" would do better to
> > get used to this than to be hobbled by older concepts.
>
> I agree with the general principle, but it has no bearing on the topic at
> hand.

Actually, it *does* have a significant bearing on the topic at
hand. For many of us, æ *is* a separate letter of the Latin alphabet,
and not a spelling equivalent of a+e. For many of us, ø *is* a
separate letter of the Latin alphabet, and not some o with an
unfamiliar diacritic on it. This is certainly the case for the
most commonly used IPA letters in wide use in dictionary
pronunciation guides and also in wide use in various language
orthographies based on IPA use. It is much less clear for the
various non-IPA letters with diacritic overlays, and in those
instances I am more open to arguments that adjusting them to
fit with particular, commonly used languages' ordering expectations
might be useful for the default table.

But any such changes would, I believe, be disruptive, as well,
and the cost of the disruptions needs to be weighed against the
(arguably) incrementally better behavior which would result in
the default ordering.

--Ken


This archive was generated by hypermail 2.1.5 : Fri Jul 09 2004 - 16:12:09 CDT