Re: Changing UCA primary weights (bad idea)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Jul 12 2004 - 11:18:35 CDT

Next message: Mark Davis: "Re: Changing UCA primary weights (bad idea)"

Previous message: Geoff Back: "RE: Problems Reading Saved Files With Unicode Names"
In reply to: Kenneth Whistler: "Re: Changing UCA primary weights (bad idea)"
Next in thread: Markus Scherer: "Re: Changing UCA primary weights (bad idea)"
Reply: Markus Scherer: "Re: Changing UCA primary weights (bad idea)"

> I am positive that all of my
> tailorings for Sybase will be *affected*, for example. I don't think
> they will be *substantially* affected, in the sense of any complete
> redefinition of how the tailoring itself is defined. But the very
> fact of the proposed rearrangements will hit some internals, and
> it *will* result in changed key values, which effectively version
> the tailorings, as far as I am concerned.

We know that if the UCA is changed in any way, then any tailoring is
affected in that it will produce a different ordering for some characters
(any that it does not explicitly override). So any implementation's
versioning scheme must take account of this. This will always be the case,
unless we completely freeze the UCA, disallowing fixes for, say, Indic
characters. But the UTC clearly has not agreed to do this in the past; while
stability is very important, we have left ourselves the ability to make
changes in the UCA when warranted. So the key here is to assess whether any
proposed changes are warranted.

So the question is whether Sybase tailorings, such as German, will be
affected positively or negatively, and to what degree. If a German customer
is accessing a database full of European names, and expects to find Ę with
E, and Ą with A and Ż with Z and Ł with L, then he will be right *except*
for the last one.

And this has consequences; it is not just an academic exercise. If s/he
expects that a database SELECT of all client names starting with "L" will
include the "Ł" names also, then s/he will get the wrong answer in a
financial report -- and probably not realizing it is wrong. If s/he looks
for a client name Słownik* within a page of Sl... and doesn't think to look
3 pages down after Sz, then s/he will get the wrong answer. If s/he searches
for a name within a body of text using a weak language-sensitive match, and
doesn't find it, then s/he will get the wrong answer.
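
To make the prefix case concrete, here is a minimal Python sketch. It is not UCA data and not Sybase's implementation: it approximates "same primary weight" by dropping combining marks after canonical decomposition, and the names, the STROKE_FOLD table, and both function names are made up for illustration.

```python
import unicodedata

# Hypothetical extra folding for letters with overlaid strokes.
# These letters have no canonical decomposition, so NFD leaves them alone.
STROKE_FOLD = {"\u0141": "L", "\u0142": "l", "\u00d8": "O", "\u00f8": "o"}


def primary_letters(text, fold_strokes=False):
    """Rough stand-in for a primary-strength key: base letters only."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # drop ogonek, dot above, and other combining marks
        if fold_strokes:
            ch = STROKE_FOLD.get(ch, ch)
        out.append(ch.casefold())
    return "".join(out)


def select_by_initial(names, initial, fold_strokes=False):
    """Toy version of a SELECT of all names starting with `initial`."""
    want = primary_letters(initial, fold_strokes)
    return [n for n in names
            if primary_letters(n, fold_strokes).startswith(want)]


# Made-up client names.
NAMES = ["Lange", "\u0141uczak", "Zimmer", "\u017bak"]
print(select_by_initial(NAMES, "Z"))
print(select_by_initial(NAMES, "L"))
print(select_by_initial(NAMES, "L", fold_strokes=True))
```

In this sketch, asking for "Z" returns both Z names, because Ż decomposes to Z plus a combining dot. Asking for "L" returns only Lange unless the extra stroke folding is switched on.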

Mark

* Made-up name; someone might give better example.

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: <mark.davis@jtcsv.com>
Cc: <unicode@unicode.org>
Sent: Friday, July 09, 2004 14:11
Subject: Re: Changing UCA primary weights (bad idea)

> Subject: Re: Changing UCA primarly weights (bad idea)
                            ^^^^^^^^

Correcting the subject, just because it bugs me...

> You are certainly right that this is not a slam-dunk; there are reasons for
> and against it. And it may well be that the committee decides against it.

Yes.

> However, you overstate the situation with tailorings. The only tailorings
> that would be affected are ones where the tailoring depends on inheriting
> the order from the UCA for the affected characters.

> So the number of tailorings that in practice would be affected I suspect to
> be very small. However, if you have actual evidence of tailorings that would
> be adversely affected by John Cowan's list, I would love to see it.

The European Ordering Rules are the obvious first instance impacted
heavily by this.

And I think you may underestimate the number of tailorings that
in practice would be affected. It depends, in part, on how the
tailorings of UCA are implemented. I am positive that all of my
tailorings for Sybase will be *affected*, for example. I don't think
they will be *substantially* affected, in the sense of any complete
redefinition of how the tailoring itself is defined. But the very
fact of the proposed rearrangements will hit some internals, and
it *will* result in changed key values, which effectively version
the tailorings, as far as I am concerned.

> > 2) it proposes to reverse the *explicit* design principles that went
> > into the default tailorable template in the *first* place. Similar
> > letters are near -- but not interfiled with -- similar letters. This
> > is MORE than enough to give any casual user the functionality he
> > needs, because only in initial position is there likely to be any
> > confusion in real-life sorted word lists, and even then, hooked-b
> > follows bz, which is hardly burdensome for the end user.
>
> This also completely overstates the case. What we actually did was to put
> similar letters near other letters, *and if their decompositions were the
> same* we interfiled them. There is, however, little principled difference
> between [[Editing down the list a bit: O-slash, O-with-horn,
> and O-with-circumflex]]" that would cause a user to think that the
                                            ^^^^^^
                                            recte: a naive user unfamiliar
                                                   with IPA
> some should be interfiled and some should not.

Actually, there is: o-slash *is* treated as a separate letter in
the pronunciation guides of all IPA-based dictionaries, which constitute
the majority of the world's usage, currently. The same is not true
for various o's with miscellaneous scattered accents sprinkled on.

> In some languages these would
> be seen as "separate letters" (e.g. with different primary weights) and in
> others not; but that does not line up in any particular way with what is in
> the UCA. (See also comment below.)

The more important correlation here that was attempted in the UCA was
to keep the treatment more or less in synch with the formal decompositions
provided in the UnicodeData.txt file. The more those diverge, the more
difficult maintaining the tables becomes (technically) and the more
difficult it becomes to hold the line (politically) against special
claims from one constituency or another that "their" letter be weighted
"correctly" in the default table for one reason or another.
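
The decomposition fields referred to here can be inspected with Python's standard unicodedata module, which reports the decomposition mapping for whatever Unicode version the interpreter ships. A small sketch follows; the helper name is invented for this illustration, and nothing in it reads allkeys.txt or computes collation weights.

```python
import unicodedata


def has_canonical_decomposition(ch):
    """True if the character has a canonical (non-compatibility) decomposition."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>".
    return bool(mapping) and not mapping.startswith("<")


# o-diaeresis, o-circumflex, o-horn, o-stroke, l-stroke
for ch in "\u00f6\u00f4\u01a1\u00f8\u0142":
    mapping = unicodedata.decomposition(ch) or "(none)"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping}")
```

Here ö, ô and ơ report a base o plus a combining mark, while ø and ł report no decomposition at all. That is the line the default table followed.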

> > 3) in discussions elsewhere, Mark has talked about what "most users"
> > "expect" and I found his suggestion to be anglocentric and
> > unsubstantiated.
>
> And I will refrain from saying what I think of your reasoning ability in
> general, although circularity seems to be a particular specialty. I suggest
> that we stick to the facts instead of ad hominem attacks.

Oh, very clever. An ad hominem attack which consists of claiming
that you will refrain from saying something ad hominem.

> For user expectations, check out how foreign words with unusual accents are
> sorted in a variety of languages. I have seen no reason to believe that
> Germans or French or others behave much differently when faced with a letter
> like ø that is not one that they use. The key is whether they would expect
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Actually, the key is whether they have any expectations based on their
own language orthography's use of it *and* whether they have any
expectations based on their own language's lexicographical practice
in using it. I claim that outside the context of the United States,
which has its own archaic and impoverished pronunciation guide
practices, the majority of dictionary users in the world have some
exposure to and familiarity with basic IPA letters. How that impacts
their expectations regarding *interleaving* of IPA letters with
other Latin letters is unclear, however, and probably is not something
that could really be pinned down in the absence of ingrained, taught
practice regarding what the rules *should* be in one culture or
another. And what answer you get will be influenced by how you
ask it. If you start out with "Do you think that all letters
that are like o's should sort together by default...?" then you've
already begged your answer.

> to see:
>
> a) Interleaved:

> b) Separate but near:

> c) Like a particular language

> People I've talked to, from various different backgrounds, have expected
> behavior (a) for both letters ø and ö, or occasionally (b) for them.
> *Nobody* expected the UCA-type inconsistency: behavior (c) for ø, but
> behavior (a) for ö.

Nobody -- not even me -- since the UCA behavior is (b) for O-slash (not
(c)), and (a) for O-diaeresis.

> Moreover, this is also inconsistent with any generative use of characters
> like stroke, since they are always interfiled in UCA.

Which *is* consistent with the encoding decisions that have been
taken recently by the UTC to treat such overlaid diacritic characters
as *separately* encoded atomic characters, not decomposed into
sequences of base letters plus combining overlays.

> More accurately, you believe that the correct behavior occurs. (Sadly, using
> BOLDFACE doesn't make it more true.)

Nor will prefacing your remarks with "Sadly" make them more convincing,
sympathetic (or lachrymose).

> But you offer no evidence. Ö is seen as
> a separate letter in the languages that use it, but UCA "interfiles" it. Å
> is also seen as a separate letter, and UCA doesn't. Let's hear some evidence
> from your side, like people's reactions to the above cases.

O.k. You got my reaction. *I* think that Michael is correct, and
that the table treating the characters with overlaid and excrescent
diacritics as atomic characters that sort *near* letters they
are diacritic derivatives of, is correct behavior for a default.

> > 5) if Mark wants to make a tailoring to interfile all these letters
> > (which can only result in what I describe as "visual seasickness" to
> > any poor users who have to actually read such wordlists.
>
> Again, no evidence. Let's look at a particular example, letters based on
> "O". UCA *already* interleaves the list below (UCA O List). Adding John's
> list to that would add only the two elements:
>
> 00F8; LATIN SMALL LETTER O WITH STROKE
> 01FF; LATIN SMALL LETTER O WITH STROKE AND ACUTE
>
> I fail to see how your purported user would be swamped by the relative
> magnitude of the change, which in the case of O would be adding about 1% more
> interleaved O's. How is this addition going to cause "visual seasickness", I
> wonder?
>
> UCA O List
> ====================
[Superfluous list of O's cited from allkeys.txt excised]

This completely misses Michael's point, whether accidentally or
deliberately for rhetorical effect is unclear.

The "visual seasickness" has nothing whatsoever to do with what
a *tailoring* table would look like, which is why your citation of 1%
more interleaving and then citing the entire UCA O list is
completely superfluous.

What Michael is talking about is the following effect:

ofofofo
oføfofo
øfofofo
øfoføfø
ofofofp

in those instances where a user *does* consider ø to be
a separate letter, in which case the current default behavior is
better:

ofofofo
ofofofp
oføfofo
øfofofo
øfoføfø

The relative impact of this on a user and on the usability of
the resulting sorted wordlists has everything to do with
the frequency of the *letters* in question, and nothing
whatsoever to do with the number of lines in a tailoring
table.
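
The two orderings above can be reproduced with a toy two-level sort key in Python. This is only a sketch under stated assumptions: the weights are invented, not taken from allkeys.txt, and make_key and WORDS are names made up for the illustration. With interleave=True, ø shares o's primary weight and differs only at the secondary level; with interleave=False, ø gets its own primary weight just after o.

```python
O_STROKE = "\u00f8"  # precomposed LATIN SMALL LETTER O WITH STROKE


def make_key(interleave):
    """Build a toy (primary, secondary) sort key; the weights are invented."""
    def key(word):
        primary, secondary = [], []
        for ch in word:
            if ch == O_STROKE and interleave:
                primary.append((ord("o"), 0))  # same primary weight as o
                secondary.append(1)            # differs only at level 2
            elif ch == O_STROKE:
                primary.append((ord("o"), 1))  # own primary, after o, before p
                secondary.append(0)
            else:
                primary.append((ord(ch), 0))
                secondary.append(0)
        # All primary weights are compared before any secondary weight.
        return (primary, secondary)
    return key


WORDS = ["øfoføfø", "ofofofp", "oføfofo", "ofofofo", "øfofofo"]

print(sorted(WORDS, key=make_key(interleave=True)))
print(sorted(WORDS, key=make_key(interleave=False)))
```

The first call gives the interleaved list, with ofofofp last; the second gives the current default, with ofofofp second.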

> > 6) the Latin alphabet has a lot more than 26 letters in it. In this
> > age of the Universal Character Set, "most users" would do better to
> > get used to this than to be hobbled by older concepts.
>
> I agree with the general principle, but it has no bearing on the topic at
> hand.

Actually, it *does* have a significant bearing on the topic at
hand. For many of us, æ *is* a separate letter of the Latin alphabet,
and not a spelling equivalent of a+e. For many of us, ø *is* a
separate letter of the Latin alphabet, and not some o with an
unfamiliar diacritic on it. This is certainly the case for the
most commonly used IPA letters in wide use in dictionary
pronunciation guides and also in wide use in various language
orthographies based on IPA use. It is much less clear for the
various non-IPA letters with diacritic overlays, and in those
instances I am more open to arguments that adjusting them to
fit with particular, commonly used languages' ordering expectations
might be useful for the default table.

But any such changes would, I believe, be disruptive as well,
and the cost of the disruptions needs to be weighed against the
(arguably) incrementally better behavior which would result in
the default ordering.

--Ken


This archive was generated by hypermail 2.1.5 : Mon Jul 12 2004 - 11:19:55 CDT