Re: UCA and Russian letter Ё from Leo Broukhis on 2012-12-22 (Unicode Mail List Archive)

From: Leo Broukhis <leob_at_mailcom.com>
Date: Sat, 22 Dec 2012 19:40:58 -0800

On Fri, Dec 21, 2012 at 1:49 PM, Whistler, Ken <ken.whistler_at_sap.com> wrote:
> Leo Broukhis said:
>
>> Granted, not yet, but by itself the argument is invalid. Unicode
>> collation rules are descriptive;
>
> I'm not sure what you mean by that. UTS #10 is a *specification* of an algorithm, with various options for tailoring and parameterization which make it possible to accommodate various needs for particular cases. It is not intended as a descriptive mechanism.

What I meant is that the feature set that makes it flexible enough
was decided on descriptively, or, if you will, adaptively, following
the pre-existing collation traditions in various languages and/or
pre-existing standards.

> Perhaps you are referring to LDML, which includes a formal mechanism for describing a particular collation in terms of the default table and tailoring options and parameterization options of the UCA.

You're right, LDML may be a better application of the word "descriptive".

>> if, for example, a language happens to sort accents backwards, this
>> rule has to be - and is - accommodated despite its apparent
>> illogicality;
>
> Backwards accent secondary weighting was actually included primarily because of prior art in collation standards, because of the need to be able to synchronize the UCA algorithm with ISO 14651, and because it makes it easier to explain how folks can implement versions of multi-level collation which can pass the conformance tests of the Canadian sorting standard, etc.
>
>> along the same lines, if a language happens to make a distinction
>> discussed in this thread, it has to be accommodated just as well.
>
> No, I don't think so.

My question can be construed as a hypothetical: had the described
Ё-collation been prior art in collation standards by the time LDML
and the UCA were developed, how different would they have been?
I'm hoping for an answer "not at all" or "very little", and "here's
how it could have been implemented: ...".

> It is rather easy to come up with distinctions or collation requirements which simply cannot be accommodated within the intended bounds of the UCA. For example, sorting all numerical expressions mixed with text strictly by their numeric values, or sorting all (or some specified list) of abbreviations as if they were spelled out, and so forth.

> Many lexicographical ordering rules cannot be fully accommodated within the context of the UCA algorithm, which is a multilevel *string comparison* specification, and not a dictionary ordering specification.

That is true in general if rules happen to involve semantics, but
we're discussing a formal rule here.
Imagine that the backward accents feature was missing from LDML, e.g.
because it was an emerging trend rather than a standard way of
collation at the time of formalizing LDML, thus not included in LDML.
Would you have said the same about it today if someone had asked about
supporting it? If not, why not?
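
To make "formal rule" concrete: backward accents can be sketched as a
two-level key whose secondary weights are compared starting from the end
of the string. The Python below is only a toy illustration, not the UCA
and not any real library's API; the name toy_sort_key and the weights
(raw code points of the combining marks) are made up for the example.

```python
import unicodedata


def toy_sort_key(word, backward_secondary=False):
    """Two-level toy key: base letters first, then accents.

    Hypothetical helper for illustration only. Real UCA weights come
    from the DUCET table, not from code points as used here.
    """
    primary = []    # one entry per base letter
    secondary = []  # accent weight per base letter, 0 = unaccented
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if secondary:
                # Attach the accent to the preceding base letter.
                secondary[-1] = ord(ch)
        else:
            primary.append(ch.lower())
            secondary.append(0)
    if backward_secondary:
        # Compare accents from the end of the string instead.
        secondary.reverse()
    return (primary, secondary)


words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=toy_sort_key)
backward = sorted(words, key=lambda w: toy_sort_key(w, backward_secondary=True))
print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

The only difference between the two orders is the direction in which the
secondary level is read; nothing in the rule depends on what the words mean.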

>>
>> My question is as follows: does UCA have to be modified (e.g. by
>> adding another bit flag "word-initial primary" next to the existing
>> "backward secondary") to support the feature if it were to be
>> implemented, or is there a way to achieve the "new Russian online
>> collation" within the existing UCA without modifying the strings to
>> be sorted before the application of the algorithm?
>
> I don't think there is any out-of-the-box way to use UCA so that an implementation would automatically recognize a word boundary context and weight characters conditionally based on that context. So no, I don't think you could get an implementation to do that without first marking up text with additional characters to indicate word boundaries and then tailoring the weight table to weight sequences including that markup accordingly.

My question was narrower: assuming that the strings being compared are
words, could it be supported without any markup?
(NB: the "backward accents" feature is also, strictly speaking, word-based.)
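
As a sketch of what I have in mind when the strings are single words:
the Python below assumes one particular reading of the "word-initial
primary" flag, namely that Ё gets its own primary weight (between Е and
Ж) only as the first letter of the word, and is a secondary variant of Е
everywhere else. That reading, the name toy_russian_key, and the weights
are all assumptions made for illustration. This is not the UCA, and it
does not show that the UCA can be tailored this way.

```python
def toy_russian_key(word, word_initial_primary=False):
    """Two-level toy key for a single Russian word.

    Hypothetical helper: primary weights are (letter, tiebreak) pairs,
    relying on the fact that а..я are in alphabetical order in Unicode
    once ё is handled separately.
    """
    primary = []
    secondary = []
    for i, ch in enumerate(word.lower()):
        if ch == "ё":
            if word_initial_primary and i == 0:
                # Assumed rule: initial ё is a separate letter after е.
                primary.append(("е", 1))
                secondary.append(0)
            else:
                # Elsewhere ё differs from е only at the secondary level.
                primary.append(("е", 0))
                secondary.append(1)
        else:
            primary.append((ch, 0))
            secondary.append(0)
    return (primary, secondary)


words = ["ёж", "ель", "еж", "жук", "елка", "ёлка"]
print(sorted(words, key=toy_russian_key))
# ['еж', 'ёж', 'елка', 'ёлка', 'ель', 'жук']
print(sorted(words, key=lambda w: toy_russian_key(w, word_initial_primary=True)))
# ['еж', 'елка', 'ель', 'ёж', 'ёлка', 'жук']
```

No markup is added to the strings; the position test (i == 0) plays the
role that reversing the secondary level plays for backward accents.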

> But there is another possible sense of the question, "does UCA have to be modified... to support...", i.e. is the UTC somehow required to augment the algorithm to support some particular kind of behavior for a particular language's sorting rules, just because someone has turned up particular odd behavior. And I think the answer to that is clearly no. Oh, and by the way, I don't think LDML must (or should) be augmented to enable it to describe any and all lexicographical ordering practices, either. That isn't the function of LDML.

In other words, after adoption, LDML became prescriptive in the sense
of "don't even think of inventing any sorting rules that cannot be
described by LDML as it stands; we're not going to augment it". The
Quebecois were very lucky, then.

Leo
Received on Sat Dec 22 2012 - 21:47:34 CST