Re: UCA and Russian letter Ё from Leif H Silli on 2012-12-23 (Unicode Mail List Archive)

From: Leif H Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Sun, 23 Dec 2012 16:14:47 +0100

Ken,

A basic question: does the UCA algorithm consider the Russian Ye and the
Russian Yo as equal with regard to sort order? Or is it not meant to solve
that issue?

Leif Halvard Silli
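
A small illustration of what this question turns on, offered as a sketch rather than as the UCA itself. In the Unicode Character Database, ё has a canonical decomposition into е plus a combining diaeresis, and as far as I know the default UCA table follows that: the two letters share a primary weight and differ only at the secondary level. The helper `toy_sort_key` below is hypothetical. It uses code points in place of real collation weights and only mimics the two-level idea.

```python
import unicodedata

YO = "\u0451"  # CYRILLIC SMALL LETTER IO (ё)

# Canonical decomposition from the Unicode Character Database:
# ё = е (U+0435) + COMBINING DIAERESIS (U+0308).
print(unicodedata.decomposition(YO))  # 0435 0308

def toy_sort_key(s):
    """Two-level key: base letters first, combining marks as a tie-breaker.

    Hypothetical helper: code points stand in for real collation weights,
    so this is not the DUCET and not a conformant UCA implementation.
    """
    nfd = unicodedata.normalize("NFD", s.lower())
    # Level 1: base letters only, combining marks ignored.
    primary = tuple(ord(c) for c in nfd if not unicodedata.combining(c))
    # Level 2: one entry per character, non-zero only for combining marks.
    secondary = tuple(ord(c) if unicodedata.combining(c) else 0 for c in nfd)
    return (primary, secondary)

words = ["ёж", "ель", "ежи", "ёлка", "еж"]
print(sorted(words))                    # plain code point order: all ё after all е
print(sorted(words, key=toy_sort_key))  # ё interleaved with е, еж before ёж
```

The shortcut of deriving weights from the decomposition is an assumption made for brevity. A real weight table makes letter-by-letter decisions that this does not capture.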

------- Original message -------
> From: Whistler, Ken <ken.whistler_at_sap.com>
> To: leob_at_mailcom.com, jkorpela_at_cs.tut.fi
> Cc: unicode_at_unicode.org, ken.whistler_at_sap.com
> Sent: 21/12/'12, 22:49
>
> Leo Broukhis said:
>
>> Granted, not yet, but by itself the argument is invalid. Unicode
>> collation rules are descriptive;
>
> I'm not sure what you mean by that. UTS #10 is a *specification* of an
> algorithm, with various options for tailoring and parameterization which
> make it possible to accommodate various needs for particular cases. It is
> not intended as a descriptive mechanism.
>
> Perhaps you are referring to LDML, which includes a formal mechanism for
> describing a particular collation in terms of the default table and
> tailoring options and parameterization options of the UCA.
>
>> if, for example, a language happens to sort accents backwards, this
>> rule has to be - and is - accommodated despite its apparent
>> illogicality;
>
> Backwards accent secondary weighting was actually included primarily
> because of prior art in collation standards, because of the need to be
> able to synchronize the UCA algorithm with ISO 14651, and because it makes
> it easier to explain how folks can implement versions of multi-level
> collation which can pass the conformance tests of the Canadian sorting
> standard, etc.
>
>> along the same lines, if a language happens to make a distinction
>> discussed in this thread, it has to be accommodated just as well.
>
> No, I don't think so.
>
> It is rather easy to come up with distinctions or collation requirements
> which simply cannot be accommodated within the intended bounds of the UCA.
> For example, sorting all numerical expressions mixed with text strictly by
> their numeric values, or sorting all (or some specified list of)
> abbreviations as if they were spelled out, and so forth.
>
> Many lexicographical ordering rules cannot be fully accommodated within
> the context of the UCA algorithm, which is a multilevel *string
> comparison* specification, and not a dictionary ordering specification.
>
>>
>> My question is as follows: does UCA have to be modified (e.g. by
>> adding another bit flag "word-initial primary" next to the existing
>> "backward secondary") to support the feature if it were to be
>> implemented, or is there a way to achieve the "new Russian online
>> collation" within the existing UCA without modifying the strings to
>> be sorted before the application of the algorithm?
>
> I don't think there is any out-of-the-box way to use UCA so that an
> implementation would automatically recognize a word boundary context and
> weight characters conditionally based on that context. So no, I don't
> think you could get an implementation to do that without first marking up
> text with additional characters to indicate word boundaries and then
> tailoring the weight table to weight sequences including that markup
> accordingly.
>
> This is actually derived trivially from the fact that UCA knows nothing
> whatsoever about word boundaries. At core, it is just a mechanism to take
> a string input and provide an output vector of collation weights. You
> would have to hook it up to a text segmentation algorithm to even
> identify "words", and then that text segmentation algorithm would itself
> have to be tailored and tuned to whatever language you had in mind,
> because the criteria for identifying "words" will vary from language to
> language, and even orthography to orthography.
>
> But there is another possible sense of the question, "does UCA have to be
> modified... to support...", i.e. is the UTC somehow required to augment
> the algorithm to support some particular kind of behavior for a particular
> language's sorting rules, just because someone has turned up particular
> odd behavior. And I think the answer to that is clearly no. Oh, and by the
> way, I don't think LDML must (or should) be augmented to enable it to
> describe any and all lexicographical ordering practices, either. That
> isn't the function of LDML.
>
> --Ken
>
>
>
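
The markup-plus-tailoring approach Ken describes can be sketched roughly as follows. Everything here is an assumption made for illustration: the rule itself (word-initial ё treated as its own letter right after е, ё elsewhere only a secondary variant of е), the private-use marker `MARK`, the naive regex segmentation, and the helpers `mark_word_starts` and `toy_key`. None of it is a UCA, LDML or ICU API, and code points stand in for real weights.

```python
import re
import unicodedata

MARK = "\uE000"              # private-use character, hypothetical word-start marker
YE, YO = "\u0435", "\u0451"  # е, ё

def mark_word_starts(text):
    """Naive segmentation: prefix every run of word characters with MARK."""
    return re.sub(r"\w+", lambda m: MARK + m.group(0), text)

def toy_key(text):
    """Two-level key with one 'tailoring': MARK + ё acts as a contraction."""
    s = mark_word_starts(unicodedata.normalize("NFC", text.lower()))
    primary, secondary = [], []
    i = 0
    while i < len(s):
        if s.startswith(MARK + YO, i):
            # Word-initial ё: its own primary weight, just after е.
            primary.append((ord(YE), 1))
            secondary.append(0)
            i += 2
        elif s[i] == MARK:
            # Marker before any other letter is ignorable.
            i += 1
        elif s[i] == YO:
            # Non-initial ё: same primary as е, secondary difference only.
            primary.append((ord(YE), 0))
            secondary.append(1)
            i += 1
        else:
            primary.append((ord(s[i]), 0))
            secondary.append(0)
            i += 1
    return (tuple(primary), tuple(secondary))

words = ["ёж", "ель", "зелёный", "еж", "зеленый", "ёлка"]
print(sorted(words, key=toy_key))
```

The point of the sketch is only that the conditional weighting lives in the preprocessing step and in the weight for the marked sequence. The comparison itself still sees nothing but a string, which is why the segmentation would need tuning per language.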
Received on Sun Dec 23 2012 - 09:24:10 CST
