RE: UCA and Russian letter Ё

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Fri, 21 Dec 2012 21:49:09 +0000

Leo Broukhis said:

> Granted, not yet, but by itself the argument is invalid. Unicode
> collation rules are descriptive;

I'm not sure what you mean by that. UTS #10 is a *specification* of an algorithm, with various options for tailoring and parameterization which make it possible to accommodate various needs for particular cases. It is not intended as a descriptive mechanism.

Perhaps you are referring to LDML, which includes a formal mechanism for describing a particular collation in terms of the default table and tailoring options and parameterization options of the UCA.

> if, for example, a language happens to sort accents backwards, this
> rule has to be - and is - accommodated despite its apparent
> illogicality;

Backwards accent secondary weighting was actually included primarily because of prior art in collation standards, because of the need to be able to synchronize the UCA algorithm with ISO 14651, and because it makes it easier to explain how folks can implement versions of multi-level collation which can pass the conformance tests of the Canadian sorting standard, etc.
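For readers unfamiliar with the option, here is a minimal sketch of what a backwards secondary level does to a two-level comparison. The weight table and the function name toy_sort_key are made up for illustration; they are not DUCET values or any library's API.

```python
# Toy (primary, secondary) weights. Hypothetical values, not taken from DUCET.
TOY_WEIGHTS = {
    "c": (1, 0),
    "o": (2, 0),
    "ô": (2, 1),  # same primary as "o", differs only at the secondary level
    "t": (3, 0),
    "e": (4, 0),
    "é": (4, 1),  # same primary as "e", differs only at the secondary level
}


def toy_sort_key(text, backward_secondary=False):
    """Build a two-level key: all primaries first, then all secondaries."""
    primaries = [TOY_WEIGHTS[ch][0] for ch in text]
    secondaries = [TOY_WEIGHTS[ch][1] for ch in text]
    if backward_secondary:
        # Compare accents starting from the end of the string.
        secondaries.reverse()
    return (primaries, secondaries)


words = ["cote", "côte", "coté", "côté"]
forward = sorted(words, key=toy_sort_key)
backward = sorted(words, key=lambda w: toy_sort_key(w, backward_secondary=True))
```

With forward secondaries the first accent in the string decides the order; with backward secondaries the last one does, which is the behavior the Canadian standard expects.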

> along the same lines, if a language happens to make a distinction
> discussed in this thread, it has to be accommodated just as well.

No, I don't think so.

It is rather easy to come up with distinctions or collation requirements which simply cannot be accommodated within the intended bounds of the UCA: for example, sorting all numerical expressions mixed with text strictly by their numeric values, or sorting all abbreviations (or some specified list of them) as if they were spelled out, and so forth.
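As a rough illustration of the numeric case, the sketch below handles only runs of decimal digits, and does so in a key function that sits outside any string comparison. The name numeric_aware_key is hypothetical, and the approach deliberately ignores signs, decimal points, and spelled-out numbers.

```python
import re


def numeric_aware_key(text):
    """Split text into digit and non-digit runs; compare digit runs by value."""
    # With a capturing group, re.split puts the digit runs at odd indices.
    parts = re.split(r"(\d+)", text)
    return [
        (1, int(part), "") if index % 2 else (0, 0, part)
        for index, part in enumerate(parts)
    ]


names = ["file 10", "file 9", "file 2"]
plain = sorted(names)                             # character-by-character order
by_value = sorted(names, key=numeric_aware_key)   # digit runs compared as numbers
```

The numeric interpretation happens before any comparison takes place, which is the point: it is extra processing layered around the comparison, not something a weight table expresses.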

Many lexicographical ordering rules cannot be fully accommodated within the context of the UCA algorithm, which is a multilevel *string comparison* specification, and not a dictionary ordering specification.

>
> My question is as follows: does UCA have to be modified (e.g. by
> adding another bit flag "word-initial primary" next to the existing
> "backward secondary") to support the feature if it were to be
> implemented, or is there a way to achieve the "new Russian online
> collation" within the existing UCA without modifying the strings to
> be sorted before the application of the algorithm?

I don't think there is any out-of-the-box way to use UCA so that an implementation would automatically recognize a word boundary context and weight characters conditionally based on that context. So no, I don't think you could get an implementation to do that without first marking up text with additional characters to indicate word boundaries and then tailoring the weight table to weight sequences including that markup accordingly.

This follows trivially from the fact that UCA knows nothing whatsoever about word boundaries. At core, it is just a mechanism to take a string input and provide an output vector of collation weights. You would have to hook it up to a text segmentation algorithm to even identify "words", and then that text segmentation algorithm would itself have to be tailored and tuned to whatever language you had in mind, because the criteria for identifying "words" will vary from language to language, and even orthography to orthography.
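The markup-then-tailor approach might look roughly like the sketch below. It rests on several assumptions: that the rule in question makes Ё a primary difference from Е only at the start of a word, that a crude regex test is good enough to find word starts, and that U+E000 is free to use as a marker. The marker, the weight table, and the function names are all hypothetical.

```python
import re

WORD_INITIAL = "\uE000"  # hypothetical private-use marker chosen for this sketch

# Toy primary weights with gaps of 2, leaving room for a tailored weight.
ALPHABET = "абвгдежзийклмнопрстуфхцчшщъыьэюя"
PRIMARY = {ch: 2 * i for i, ch in enumerate(ALPHABET)}


def mark_word_initial_yo(text):
    """Insert the marker before any Ё that is not preceded by a word character."""
    return re.sub(r"(?<!\w)(?=[ёЁ])", WORD_INITIAL, text)


def toy_key(text):
    """Two-level key over the marked-up, lowercased string."""
    marked = mark_word_initial_yo(text.lower())
    primaries, secondaries = [], []
    i = 0
    while i < len(marked):
        if marked.startswith(WORD_INITIAL + "ё", i):
            # Tailored sequence: marker + ё gets its own primary, after е.
            primaries.append(PRIMARY["е"] + 1)
            secondaries.append(0)
            i += 2
        elif marked[i] == "ё":
            # Elsewhere ё shares the primary of е and differs at the secondary level.
            primaries.append(PRIMARY["е"])
            secondaries.append(1)
            i += 1
        else:
            # Characters outside the toy table sort first (assumption of this sketch).
            primaries.append(PRIMARY.get(marked[i], -1))
            secondaries.append(0)
            i += 1
    return (primaries, secondaries)


words = ["ёж", "еда", "жук", "мёд", "мел", "ель"]
ordered = sorted(words, key=toy_key)
```

Here "ёж" sorts after every word beginning with "е", while "мёд" still sorts before "мел" because the word-internal ё is ignored at the primary level. All of the context sensitivity lives in the markup step, not in the comparison.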

But there is another possible sense of the question, "does UCA have to be modified... to support...", i.e. is the UTC somehow required to augment the algorithm to support some particular kind of behavior for a particular language's sorting rules, just because someone has turned up particular odd behavior. And I think the answer to that is clearly no. Oh, and by the way, I don't think LDML must (or should) be augmented to enable it to describe any and all lexicographical ordering practices, either. That isn't the function of LDML.

--Ken
Received on Fri Dec 21 2012 - 15:50:31 CST
