RE: UCA and Russian letter Ё from Whistler, Ken on 2012-12-26 (Unicode Mail List Archive)

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Wed, 26 Dec 2012 19:18:05 +0000

Leo asked:

> My question was narrower: assuming that the strings being compared are
> words, could it be supported without any markup?

... where "it" refers to conditional weighting based on the (identified) word boundary. And the answer to that is no, unless the word boundary is explicitly indicated with some kind of markup character. The sequence of that markup character plus the target character of interest (in this case Russian Yo) would then have to be given a tailored contraction in the weight table, one which weights it differently from any Russian Yo not in that particular contraction sequence.
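
As a rough illustration of the contraction approach described above, here is a toy sketch in Python. It is not UCA or ICU code: the marker character, the numeric weights, and the function name are all invented for the example, and only primary weights are modeled. The point is merely that the weighting step never looks for a word boundary; it only sees whether the marker-plus-Yo sequence is present in the input string.

```python
# Toy model of a tailored contraction. Everything here is hypothetical:
# the marker, the weights, and the table are made up for illustration.
MARK = "|"        # assumed markup character that a caller inserts at word starts
IE = "\u0435"     # Cyrillic small letter ie
YO = "\u0451"     # Cyrillic small letter io (Yo)
ZHE = "\u0436"    # Cyrillic small letter zhe

TOY_PRIMARY = {
    IE: 10,
    YO: 10,           # plain Yo: same toy primary weight as ie
    MARK + YO: 11,    # contraction: marker + Yo gets its own primary weight
    ZHE: 12,
}

def toy_primary_weights(text):
    """Assign toy primary weights, preferring the longest table match."""
    weights = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in TOY_PRIMARY:
            weights.append(TOY_PRIMARY[pair])   # contraction matched
            i += 2
        elif text[i] in TOY_PRIMARY:
            weights.append(TOY_PRIMARY[text[i]])
            i += 1
        else:
            i += 1   # anything else (including a bare marker) is ignored here
    return weights

marked = toy_primary_weights(MARK + YO + ZHE)    # Yo explicitly marked
unmarked = toy_primary_weights(YO + ZHE)         # same letters, no marker
medial = toy_primary_weights(ZHE + YO)           # Yo not at the start
```

In this sketch, `marked` differs from `unmarked` only because the caller put the marker into the string. Without the marker, a string-initial Yo is weighted exactly like any other Yo.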

> (NB that the "backward accents" feature is also, strictly speaking, word-based.)

A correction here. The backwards accents feature in UCA is *not* word-based. As for any other string being compared via the UCA mechanism, weights are simply assigned to *all* characters in the string. The difference for weighting when using the backwards accents feature is that secondary weight significance in comparison is calculated from the end of the string, instead of the start of the string. This works when comparing single words, but it is applied indifferently to entire strings. And it gets the correct results, by the way. Work it out: you take two strings containing entire phrases in French, which only differ by accents on some word in the middle of the string. The only difference in weights assigned will be for the secondary weights for those accents, and if you use the backwards accents feature they will be calculated from the end of the string.
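
The "work it out" exercise can be sketched in a few lines of Python. This is a deliberately simplified two-level model, not the UCA itself: the weights are invented (1 for every base character, the code point for every combining mark), the function name is hypothetical, and real implementations use the DUCET or a tailoring. It only shows that reversing the secondary weights operates on the whole string and never consults word boundaries.

```python
import unicodedata

def toy_sort_key(text, backwards=False):
    """Build a (primary, secondary) key for a whole string.

    Toy assumptions: each base character contributes itself as a primary
    weight and the value 1 as a secondary weight; each combining mark
    contributes only a secondary weight (its code point).
    """
    primary = []
    secondary = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            secondary.append(ord(ch))     # accent: secondary weight only
        else:
            primary.append(ch.casefold())
            secondary.append(1)           # common secondary weight
    if backwards:
        secondary.reverse()               # compare accents from the string end
    return (tuple(primary), tuple(secondary))

words = ["côté", "coté", "côte", "cote"]
phrases = ["la " + w + " ici" for w in words]

forward = sorted(words, key=toy_sort_key)
backward = sorted(words, key=lambda s: toy_sort_key(s, backwards=True))
backward_phrases = sorted(phrases, key=lambda s: toy_sort_key(s, backwards=True))

print(forward)           # cote, coté, côte, côté
print(backward)          # cote, côte, coté, côté
print(backward_phrases)  # same relative order as the bare words
```

The phrases differ only by accents on the middle word, and with the secondary weights reversed over the entire string they come out in the same relative order as the bare words do.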

Once again, let me emphasize: the UCA algorithm per se simply has no concept at all of word boundaries. It applies strictly and only to string input, which could contain *anything*.

> In other words, after adoption, LDML became prescriptive in the sense
> "don't even think of inventing any sorting rules that cannot be
> described by LDML as it stands; we're not going to augment it". The
> Quebecois were very lucky, then.

No, I think that is an incorrect characterization of the situation for LDML. It can be, and at times has been, augmented for new parameterizations which make sense. Those changes, however, have to make sense within the overall context of the way the multilevel weighting and string comparison algorithm works. The basic issue here is that UCA is a string weighting and comparison algorithm, but does *not* have any kind of text segmentation logic built in (whether to identify words, syllables, or any other language-specific segment). So it simply does not make sense to expect LDML to be augmented to describe collation behavior that is conditional on segmentation boundaries. That is simply outside the scope of UCA and LDML. It isn't outside the scope of the bigger issue of sorting and collation behavior in general, of course -- it is just outside the scope of what UCA addresses.

Incidentally, for the record, backwards weighting of accents for French doesn't have anything particular to do with Quebecois. It is a feature of *some* influential French dictionary lexicographical ordering traditions -- in France -- and not in others.

--Ken
Received on Wed Dec 26 2012 - 13:20:55 CST
