RE: UCA and Russian letter Ё from Whistler, Ken on 2012-12-26 (Unicode Mail List Archive)

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Wed, 26 Dec 2012 18:57:56 +0000

The UCA itself has no "opinion" on this issue. It is simply a specification of *how* to compare strings at multiple levels, given a multi-level collation weight table.

The UCA *does* have a default behavior, of course, based on the DUCET table. And the DUCET table puts all Unicode characters in *some* order, so there is a default answer for Russian Ye and Yo, as there is for everything else. The current default answer for UCA 6.2 (omitting the 4th-level weights, which are not needed here) is:

0435 ; [.19D9.0020.0002] # CYRILLIC SMALL LETTER IE
0450 ; [.19D9.0020.0002][.0000.0035.0002] # CYRILLIC SMALL LETTER IE WITH GRAVE
0451 ; [.19D9.0020.0002][.0000.0047.0002] # CYRILLIC SMALL LETTER IO

So by default, DUCET weights Ye with grave as a secondary difference from Ye, and also weights Yo as a secondary difference from Ye. (The secondary weights can be seen in the second collation elements for those letters, the 0035 and 0047 weights, respectively.)

Those weights would be applied to *all* instances of Ye and Yo in a string, because there is no concept in the algorithm of conditional weighting in particular positions in a word.
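
To make the effect of those three entries concrete, here is a minimal Python sketch that builds simplified three-level sort keys from them. The collation elements are copied from the lines quoted above; everything else (the names `DUCET_EXCERPT` and `sort_key`, and the key layout with a zero separator between levels) is an illustrative assumption, and the sketch leaves out normalization, variable weighting, contractions and the rest of the full algorithm.

```python
# Collation elements copied from the three DUCET 6.2 entries quoted above,
# as (primary, secondary, tertiary) tuples. 4th-level weights are omitted.
IE = "\u0435"        # CYRILLIC SMALL LETTER IE
IE_GRAVE = "\u0450"  # CYRILLIC SMALL LETTER IE WITH GRAVE
IO = "\u0451"        # CYRILLIC SMALL LETTER IO

DUCET_EXCERPT = {
    IE: [(0x19D9, 0x0020, 0x0002)],
    IE_GRAVE: [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0035, 0x0002)],
    IO: [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0047, 0x0002)],
}


def sort_key(text, table=DUCET_EXCERPT, levels=3):
    """Build a simplified sort key (assumed layout, not the full UCA).

    All non-zero primary weights come first, then a 0 separator, then all
    non-zero secondary weights, and so on up to `levels`.
    """
    elements = [ce for ch in text for ce in table[ch]]
    key = []
    for level in range(levels):
        if level > 0:
            key.append(0)  # level separator, lower than any real weight
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


if __name__ == "__main__":
    # The weights apply at every position in the string, not just the first.
    for s in sorted([IE + IO, IE + IE, IO + IE], key=sort_key):
        print(s, [hex(w) for w in sort_key(s)])
```

With only primary weights considered (`levels=1`) the three letters compare equal; the difference between them shows up only once the secondary level is included.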

But it is important to note also that those weights are just defaults, and the concept here is that they are set up to be defaults for the Cyrillic script as a whole, and not as defaults for Russian language data in particular. The defaults were chosen so that any particular language written with the Cyrillic script (including Russian) doesn't get *too* screwed up if strings in it are sorted using the default table, but the default is not intended to be fully correct for *any* particular language, including Russian. Instead, that is what tailoring (using LDML or some other mechanism) is aimed at.

So I would say that UCA per se is not meant to "solve the issue" of how to collate Russian Ye and Yo. It is meant to provide a mechanism for tailoring weights for characters to provide appropriate collation orders for particular languages.
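
As a rough sketch of what tailoring does, the Python below compares a default-style table with a hypothetical tailored one in which Yo becomes a separate letter sorting after Ye at the primary level. This is only one possible tailoring, not a statement of what is correct for Russian. The Ye and Yo default weights are the ones quoted above; the weights for ZHE and EL and the tailored primary 0x19DA are made up for the illustration and are NOT real DUCET values. The names `DEFAULT`, `TAILORED` and `sort_key` are likewise assumptions.

```python
IE, IO, ZHE, EL = "\u0435", "\u0451", "\u0436", "\u043b"

# Default-style table. IE and IO use the DUCET 6.2 weights quoted above.
# ZHE and EL use hypothetical primaries chosen only so that IE < ZHE < EL.
DEFAULT = {
    IE: [(0x19D9, 0x0020, 0x0002)],
    IO: [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0047, 0x0002)],
    ZHE: [(0x19E0, 0x0020, 0x0002)],  # hypothetical weight
    EL: [(0x1A00, 0x0020, 0x0002)],   # hypothetical weight
}

# Hypothetical tailoring: give IO its own primary weight between IE and ZHE,
# so it is a primary difference from IE rather than a secondary one.
TAILORED = dict(DEFAULT)
TAILORED[IO] = [(0x19DA, 0x0020, 0x0002)]  # hypothetical weight


def sort_key(text, table):
    """Simplified three-level sort key (assumed layout, not the full UCA)."""
    elements = [ce for ch in text for ce in table[ch]]
    key = []
    for level in range(3):
        if level > 0:
            key.append(0)  # level separator
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


def collate(strings, table):
    """Sort strings using the given weight table."""
    return sorted(strings, key=lambda s: sort_key(s, table))


if __name__ == "__main__":
    data = [IO + ZHE, IE + EL, IE + ZHE]
    print("default: ", collate(data, DEFAULT))
    print("tailored:", collate(data, TAILORED))
```

Under the default-style table, the string starting with Yo sorts next to its Ye counterpart and before IE+EL; under the tailored table it moves after every string starting with Ye. The comparison code is the same in both cases, only the table changes.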

However, some languages require collation rules that depend on boundary conditions, and the algorithm by itself cannot handle those. But marking up the text to *indicate* boundaries explicitly, and then tailoring the weights of the characters used for that markup, can result in strings which then *could* be compared using UCA to provide the expected results. That kind of markup could be done by a preprocessing step, which could, for example, detect word or syllabic boundaries (according to particular language and orthographic rules) and then pass the marked-up text to the string comparison step.
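
A sketch of that preprocessing idea, in Python, might look like the following. Everything in it is an assumption made for illustration: the private-use character U+E000 as the boundary marker, the invented rule that a word-initial Yo gets its own primary weight, the weights for ZHE and EL, the tailored primary 0x19DA, and the function names. The marker plus Yo is treated as a contraction in the weight table, and the marker on its own is given no collation elements, so it is ignored everywhere else.

```python
import re

IE, IO, ZHE, EL = "\u0435", "\u0451", "\u0436", "\u043b"
BOUNDARY = "\uE000"  # hypothetical private-use character marking a word start

TABLE = {
    BOUNDARY: [],                               # marker alone: ignorable
    BOUNDARY + IO: [(0x19DA, 0x0020, 0x0002)],  # hypothetical: word-initial Yo
    IE: [(0x19D9, 0x0020, 0x0002)],             # DUCET 6.2 weights quoted above
    IO: [(0x19D9, 0x0020, 0x0002), (0x0000, 0x0047, 0x0002)],
    ZHE: [(0x19E0, 0x0020, 0x0002)],            # hypothetical weight
    EL: [(0x1A00, 0x0020, 0x0002)],             # hypothetical weight
}


def mark_word_starts(text):
    """Preprocessing step: insert BOUNDARY before each word-initial character."""
    return re.sub(r"(?<!\w)(?=\w)", BOUNDARY, text)


def collation_elements(text, table=TABLE):
    """Map text to collation elements, preferring the longest table match."""
    longest = max(len(k) for k in table)
    out, i = [], 0
    while i < len(text):
        for n in range(longest, 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.extend(table[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no weights for {text[i]!r}")
    return out


def sort_key(text, table=TABLE):
    """Simplified three-level sort key (assumed layout, not the full UCA)."""
    elements = collation_elements(text, table)
    key = []
    for level in range(3):
        if level > 0:
            key.append(0)  # level separator
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


def marked_key(text):
    """Sort key of the text after the boundary-marking preprocessing step."""
    return sort_key(mark_word_starts(text))


if __name__ == "__main__":
    data = [IO + ZHE, IE + EL, EL + IO + ZHE, EL + IE + EL, EL + IE + ZHE]
    print("unmarked:", sorted(data, key=sort_key))
    print("marked:  ", sorted(data, key=marked_key))
```

The string comparison itself knows nothing about word positions. The position-dependent behavior comes entirely from the preprocessing step and from the table entry for the marker sequence: a Yo in the middle of a string keeps its secondary difference from Ye, while a Yo right after the marker is weighted as a separate letter.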

But in any case, it isn't the job of UCA to arbitrate what the correct or expected result for comparison in a particular language is.

--Ken

> A basic question: does the UCA algorithm consider the Russian Ye and the
> Russian Yo as equal with regard to sort order? Or is it not meant to solve
> that issue?
>
> Leif Halvard Silli
Received on Wed Dec 26 2012 - 13:03:47 CST
