Re: UCA and Russian letter Ё from Leif Halvard Silli on 2012-12-21 (Unicode Mail List Archive)

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Fri, 21 Dec 2012 20:05:21 +0100

Leo Broukhis, Fri, 21 Dec 2012 08:57:11 -0800:
> On Fri, Dec 21, 2012 at 4:56 AM, Leif Halvard Silli wrote:
>>
>> You say that the difference is primary in the beginning of a word but
>> elsewhere secondary. And yes, that orthographic dictionary that you
>> link to above, looks as you describe.
>>
>> However, in reality, the difference is secondary - if that is the right
>> word - even as the first letter in a word. Wikipedia has the following
>> example: едок > ёж > ездит.[1] And, for instance the word ёлка could
>> also be written елка.
>
>> [1] <http://en.wikipedia.org/wiki/%d0%81#Russian>
>
> Wikipedia's example is sadly unsourced, unlike mine.

My Moscow Russian-Norwegian dictionary from 1987 and my Pocket Oxford
Russian Dictionary from 2003 both list words beginning with Ё and Е
under the same heading, namely the letter Е. The Russian Wikipedia
article on the letter Ё also says that this is how sorting should be
done.
<http://ru.wikipedia.org/wiki/%d0%81#.D0.A1.D0.BE.D1.80.D1.82.D0.B8.D1.80.D0.BE.D0.B2.D0.BA.D0.B0>
And the article lists xindy as one application that handles this.
<http://en.wikipedia.org/wiki/Xindy>

>> Hence I would argue that the dictionary you linked to above considers
>> the difference to *always* be secondary. It is just that the dictionary
>> applies the sorting algorithm to a collection where the words that
>> begin with the letter Ё have been separated from words that begin with
>> the letter Е.
>
> Isn't that notionally the same as having the difference primary for
> the first letter?

Input from a collation expert would be welcome. However, this is how I
see it:

Should one expect such an algorithm to write the phone book on one’s
behalf? Or to write the dictionary? I think that would be an
unrealistic expectation. E.g. a dictionary or phone book has precise
rules for how the words are written and grouped before they are sorted.

The fact is, again, that ёлка - "in the wild" - can be written either
ёлка or елка. So if you assume that the algorithm should only deal with
ёлка, then you are also saying that you want the algorithm to deal with
words that have been prepared for sorting. Thus you are talking about a
well-prepared text where ёлка is always written ёлка and never елка.

While not a definitive "proof", I may also mention that the CSS list
module defines an enumeration style based on the Russian alphabet, in
which the ё is excluded.

http://www.w3.org/TR/css3-lists/#lower-russian

>>> A cursory scan of the UCA doesn't reveal if that's implementable, and
>>> experiments in a fairly fresh Linux Mint yield either
>>> ель < ёлка < тель < тёлка or ель < тель < тёлка < ёлка depending on
>>> the LANG setting (en_US works better than ru_RU).
>>
>> (Both examples consider the difference primary, but the last
>> example is incorrect, as ёлка follows after тёлка - which is
>> incorrect from every angle, except from the angle of the number of the
>> letter inside Unicode.)
>
> Right. And, ironically, the [en] collation is the correct one.

Perhaps this bug is because the Russian localizers failed to get it the
way they wanted: Full alignment of Е and Ё? ;-)
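
For illustration, the second ordering quoted above is the one that plain
code point comparison produces. This is a minimal Python sketch, assuming
precomposed letters (ё as U+0451, which sits after я in Unicode); it makes
no claim about what the ru_RU locale data actually does internally:

```python
import unicodedata

# Normalise to NFC so that ё is the single code point U+0451.
words = [unicodedata.normalize("NFC", w)
         for w in ["ёлка", "тёлка", "ель", "тель"]]

# Python compares strings code point by code point, so ё (U+0451)
# lands after every basic letter а..я (U+0430..U+044F).
by_code_point = sorted(words)
print(by_code_point)
```

Here a word-initial ё sorts after т, which is why ёлка ends up behind
тёлка.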

>>> Could someone tell if the UCA in its current form is able to support that?
>>
>> Is there not a need for 3 kinds of sorting? Namely: a) Е/Ё as always
>> distinct letters, b) Е/Ё as always non-distinct letters, c) Е/Ё as
>> non-distinct letters except when used as the first letter. (Note that
>> the last variant would only yield a correct result on collections of
>> words where a first-letter Ё is guaranteed to be rendered with a Ё.
>> Thus, if ёлка is written елка, then the result becomes incorrect.)
>
> We're not talking here about *words per se* that may or may not be
> rendered with a Ё, we're talking about letter sequences with Ё as a
> given. The dictionary order shows that all word-initial Ёs go after
> all word-initial Еs, but within a word the difference is secondary.
> For a set of letter sequences using canonical spelling of words, the
> collation algorithm should give their dictionary ordering, shouldn't
> it?

I believe the English Wikipedia article is pretty "canonical" when it
says that it can be done both ways - see the sources I pointed to above
for examples of sorting where the status as first letter doesn't matter.
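
To make the three kinds of sorting (a, b, c) discussed above concrete,
here is a rough Python sketch. It is not UCA and not any library's API;
the names ALPHABET, RANK, primary, secondary, key_distinct, key_folded
and key_initial are all made up for illustration, and it assumes
lower-case words containing only the 33 Russian letters, with ё
precomposed:

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

def primary(word, fold_from=0):
    """Letter ranks; a ё at position >= fold_from is ranked as е."""
    return tuple(
        RANK["е"] if ch == "ё" and i >= fold_from else RANK[ch]
        for i, ch in enumerate(word)
    )

def secondary(word):
    """Tie-breaker: marks where a ё occurs, so елка sorts before ёлка."""
    return tuple(ch == "ё" for ch in word)

def key_distinct(word):
    # (a) Е/Ё always distinct letters: never fold.
    return primary(word, fold_from=len(word))

def key_folded(word):
    # (b) Е/Ё always non-distinct: fold everywhere, ё only breaks ties.
    return (primary(word), secondary(word))

def key_initial(word):
    # (c) distinct only as first letter: fold from position 1 onwards.
    return (primary(word, fold_from=1), secondary(word))

words = ["ёлка", "ель", "тёлка", "тель", "ездит", "едок", "ёж", "елка"]
for key in (key_distinct, key_folded, key_initial):
    print(key.__name__, sorted(words, key=key))
```

With key_folded the output starts едок, ёж, ездит, which is the order in
the Wikipedia example. With key_initial, елка and ёлка end up in
different groups, which is the weakness of variant (c) when the spelling
is not guaranteed.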

I don't know why the dictionary you pointed to
<http://ru.wikisource.org/wiki/%d0%9e%d1%80%d1%84%d0%be%d0%b3%d1%80%d0%b0%d1%84%d0%b8%d1%87%d0%b5%d1%81%d0%ba%d0%b8%d0%b9_%d1%81%d0%bb%d0%be%d0%b2%d0%b0%d1%80%d1%8c_%d1%80%d1%83%d1%81%d1%81%d0%ba%d0%be%d0%b3%d0%be_%d1%8f%d0%b7%d1%8b%d0%ba%d0�>
has separated the words. It could be a technical limitation of
MediaWiki. Or it could be because those who initiated the project felt
it made the most sense. (It does make a lot of sense to me … he, he.)
But that dictionary is also "peculiar" in that it lists words that
begin with the letter "Ы". :-) It is typical to say that no words begin
with the letter Ы. :-) But the list managed to find some … (Including one
word that simply means "to say ы".) Neither of the dictionaries I
mentioned above has any words under the letter Ы. Even in the CSS list
module’s definition mentioned above, the ы is excluded.

> Re the linguistic PS: you're right, and that proves that an
> approximation to the proper collation using secondary ordering is
> preferred to an approximation using primary ordering.

Probably.

-- 
leif halvard silli

Received on Fri Dec 21 2012 - 13:07:19 CST
