Re: UTS #10: Unicode Collation Algorithm (UCA)

From: Vladimir Ivanov (iranorus@online.ru)
Date: Sun Jul 21 2002 - 01:23:28 EDT


In our discussion about the new Unicode Collation Algorithm presented on
www.unicode.org/reports/tr10/ one old question remains.

First of all, it is a great piece of work and deserves to be appreciated at
its true value. Among the most significant changes, "Notes on notes on
linguistic applicability" are mentioned. In paragraph "3.1 Linguistic
Features" it is written: "Linguistic requirements of collation are covered
in more detail in The Unicode Standard, Version 3.0".

In Section 5.17, Sorting and Searching, of the latter document we can find
interesting general reasoning about Culturally Expected Sorting. For
example, East Asian ideographs, phonetic sorting of Han characters, and the
German and Swedish ways of sorting 'a' with diaeresis are mentioned. I think
it is important to mention here the existence of two different types of
Arabic sorting as well.

As I have pointed out earlier, in the Arabic alphabet as used in Arab
countries the last three letters are He(h), Waw and Ye(h), whereas in the
countries and languages listed below the order is Waw, He(h) and Ye(h). The
latter ordering is applicable to:

1) Iran, Afghanistan and Pakistan, where the Arabic script is official.
Overall population ca. 230 million.

2) Kurdish in Iran, Iraq, Turkey and some regions of the former Soviet
Union.

3) Tajikistan, where the Arabic script is semi-official (at least all the
pupils and students study it).

4) Tens of millions of people in India (especially in the regions adjacent
to Pakistan).

To these you can add dictionary developers, teachers and linguists all over
the world who work on developing cultural contacts with the regions and
peoples listed above and need convenient software for correctly sorting
Arabic words and names.
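The difference between the two orders can be made concrete with a small
sketch. It is only an illustration of the ordering problem, not the UCA
itself: the names make_sort_key, ARAB_TAIL and PERSIAN_TAIL are hypothetical,
and letters other than the last three are simply compared by code point,
which is an assumption made to keep the example short.

```python
# Illustration only: this is not the Unicode Collation Algorithm.
HEH, WAW, YEH = "\u0647", "\u0648", "\u064A"
BEH = "\u0628"

# The last three letters of the alphabet in the two traditions.
ARAB_TAIL = [HEH, WAW, YEH]      # order used in Arab countries
PERSIAN_TAIL = [WAW, HEH, YEH]   # order used in Iran, Afghanistan, etc.


def make_sort_key(tail):
    """Build a sort key that ranks the letters in `tail` after all others."""
    rank = {letter: i for i, letter in enumerate(tail)}

    def sort_key(word):
        # Assumption: every other character is compared by code point
        # and sorts before the three tail letters.
        return [(1, rank[ch]) if ch in rank else (0, ord(ch)) for ch in word]

    return sort_key


words = [BEH + HEH, BEH + WAW, BEH + YEH]
arab_sorted = sorted(words, key=make_sort_key(ARAB_TAIL))
persian_sorted = sorted(words, key=make_sort_key(PERSIAN_TAIL))
```

With the same three two-letter words, the first key keeps Beh+Heh ahead of
Beh+Waw, while the second puts Beh+Waw first, which is the order a Persian
dictionary user would expect.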

The ordering of letters in the new collation is presented in
www.unicode.org/reports/tr10/allkeys-3.1.1.txt. It does not contradict the
recent Iranian official document Dastur-e xatt-e Farsi mosavvab-e
Farhangestan-e zaban-o adab-e Farsi (Instructions on Persian Script adopted
by the Academy of Persian Language and Literature), Tehran, 2002, with the
exception of the problem mentioned above (see p.19).

To simplify the question, let's pick out the 30 positions in the All Keys
Table from code 0647 up to FBA4 that represent the various kinds of the
letter Heh and call them Group 1. Then pick out the next 29 positions, from
code 0648 up to 06CF, that represent the various kinds of the letter Waw and
call them Group 2. IMHO somewhere (maybe at a higher level of collation) a
branch should be provided where Group 1 follows Group 2.

Now that we are developing a new Persian-Russian Dictionary with 100,000
entries, the existing collation algorithm can be used only as a draft
procedure followed by light years of manual corrections :-).

Thus our problem can be solved in a few simple steps:

Step 1. The authors of the New Collation Algorithm start thinking of it as a
real problem, not a fairy tale told by me :-).

Step 2. They state somewhere in Unicode documentation that some categories
of users can get CUR (Culturally Unexpected Results).

Step 3. Official ISO representatives provide necessary proposals.

Step 4. Big software manufacturers like Microsoft make some corrections for
dictionary developers.

Thank you,

Vladimir Ivanov


