Re: UTS #10: Unicode Collation Algorithm (UCA)

From: Mark Davis
Date: Sun Jul 21 2002 - 02:41:24 EDT

> Step 1. The authors of the New Collation Algorithm start thinking of it as
> real problem, not a fairy tale told by myself :-).

The authors and the Unicode Technical Committee are well aware that the
Default Unicode Collation Element Table (DUCET) will require tailoring to
meet the ordering requirements of a great many languages.

> Step 2. They state somewhere in Unicode documentation that some categories
> of users can get CUR (Culturally Unexpected Results).

That is already done. See item #6 (which is
repeated in various forms in other places):

"linguistic applicability: to meet most user expectations, a linguistic
tailoring is needed. For more information, see §5 Tailoring."

For example, in ICU (a particular implementation of the UCA), you can tailor
the base table as you wish, for a Persian collation. (ICU is not the only
implementation, but it is the one that I am most familiar with.) E.g. for
German Phonebook ordering, the tailoring is given by:
&ae <<< ä
&AE <<< Ä
&oe <<< ö
&OE <<< Ö
&ue <<< ü
&UE <<< Ü
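
As a rough illustration of what the rules above achieve, here is a toy sketch
in Python. It is not ICU's implementation or API: it only models the idea
that each umlaut gets the same primary weight as its two-letter expansion and
differs from it at a lower level. The names EXPANSIONS and phonebook_key are
hypothetical, and case is simply ignored.

```python
# Toy model of the German phonebook tailoring; not ICU code.
# Assumption: only the three umlauts need special handling, case is ignored.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def phonebook_key(word):
    """Return (primary, tertiary) so umlauts sort with their expansions."""
    primary = []
    tertiary = []
    for ch in word:
        lower = ch.lower()
        expanded = EXPANSIONS.get(lower, lower)
        primary.append(expanded)
        # One flag per expanded letter: 1 if it came from an umlaut, else 0.
        # This only breaks ties between words with equal primary strings.
        tertiary.extend([1 if lower in EXPANSIONS else 0] * len(expanded))
    return ("".join(primary), tertiary)


names = ["Muller", "Müller", "Mugler", "Mueller"]
ordered = sorted(names, key=phonebook_key)
print(ordered)
```

In this sketch "Müller" lands directly after "Mueller" and before "Mugler",
instead of sorting near "Muller" as it would if ü were treated as a variant
of u.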

The syntax for the tailoring is described in the ICU User Guide. (It is a
superset of what is in Java, but for your purposes everything you need is
the same.) Now, I just checked on Persian, and it appears that ICU doesn't
tailor the way you want; you could file a bug to change it to the right
ordering (we also periodically forward such changes to the Java people, in
case they wish to incorporate them).
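
To make the Persian case concrete, here is a minimal Python sketch of the
reordering being asked for. It is not ICU or UCA code: code point order
stands in for the default table, and the tailoring is modelled as nothing
more than exchanging the weights of HEH (U+0647) and WAW (U+0648). The
function names and sample words are illustrative assumptions, and the many
variant forms of both letters are not covered.

```python
# Toy model only: code point order stands in for the default table.
HEH = "\u0647"  # ARABIC LETTER HEH
WAW = "\u0648"  # ARABIC LETTER WAW


def default_key(word):
    """Weights follow code points, so HEH sorts before WAW."""
    return [ord(ch) for ch in word]


# Hypothetical tailoring: exchange the weights of HEH and WAW.
SWAPPED = {HEH: ord(WAW), WAW: ord(HEH)}


def persian_key(word):
    """Same weights as default_key, except WAW now sorts before HEH."""
    return [SWAPPED.get(ch, ord(ch)) for ch in word]


heh_word = HEH + "\u0648\u0627"  # heh, waw, alef
waw_word = WAW + "\u0627\u0645"  # waw, alef, meem

arabic_order = sorted([waw_word, heh_word], key=default_key)
persian_order = sorted([heh_word, waw_word], key=persian_key)
```

With the default key the word beginning with HEH comes first; with the
tailored key the word beginning with WAW comes first, which is the order
expected in a Persian dictionary.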

ICU User Guide:
German Phonebook tailoring:
(you can navigate to other locale data from there)
Filing a bug:

The Unicode Consortium itself does not have a set of data for tailorings for
different languages / locales. To get any particular vendor to fix its
orderings, you would probably have to contact that vendor. There is an ISO
registry, but it is not very good and mostly ignored.

► “Eppur si muove” ◄

----- Original Message -----
From: "Vladimir Ivanov" <>
To: "Mark Davis" <>
Cc: <>
Sent: Saturday, July 20, 2002 22:23
Subject: Re: UTS #10: Unicode Collation Algorithm (UCA)

> In our discussion about the new Unicode Collation Algorithm presented on
> one old question remains.
> First of all, it is a great job and we must appreciate it at its true value.
> Among the most significant changes "Notes on notes on linguistic
> applicability" are mentioned. In paragraph "3.1 Linguistic Features" it is
> written "Linguistic requirements of collation are covered in more detail in
> The Unicode Standard, Version 3.0".
> In Section 5.17 Sorting and Searching of the latter document we can find
> interesting general reasoning about Culturally Expected Sorting. For
> example, East Asian ideographs, phonetic sorting of Han characters, German
> and Swedish types of sorting 'a' with diaeresis etc are mentioned. I think
> it is important to mention here the existence of 2 different types of
> sorting as well.
> As I have pointed out earlier, in Arabic Alphabet, which is used in Arab
> countries, 3 last letters are He(h), Waw and Ye(h). Whereas in countries
> languages listed below the order is Waw, He(h) and Ye(h). The latter
> ordering is applicable to:
> 1) Iran, Afghanistan and Pakistan, where the Arabic script is official.
> Overall population ca 230 millions.
> 2) Kurdish in Iran, Iraq, Turkey and some regions of former Soviet Union.
> 3) Tajikistan, where the Arabic script is semi-official (at least all the
> pupils and students study it).
> 4) Tens of millions of people in India (especially in the regions adjacent to
> Pakistan).
> You can add dictionary developers, teachers and linguists all over the world
> who work for the development of cultural contacts with the regions and
> peoples listed above and need convenient software for right sorting Arabic
> words and names.
> The ordering of letters in the new collation is presented in
> It has no contradictions
> with the recent Iranian official document Dastur-e xatt-e Farsi mosavvab-e
> Farhangestan-e zaban-o adab-e Farsi (Instructions on Persian Script approved
> by Persian Academy of Sciences), Tehran, 2002, with the exception of the
> problem mentioned above (see p.19).
> To simplify the question let's pick out 30 positions from the All Keys
> beginning with code 0647 up to FBA4 that represent various kinds of letter
> Heh and call it Group 1. Then pick out next 29 positions beginning with
> 0648 up to 06CF that represent various kinds of letter Waw and call it Group
> 2. IMHO somewhere (maybe on a higher level of collation) a branch should be
> provided where Group 1 follows Group 2.
> Now that we are developing a new Persian-Russian Dictionary with 100,000
> entries the existing collation algorithm can be used only as a draft
> procedure with light years of manual corrections :-).
> Thus our problem can be solved in a few simple steps:
> Step 1. The authors of the New Collation Algorithm start thinking of it as
> real problem, not a fairy tale told by myself :-).
> Step 2. They state somewhere in Unicode documentation that some categories
> of users can get CUR (Culturally Unexpected Results).
> Step 3. Official ISO representatives provide necessary proposals.
> Step 4. Big software manufacturers like Microsoft make some corrections
> dictionary developers.
> Thank you,
> Vladimir Ivanov
