Re: UTS #10: Unicode Collation Algorithm (UCA)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Sun Jul 21 2002 - 02:41:24 EDT


> Step 1. The authors of the New Collation Algorithm start thinking of it as a
> real problem, not a fairy tale told by myself :-).

The authors and the Unicode Technical Committee are well aware that the
Default Unicode Collation Element Table (DUCET) will require tailoring to
meet the ordering requirements of a great many languages.

> Step 2. They state somewhere in Unicode documentation that some categories
> of users can get CUR (Culturally Unexpected Results).

That is already done. See
http://www.unicode.org/unicode/reports/tr10/#Non-Goals, item #6 (which is
repeated in various forms in other places):

"linguistic applicability: to meet most user expectations, a linguistic
tailoring is needed. For more information, see §5 Tailoring."

For example, in ICU (a particular implementation of the UCA), you can tailor
the base table as you wish, including for a Persian collation. (ICU is not
the only implementation, but it is the one that I am most familiar with.)
For German Phonebook ordering, for instance, the tailoring is given by:
&ae <<< ä
&AE <<< Ä
&oe <<< ö
&OE <<< Ö
&ue <<< ü
&UE <<< Ü
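
As a rough illustration of what these six rules achieve, here is a small
Python sketch. It is not ICU and not the UCA: it only mimics the effect with
a two-part sort key, where the first part treats each umlaut as its
two-letter expansion and the second part breaks ties so that the plain
spelling comes first. The names EXPANSIONS and phonebook_key are made up for
this example, and case differences are ignored for brevity.

```python
# Simplified sketch only: NOT the UCA and NOT ICU.
# It imitates the effect of rules such as "&ae <<< ä" with a two-part key.

# Hypothetical table derived from the six tailoring rules quoted above.
EXPANSIONS = {
    "ä": "ae", "Ä": "AE",
    "ö": "oe", "Ö": "OE",
    "ü": "ue", "Ü": "UE",
}


def phonebook_key(word):
    """Return (primary, tertiary) for a German phonebook-like ordering."""
    primary = []   # base letters, lowercased (assumption: case is ignored)
    tertiary = []  # one flag per source character: 0 = plain, 1 = umlaut
    for ch in word:
        expanded = EXPANSIONS.get(ch)
        if expanded is None:
            primary.append(ch.lower())
            tertiary.append(0)
        else:
            primary.append(expanded.lower())
            tertiary.append(1)
    # Tuples compare element by element, so the tertiary flags only matter
    # when the primary strings are equal.
    return ("".join(primary), tuple(tertiary))


names = ["Müller", "Mutter", "Muller", "Mueller"]
print(sorted(names, key=phonebook_key))
```

With this key, "Müller" sorts directly after "Mueller" and before "Muller",
whereas a plain code point sort would put "Müller" last.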

(The syntax for the tailoring is described in the ICU User Guide. It is a
superset of what is in Java, but for your purposes everything you need is
the same.) Now, I just checked on Persian, and it appears that ICU doesn't
tailor the way you want; you could file a bug to change it to the right
ordering. (We also periodically forward such changes to the Java people, in
case they wish to incorporate them.)
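
To make the requested Persian ordering concrete (Waw, Heh, Yeh instead of
Heh, Waw, Yeh), here is a minimal Python sketch. It rests on loud
assumptions: it uses code point order as a stand-in for the default table,
it swaps only the two base letters U+0647 and U+0648 rather than the whole
groups of Heh and Waw variants, and the names SWAP, arabic_key and
persian_key are invented for this example. A real tailoring would be
expressed as rules for a collation implementation such as ICU.

```python
# Simplified sketch only: code point order stands in for the default
# collation table, and only the two base letters are swapped.

HEH = "\u0647"  # ARABIC LETTER HEH
WAW = "\u0648"  # ARABIC LETTER WAW
YEH = "\u064A"  # ARABIC LETTER YEH

# Hypothetical tailoring: each letter borrows the other's position.
SWAP = {HEH: WAW, WAW: HEH}


def arabic_key(word):
    """Untailored order (assumption: plain code point order)."""
    return [ord(ch) for ch in word]


def persian_key(word):
    """Tailored order in which Waw sorts before Heh."""
    return [ord(SWAP.get(ch, ch)) for ch in word]


letters = [YEH, HEH, WAW]
print([hex(ord(c)) for c in sorted(letters, key=arabic_key)])
print([hex(ord(c)) for c in sorted(letters, key=persian_key)])
```

The first line printed lists Heh, Waw, Yeh; the second lists Waw, Heh, Yeh.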

ICU User Guide:
http://oss.software.ibm.com/icu/userguide/Collate_Customization.html
German Phonebook tailoring:
http://oss.software.ibm.com/cgi-bin/icu/lx/en_US/?_=de__PHONEBOOK
(you can navigate to other locale data from there)
Filing a bug: http://www.jtcsv.com/cgibin/icu-bugs

The Unicode Consortium itself does not have a set of data for tailorings for
different languages / locales. To get any particular vendor to fix its
orderings, you would probably have to contact that vendor. There is an ISO
registry, but it is not very good and is mostly ignored.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Vladimir Ivanov" <iranorus@online.ru>
To: "Mark Davis" <mark.davis@jtcsv.com>
Cc: <unicode@unicode.org>
Sent: Saturday, July 20, 2002 22:23
Subject: Re: UTS #10: Unicode Collation Algorithm (UCA)

> In our discussion about the new Unicode Collation Algorithm presented on
> www.unicode.org/reports/tr10/ one old question remains.
>
> First of all, it is a great job and we must appreciate it at its true value.
> Among the most significant changes "Notes on notes on linguistic
> applicability" are mentioned. In paragraph "3.1 Linguistic Features" it is
> written "Linguistic requirements of collation are covered in more detail in
> The Unicode Standard, Version 3.0".
>
> In Section 5.17 Sorting and Searching of the latter document we can find
> interesting general reasoning about Culturally Expected Sorting. For
> example, East Asian ideographs, phonetic sorting of Han characters, German
> and Swedish types of sorting 'a' with diaeresis etc are mentioned. I think
> it is important to mention here the existence of 2 different types of Arabic
> sorting as well.
>
> As I have pointed out earlier, in Arabic Alphabet, which is used in Arab
> countries, 3 last letters are He(h), Waw and Ye(h). Whereas in countries and
> languages listed below the order is Waw, He(h) and Ye(h). The latter
> ordering is applicable to:
>
> 1) Iran, Afghanistan and Pakistan, where the Arabic script is official.
> Overall population ca 230 millions.
>
> 2) Kurdish in Iran, Iraq, Turkey and some regions of former Soviet Union.
>
> 3) Tajikistan, where the Arabic script is semi-official (at least all the
> pupils and students study it).
>
> 4) Dozens of millions people in India (especially in the regions adjacent to
> Pakistan).
>
> You can add dictionary developers, teachers and linguists all over the world
> who work for the development of cultural contacts with the regions and
> peoples listed above and need convenient software for right sorting Arabic
> words and names.
>
> The ordering of letters in the new collation is presented in
> www.unicode.org/reports/tr10/allkeys-3.1.1.txt. It has no contradictions
> with the recent Iranian official document Dastur-e xatt-e Farsi mosavvab-e
> Farhangestan-e zaban-o adab-e Farsi (Instructions on Persian Script adopted
> by Persian Academy of Sciences), Tehran, 2002 at the exception of the
> problem mentioned above (see p.19).
>
> To simplify the question let's pick out 30 positions from the All Keys Table
> beginning with code 0647 up to FBA4 that represent various kinds of letter
> Heh and call it Group 1. Then pick out next 29 positions beginning with code
> 0648 up to 06CF that represent various kinds of letter Waw and call it Group
> 2. IMHO somewhere (may be on a higher level of collation) a branch should be
> provided where Group 1 follows Group 2.
>
> Now that we are developing a new Persian-Russian Dictionary with 100,000
> entries the existing collation algorithm can be used only as a draft
> procedure with light years of manual corrections :-).
>
> Thus our problem can be solved in a few simple steps:
>
> Step 1. The authors of the New Collation Algorithm start thinking of it as a
> real problem, not a fairy tale told by myself :-).
>
> Step 2. They state somewhere in Unicode documentation that some categories
> of users can get CUR (Culturally Unexpected Results).
>
> Step 3. Official ISO representatives provide necessary proposals.
>
> Step 4. Big software manufacturers like Microsoft make some corrections for
> dictionary developers.
>
>
>
> Thank you,
>
> Vladimir Ivanov
>
>
>


