Re: Sort in DBCS

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Apr 18 1996 - 15:35:24 EDT


>
> I am no authority, but I have heard several times that using
> Unicode order for collation is a bad idea (even Unicode says
> this).

The Unicode Consortium does not say this. The guidelines on
sorting to be published in the Unicode Standard, Version 2.0,
provide some principles and examples for language-specific
and culturally-relevant sorting. Then there are default fallback
principles:

   Default to a common or culturally neutral ordering for
   out-of-scope characters. [e.g. if you are collating
   Swedish, it would be fine to default to a culturally
   neutral ordering for Han characters -- but if you are
   collating Japanese phonetically, you would obviously
   have to do a complex collation involving dictionary
   lookup, and a culturally neutral ordering for the Han
   characters would not be appropriate.]

   Collate irrelevant characters in Unicode bit-order, in a
   specified position. [e.g. if you are simply sorting
   hex-formatted numbers, it doesn't matter what you do with the
   rest -- just use the Unicode bit-order.]
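
As an illustration of the two fallback principles above, here is a
minimal Python sketch. The alphabet table, the `sort_key` function, and
the sample words are hypothetical and chosen only for this example; a
real Swedish collation has more rules than a 29-letter table. In-scope
letters get tailored weights, and everything else falls back to code
point order in a specified position (here, after all tailored letters).

```python
import unicodedata

# Hypothetical tailoring: Swedish alphabet order for in-scope letters.
SWEDISH = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzåäö")
WEIGHT = {ch: i for i, ch in enumerate(SWEDISH)}

def sort_key(s):
    """Tailored weight for in-scope letters, code point for the rest."""
    key = []
    for ch in unicodedata.normalize("NFC", s).lower():
        if ch in WEIGHT:
            key.append((0, WEIGHT[ch]))   # in scope: tailored order
        else:
            key.append((1, ord(ch)))      # out of scope: code point order,
                                          # placed after all tailored letters
    return key

words = ["öl", "zebra", "äpple", "år", "漢字", "apa", "一"]

tailored = sorted(words, key=sort_key)
binary = sorted(words)  # Python compares str values by code point

print(tailored)
print(binary)
```

The two results differ only in the relative order of "år" and "äpple":
the tailoring puts å before ä, while code point order puts ä (U+00E4)
before å (U+00E5). The Han strings sort last, in code point order, in
both cases.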

Toyoshima-san was correct in stating that it is a good idea to default
the sorting of Han characters in Unicode to their binary order, because
the encoding of the Han characters was carefully devised to give them
a meaningful, but culturally neutral order.

>
> Apparently the order may be close for one of the Chinese's (traditional
> or simplified, I forget which), but even this should not be counted
> on.

As Toyoshima-san pointed out, it is traditional radical-stroke order.
The exact placement followed a series of rules depending primarily on
the order in the Kangxi dictionary, with subsidiary rules for characters
not in the Kangxi dictionary. [No, that is not a typo for Kanji: "Kangxi"
is the name of a Qing dynasty reign period, during which a large, official
Chinese dictionary was compiled and published.]
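
A small Python sketch can make the radical-stroke point concrete. The
seven characters and their Kangxi radical numbers below were entered by
hand for this illustration; they are not read from any Unicode data
file. Sorting the characters by code point yields them in ascending
radical order.

```python
# Kangxi radical numbers, entered by hand for this sketch.
KANGXI_RADICAL = {
    "一": 1, "人": 9, "口": 30, "山": 46, "水": 85, "火": 86, "龍": 212,
}

chars = ["龍", "水", "一", "火", "口", "人", "山"]

# Python compares str values by code point, so this is binary order.
by_code_point = sorted(chars)

for ch in by_code_point:
    print(f"U+{ord(ch):04X} {ch} radical {KANGXI_RADICAL[ch]}")

radicals = [KANGXI_RADICAL[ch] for ch in by_code_point]
```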

>
> HOWEVER, I believe that major database vendors like Oracle and Sybase
> will "sort" Unicode (actually UTF-8) using Unicode order (ie. they
> don't sort!).

The default sort order for Unicode data will certainly be in Unicode
binary order. However, the database vendors, including Sybase, provide
mechanisms for defining collation orders for databases. There is no
reason to suppose these mechanisms will not apply to Unicode, as well
as to other character sets supported by the databases. Given the
complexities of language-specific and culturally-dependent sorting
rules, though, it is unlikely that the particular collation orders you
have in mind will be delivered "in-the-box" with database software.
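
To show what a user-defined collation can look like in practice, here
is a minimal sketch using SQLite through Python's standard `sqlite3`
module. SQLite is only a stand-in; this is not the Sybase or Oracle
mechanism, and the collation name `swedish_sketch`, the weight table,
and the sample rows are all assumptions made for the example.

```python
import sqlite3

# Hypothetical tailoring table, for illustration only.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
WEIGHT = {ch: i for i, ch in enumerate(SWEDISH)}

def key(s):
    # Tailored weight for known letters, code point order for the rest.
    return [(0, WEIGHT[ch]) if ch in WEIGHT else (1, ord(ch))
            for ch in s.lower()]

def swedish_collation(a, b):
    # SQLite expects a negative, zero, or positive integer.
    ka, kb = key(a), key(b)
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
conn.create_collation("swedish_sketch", swedish_collation)
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("äpple",), ("år",), ("apa",)])

# Default: SQLite's built-in binary comparison of the stored text.
default_order = [r[0] for r in
                 conn.execute("SELECT w FROM words ORDER BY w")]

# Custom: the collation registered above.
custom_order = [r[0] for r in conn.execute(
    "SELECT w FROM words ORDER BY w COLLATE swedish_sketch")]

print(default_order)
print(custom_order)
conn.close()
```

The default query returns the rows in binary order, while the second
query returns them in the order imposed by the registered collation.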

>
> My question is: what to do about this?

Press the database vendors to provide default collations for common
languages that work on tables with text stored in Unicode. Also
press them to provide simpler mechanisms for defining and using
custom collations.

--Ken Whistler
Technical Director, Unicode, Inc.


