Re: Unicode collation algorithm - Khmer/Cambodian

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Feb 09 2001 - 21:39:45 EST


In response to Maurice's query, it is my assessment that neither
the Unicode Collation Algorithm nor its technical equivalent,
ISO/IEC 14651, is up to the mark for syllabic-based ordering. Details
may differ from case to case, but in essence the issue is as follows.

In syllabic-based ordering, you need first to be able to
identify syllabic boundaries. Then you can weight all the syllables
via a mechanism like the UCA, to get the appropriate multi-level
weighting for primary letters, secondary accents, and so on. Then,
to get the final ordering, you do what is effectively a multi-column
sort, first on the first syllables, then on the second syllables, and
so on.

Conceptually, this is like putting all the strings in a spreadsheet,
separating them out so you get one string per row, and one syllable
per column, starting from the first column. Then put a formula in
each cell that computes a multilevel weight for that syllable using
the UCA. Then sort the computed values of all the cells with a multicolumn
sort.

So this is really a matter of higher level processing that depends on
three things:

A. A syllabic parser
B. The UCA for weighting the pieces
C. A multicolumn sorting mechanism
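
As a rough illustration of how A, B, and C fit together, here is a
minimal Python sketch. Everything in it is an assumption made for the
example: split_syllables is a hypothetical stand-in that expects
syllables to be pre-marked with "-" (a real parser would be
language-specific), and syllable_weight is a toy two-level weighting
(base letters, then combining marks), not the actual UCA. The
multicolumn sort falls out of ordinary tuple comparison.

```python
import unicodedata

def split_syllables(word, sep="-"):
    # (A) Hypothetical syllabic parser: assumes boundaries are already
    # marked with `sep`. Real parsing is language-specific.
    return word.split(sep)

def syllable_weight(syllable):
    # (B) Toy stand-in for UCA weighting, NOT the real algorithm:
    # primary level = case-folded base letters,
    # secondary level = combining marks (accents).
    decomposed = unicodedata.normalize("NFD", syllable)
    primary = tuple(ch.casefold() for ch in decomposed
                    if not unicodedata.combining(ch))
    secondary = tuple(ch for ch in decomposed
                      if unicodedata.combining(ch))
    return (primary, secondary)

def sort_key(word):
    # (C) One multilevel weight per syllable ("one cell per column").
    # Tuples compare column by column, which gives the multicolumn sort.
    return tuple(syllable_weight(s) for s in split_syllables(word))

words = ["ka-to", "kat-a", "ka-ta", "k\u00e1-ta"]
ordered = sorted(words, key=sort_key)
print(ordered)
```

Note that in this scheme an accent difference in the first syllable
outranks a letter difference in the second syllable, because each
syllable is fully weighted before the next column is consulted.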

While multicolumn sorting is a natural fit for databases and the
SQL standard, and while the UCA can probably be meaningfully tied
to UNICHAR datatype support in the SQL standard, I think the
syllabic parsing aspect is out of bounds. That really is a language-specific
issue that needs to be dealt with on a language-by-language and
writing-system-by-writing-system basis, and is not a problem that
ought to be tackled in something like the SQL standard (nor in the UCA,
for that matter).

--Ken

>
> I'm afraid you have the wrong bloke here, Maurice. The technicality of my
> query may have fooled you into thinking I'm a UTR#10 expert - far from it!
>
> All I can do is cc your query to the Unicode list - and wish you luck,
> naturally :-)
>
> Mike.
>
> ----- Original Message -----
> From: "Maurice Bauhahn" <mbauhahn@brio.com>
> To: <Mike.Sykes@acm.org>
> Sent: Thursday, February 08, 2001 2:27 PM
> Subject: Unicode collation algorithm - interpretation
>
>
> > Hello Mike, from the U.K.!
> >
> > What I have seen of the Unicode collation algorithm makes me wonder
> > whether it will handle syllabic-based ordering! I specialise in Khmer,
> > which has (at least) six levels of priority within each syllable.
> > Hopefully SQL collation will be open to such difficult environments.
> >
> > http://www.bauhahnm.clara.net/KhmerSortingUnicodebeta.pdf
> >
> > Cheers,
> >
> > Maurice Bauhahn


