Re: Unicode collation algorithm - Khmer/Cambodian

From: Mark Davis (markdavis34@home.com)
Date: Sat Feb 10 2001 - 12:47:43 EST


I have not been following this discussion up until now. Typically the issue
with syllables is like that with word-sorting. With word sorting, no matter
what is in the second word, any difference in the first word swamps it.
Example:

ab xyz ghi
abc def ghi
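
A minimal Python sketch of that word-by-word comparison, for illustration only. The helper name `word_key` is made up here, and plain code-point comparison stands in for real collation weights. It contrasts a key that respects word boundaries with one that ignores the spaces:

```python
def word_key(s):
    # Compare word by word: tuple comparison stops at the first
    # word that differs, so later words cannot override it.
    return tuple(s.split())

strings = ["abc def ghi", "ab xyz ghi"]

# Word-aware ordering: "ab" < "abc" decides, whatever follows.
by_words = sorted(strings, key=word_key)

# Ignoring the spaces compares "abxyzghi" with "abcdefghi",
# so the third letter ('x' vs 'c') decides instead.
ignoring_spaces = sorted(strings, key=lambda s: s.replace(" ", ""))

print(by_words)
print(ignoring_spaces)
```

The two keys put the same pair of strings in opposite orders, which is the kind of difference at stake with syllables as well.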

In many cases, UCA does handle syllabic encodings without preprocessing. If
the possible syllables form a reasonably small set, then they can be treated
as contractions. Otherwise, if the characters that can begin a syllable are
distinct from those that cannot, then proper assignment of weights works.
For example, suppose that A..Z are initials, and given weights 10..35, and
a..z are medials or finals, and given weights 40..65. Then we get the proper
order of sequences:

AbXyz
AbcDef
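
As a rough sketch of the weight assignment (not UCA itself; the function names and exact numbers are illustrative assumptions), the following Python gives each initial a weight starting at 10 and each medial/final a weight starting at 40, then compares strings by their weight sequences. A second key that does not distinguish initials from non-initials is included for contrast:

```python
import string

# Initials (A..Z) all weigh less than any medial/final (a..z).
INITIAL = {ch: 10 + i for i, ch in enumerate(string.ascii_uppercase)}
NON_INITIAL = {ch: 40 + i for i, ch in enumerate(string.ascii_lowercase)}
WEIGHTS = {**INITIAL, **NON_INITIAL}

def syllable_aware_key(s):
    # One primary weight per character, compared left to right.
    return [WEIGHTS[ch] for ch in s]

def initial_blind_key(s):
    # Contrast case: initials and non-initials share the same weights.
    return [NON_INITIAL[ch] for ch in s.lower()]

words = ["AbcDef", "AbXyz"]
by_weights = sorted(words, key=syllable_aware_key)
by_blind_weights = sorted(words, key=initial_blind_key)

print(by_weights)
print(by_blind_weights)
```

With the distinct weight ranges, the initial X of the second syllable in AbXyz is lighter than the final c in AbcDef, so the shorter first syllable Ab sorts first. With the initial-blind weights, c is compared against x and the order flips.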

And one can apply a combination of techniques. However, this may not be the
particular problem for Khmer. Could you provide a minimal example that
illustrates the problem? Ideally such an example should:

- consist of a minimal number of strings, where no assignment of weights in
UCA could yield the correct ordering. These should use ASCII letters to
represent the Khmer characters, just for legibility in email.

- be historically valid. Books published in Khmer, say 30 years ago, would
consistently show the order in the examples.

- be algorithmically determinable (as opposed to human determination that
St. James sorts as Saint James, while James St. sorts as James Street). UCA
is clearly not designed for that!

Mark

----- Original Message -----
From: "Maurice Bauhahn" <bauhahnm@clara.net>
To: "Unicode List" <unicode@unicode.org>
Sent: Friday, February 09, 2001 20:17
Subject: Re: Unicode collation algorithm - Khmer/Cambodian

> Hello Ken,
>
> Thank you for clearly explaining the problems involved here. I largely agree
> with you...especially that UCA and 14651 are not up to the mark!
>
> However I believe that these standards should at least have a mechanism to plug
> in the requirements of syllabic sorting languages such as Khmer (accept a
> syllable, distribute to columns). Granted it is language specific (but so are
> other collations). My immediate concern, however, is that even if I gave to
> UCA/14651 single syllables, wouldn't each subcomponent (base consonant, first
> subscript, second subscript, vowel [which may be composed with a sign], sign1,
> sign2) have to be handled as a separate column? Again I do not see how the huge
> number of combinations could be handled productively by either standard (which
> appear to me to sort upon precomposed combinations).
>
> As Ken knows we are still working on what would be an acceptable compromise for
> expressing Khmer collation as an algorithm. There are other languages which
> require analogous treatment (Myanmar and many other Indic scripts)...so global
> solutions do require a mechanism to plug in these technological challenges.
>
> Simply disregarding syllabic sorting languages in UCA/14651 would be analogous
> (in the encoding arena) to setting up an alternative to Unicode/10646 - clearly
> unacceptable.
>
> Ever hopeful,
>
> Maurice
>
> Kenneth Whistler wrote:
>
> > In response to Maurice's query, it is my assessment that neither
> > the Unicode Collation Algorithm nor its technical equivalent, 14651,
> > is up to the mark for syllabic-based ordering. Details may differ,
> > from case to case, but effectively, the issue is as follows.
> >
> > In syllabic-based ordering, you need first to be able to
> > identify syllabic boundaries. Then you can weight all the syllables
> > via a mechanism like the UCA, to get the appropriate multi-level
> > weighting for primary letters, secondary accents, and so on. Then,
> > to get the final ordering, you do what is effectively a multi-column
> > sort, first on the first syllables, then on the second syllables, and
> > so on.
> >
> > Conceptually, this is like putting all the strings in a spreadsheet,
> > separating them out so you get one string per row, and one syllable
> > per column, starting from the first column. Then put a formula in
> > each cell that computes a multilevel weight for that syllable using
> > the UCA. Then sort the computed values of all the cells with a multicolumn
> > sort.
> >
> > So this is really a matter of higher level processing that depends on
> > three things:
> >
> > A. A syllabic parser
> > B. The UCA algorithm for weighting the pieces
> > C. A multicolumn sorting mechanism
> >
> > While the multicolumn sorting is a natural for databases and the
> > SQL standard, and while the UCA algorithm can probably be meaningfully
> > tied to a UNICHAR datatype support in the SQL standard, I think the
> > syllabic parsing aspect is out-of-bounds. That really is a language-specific
> > issue that needs to be dealt with on a language-by-language and
> > writing system by writing system basis, and is not a problem that
> > ought to be tackled in something like the SQL standard (nor UCA,
> > for that matter).
> >
> > --Ken
> >
> > >
> > > I'm afraid you have the wrong bloke here, Maurice. The technicality of my
> > > query may have fooled you into thinking I'm a UTR#10 expert - far from it!
> > >
> > > All I can do is cc your query to the Unicode list - and wish you luck,
> > > naturally :-)
> > >
> > > Mike.
> > >
> > > ----- Original Message -----
> > > From: "Maurice Bauhahn" <mbauhahn@brio.com>
> > > To: <Mike.Sykes@acm.org>
> > > Sent: Thursday, February 08, 2001 2:27 PM
> > > Subject: Unicode collation algorithm - interpretation
> > >
> > >
> > > > Hello Mike, from the U.K.!
> > > >
> > > > What I have seen of the Unicode collation algorithm makes me wonder whether
> > > > it will handle syllabic-based ordering! I specialise in Khmer,
> > > > which has (at least) six levels of priority within each syllable. Hopefully
> > > > SQL collation will be open to such difficult environments.
> > > >
> > > > http://www.bauhahnm.clara.net/KhmerSortingUnicodebeta.pdf
> > > >
> > > > Cheers,
> > > >
> > > > Maurice Bauhahn
>
> --
> Maurice Bauhahn
> 2 Meadow Way
> Dorney Reach
> MAIDENHEAD
> SL6 0DS
> United Kingdom
> Home Tel: +44(0)1628 626068
> Work Tel: +44(0)1932 878404
> Home Email: bauhahnm@clara.net
> Work Email: mbauhahn@brio.com
>
>


