Re: Unicode collation algorithm - Khmer/Cambodian

From: Maurice Bauhahn (bauhahnm@clara.net)
Date: Fri Feb 09 2001 - 23:40:48 EST


Hello Ken,

Thank you for clearly explaining the problems involved here. I largely agree
with you...especially that UCA and 14651 are not up to the mark!

However, I believe that these standards should at least have a mechanism to
plug in the requirements of syllabic sorting languages such as Khmer (accept a
syllable, distribute it to columns). Granted, this is language-specific (but so
are other collations). My immediate concern, however, is that even if I gave
UCA/14651 single syllables, wouldn't each subcomponent (base consonant, first
subscript, second subscript, vowel [which may be composed with a sign], sign1,
sign2) have to be handled as a separate column? Again, I do not see how the
huge number of combinations could be handled productively by either standard
(both of which appear to me to sort upon precomposed combinations).
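
To make the column question concrete, here is a minimal Python sketch. It is
not anything defined by UCA or 14651. It assumes the six subcomponents are
compared in the order listed above and that a missing component sorts before
any present one. The Syllable type, the rank tables and the placeholder names
(KA, KHA, AA, S1 and so on) are hypothetical stand-ins, not real Khmer
collation data.

```python
from typing import NamedTuple, Optional, Tuple


class Syllable(NamedTuple):
    """One already-parsed syllable; only the base consonant is mandatory."""
    base: str
    sub1: Optional[str] = None
    sub2: Optional[str] = None
    vowel: Optional[str] = None
    sign1: Optional[str] = None
    sign2: Optional[str] = None


# Toy rank tables (assumptions for illustration only).
CONSONANTS = {"KA": 1, "KHA": 2, "KO": 3}
VOWELS = {"AA": 1, "I": 2}
SIGNS = {"S1": 1, "S2": 2}

# One small table per column instead of one table of every combination.
TABLES = {
    "base": CONSONANTS,
    "sub1": CONSONANTS,
    "sub2": CONSONANTS,
    "vowel": VOWELS,
    "sign1": SIGNS,
    "sign2": SIGNS,
}


def syllable_key(s: Syllable) -> Tuple[int, ...]:
    """Build a six-column key; an absent component gets rank 0."""
    key = []
    for field in Syllable._fields:
        component = getattr(s, field)
        key.append(0 if component is None else TABLES[field][component])
    return tuple(key)


syllables = [
    Syllable("KHA"),
    Syllable("KA", vowel="I"),
    Syllable("KA", sub1="KO"),
    Syllable("KA"),
]
ordered = sorted(syllables, key=syllable_key)
```

With these toy tables the table sizes add up (3 + 3 + 3 + 2 + 2 = 13 entries)
rather than multiply, which is the property a per-column mechanism would need
if precomposed combinations are to be avoided.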

As Ken knows, we are still working on what would be an acceptable compromise for
expressing Khmer collation as an algorithm. There are other languages which
require analogous treatment (Myanmar and many other Indic scripts)...so global
solutions do require a mechanism to plug in answers to these technological
challenges.

Simply disregarding syllabic sorting languages in UCA/14651 would be analogous
(in the encoding arena) to setting up an alternative to Unicode/10646 - clearly
unacceptable.

Ever hopeful,

Maurice

Kenneth Whistler wrote:

> In response to Maurice's query, it is my assessment that neither
> the Unicode Collation Algorithm nor its technical equivalent, 14651,
> is up to the mark for syllabic-based ordering. Details may differ
> from case to case, but effectively, the issue is as follows.
>
> In syllabic-based ordering, you need first to be able to
> identify syllabic boundaries. Then you can weight all the syllables
> via a mechanism like the UCA, to get the appropriate multi-level
> weighting for primary letters, secondary accents, and so on. Then,
> to get the final ordering, you do what is effectively a multi-column
> sort, first on the first syllables, then on the second syllables, and
> so on.
>
> Conceptually, this is like putting all the strings in a spreadsheet,
> separating them out so you get one string per row, and one syllable
> per column, starting from the first column. Then put a formula in
> each cell that computes a multilevel weight for that syllable using
> the UCA. Then sort the computed values of all the cells with a multicolumn
> sort.
>
> So this is really a matter of higher level processing that depends on
> three things:
>
> A. A syllabic parser
> B. The UCA algorithm for weighting the pieces
> C. A multicolumn sorting mechanism
>
> While the multicolumn sorting is a natural for databases and the
> SQL standard, and while the UCA algorithm can probably be meaningfully
> tied to a UNICHAR datatype support in the SQL standard, I think the
> syllabic parsing aspect is out-of-bounds. That really is a language-specific
> issue that needs to be dealt with on a language-by-language and
> writing system by writing system basis, and is not a problem that
> ought to be tackled in something like the SQL standard (nor UCA,
> for that matter).
>
> --Ken
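
The three parts Ken lists (A, B, C) can be sketched in a few lines of Python.
This is only an illustration under loud assumptions: parse_syllables is a
hypothetical stand-in that expects syllables to be pre-separated by ".", and
weigh is a toy two-level weight (base letters, then combining marks), not the
UCA itself. Only the multicolumn comparison, done here by Python's ordinary
tuple ordering, is meant literally.

```python
import unicodedata


def parse_syllables(word):
    """A. Stand-in syllabic parser: syllables are pre-separated by '.'."""
    return word.split(".")


def weigh(syllable):
    """B. Toy multilevel weight for one syllable (NOT the real UCA).

    Primary level: base letters, case-folded. Secondary level: accents.
    """
    decomposed = unicodedata.normalize("NFD", syllable)
    primary = tuple(c.lower() for c in decomposed
                    if not unicodedata.combining(c))
    secondary = tuple(c for c in decomposed if unicodedata.combining(c))
    return (primary, secondary)


def sort_key(word):
    """C. One weight per column; tuples compare column by column."""
    return tuple(weigh(s) for s in parse_syllables(word))


words = ["ba.na", "ba", "b\u00e1.la", "ba.la", "a.zu"]
result = sorted(words, key=sort_key)
print(result)
```

Note where the accented word lands: its first syllable is fully weighted,
accent included, before the second column is consulted at all, so it sorts
after "ba.na". A whole-string multilevel comparison would instead look at
accents only after every base letter in the string had tied.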
>
> >
> > I'm afraid you have the wrong bloke here, Maurice. The technicality of my
> > query may have fooled you into thinking I'm a UTR#10 expert - far from it!
> >
> > All I can do is cc your query to the Unicode list - and wish you luck,
> > naturally :-)
> >
> > Mike.
> >
> > ----- Original Message -----
> > From: "Maurice Bauhahn" <mbauhahn@brio.com>
> > To: <Mike.Sykes@acm.org>
> > Sent: Thursday, February 08, 2001 2:27 PM
> > Subject: Unicode collation algorithm - interpretation
> >
> >
> > > Hello Mike, from the U.K.!
> > >
> > > What I have seen of the Unicode collation algorithm makes me wonder
> > > whether it will handle syllabic-based ordering! I specialise in Khmer,
> > > which has (at least) six levels of priority within each syllable.
> > > Hopefully SQL collation will be open to such difficult environments.
> > >
> > > http://www.bauhahnm.clara.net/KhmerSortingUnicodebeta.pdf
> > >
> > > Cheers,
> > >
> > > Maurice Bauhahn

--
Maurice Bauhahn
2 Meadow Way
Dorney Reach
MAIDENHEAD
SL6 0DS
United Kingdom
Home Tel: +44(0)1628 626068
Work Tel: +44(0)1932 878404
Home Email: bauhahnm@clara.net
Work Email: mbauhahn@brio.com


