Comparing Collations

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 22 Aug 2012 23:02:42 +0100

Is there any simple way of comparing UCA-conforming collations to
determine if they are equal? I'm assuming that the table is visible
and that the collating elements (ISO/IEC 14651 terminology for the
items looked up in the collation element table) are the same. (Black
box identification of the collation elements in finitely many steps is
impossible.)

The first step is to check that collating elements sort in the same
order, and with the same category of differences (I don't think there's
a real theoretical difficulty there - I think spurious levels can be
eliminated), but after that I'm floundering. All the nice algebraic
properties are associated with the collation elements (the n-tuples of
weights), not with the collating elements.
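A minimal sketch of that first step, assuming each table is visible as a mapping from a collating element (a string) to its list of collation elements (tuples of three weights). The names sort_key, relation and same_order_and_levels are hypothetical, and the key construction is a simplified form of the UCA sort key (one tuple per level, zero weights dropped, no backwards levels or variable weighting). Only the sign and the level of each difference are compared, never the weight values themselves, so the two tables need not share weights.

```python
from itertools import combinations

def sort_key(ces, levels=3):
    """Simplified sort key: one tuple of non-zero weights per level."""
    return tuple(
        tuple(ce[lvl] for ce in ces if ce[lvl] != 0)
        for lvl in range(levels)
    )

def relation(key_a, key_b):
    """Return (sign, level): sign is -1, 0 or 1; level is the first
    level (1-based) at which the keys differ, or 0 if they are equal."""
    for lvl, (wa, wb) in enumerate(zip(key_a, key_b), start=1):
        if wa != wb:
            return (-1 if wa < wb else 1, lvl)
    return (0, 0)

def same_order_and_levels(table_a, table_b):
    """True if every pair of collating elements compares the same way,
    and at the same level, in both tables."""
    if table_a.keys() != table_b.keys():
        return False
    keys_a = {s: sort_key(ces) for s, ces in table_a.items()}
    keys_b = {s: sort_key(ces) for s, ces in table_b.items()}
    return all(
        relation(keys_a[x], keys_a[y]) == relation(keys_b[x], keys_b[y])
        for x, y in combinations(table_a, 2)
    )

# Made-up weights, for illustration only.
table_a = {"a": [(10, 1, 1)], "A": [(10, 1, 2)], "b": [(20, 1, 1)]}
table_b = {"a": [(5, 7, 3)], "A": [(5, 7, 9)], "b": [(6, 7, 3)]}
table_c = {"a": [(5, 7, 3)], "A": [(5, 8, 3)], "b": [(6, 7, 3)]}

print(same_order_and_levels(table_a, table_b))  # same order, same levels
print(same_order_and_levels(table_a, table_c))  # "a" vs "A" differs at level 2, not 3
```

The pairwise check is quadratic in the number of collating elements; sorting once and comparing neighbours would be cheaper, but the pairwise form is easier to read. Passing this check says nothing yet about strings made of several collating elements.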

For an example of algebraic nastiness, "a" < "ab" and "bd" < "c", but
concatenating the two smaller strings and the two larger strings gives
"abd" > "abc".
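The example can be checked mechanically. The sketch below uses Python's built-in string comparison (code point order) as a stand-in for a collation, which is an assumption made only for illustration; for these ASCII letters it gives the ordering stated above.

```python
# Each inequality holds on its own under code point order.
assert "a" < "ab"
assert "bd" < "c"

# Concatenating the two smaller sides and the two larger sides
# does not preserve the inequality.
left = "a" + "bd"    # "abd"
right = "ab" + "c"   # "abc"
assert left > right
```

So x < y and u < v do not imply xu < yv for strings, which is why ordering results for individual collating elements do not carry over directly to their concatenations.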

As an example with just two Thai characters, in DUCET 6.1.0, we have,
amongst the collating elements:

ก < เก < เ (<U+0E01> < <U+0E40, U+0E01> < <U+0E40>)

The same applies in DUCET 6.2.0 (or at least in the current draft, Draft
9).

However, when we add other strings, we get, for example:

DUCET 6.1.0: ก < กเ <<< เก < เ
(<U+0E01> < <U+0E01, U+0E40> <<< <U+0E40, U+0E01> < <U+0E40>)

DUCET 6.2.0: ก < กเ = เก < เ
(<U+0E01> < <U+0E01, U+0E40> = <U+0E40, U+0E01> < <U+0E40>)
(The 4th-level weights do make a difference in the CLDR root collation
that UCA 6.2.0 will call up!)
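One way to see why the ordering of the three collating elements does not settle the ordering of these strings is to split each string into collating elements. The sketch below is a simplified greedy longest-match splitter; the function name split_collating_elements is hypothetical, and it handles contiguous matches only (no discontiguous contractions, no normalization).

```python
def split_collating_elements(text, elements):
    """Greedy longest-match split of text into collating elements.
    Contiguous matches only; raises if no element matches."""
    longest = max(len(e) for e in elements)
    out, i = [], 0
    while i < len(text):
        for n in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + n] in elements:
                out.append(text[i:i + n])
                i += n
                break
        else:
            raise ValueError(f"no collating element matches at index {i}")
    return out

KO_KAI = "\u0e01"   # U+0E01
SARA_E = "\u0e40"   # U+0E40

# The three collating elements from the example above.
elements = {KO_KAI, SARA_E, SARA_E + KO_KAI}

two = split_collating_elements(KO_KAI + SARA_E, elements)
one = split_collating_elements(SARA_E + KO_KAI, elements)
print(len(two), len(one))
```

<U+0E01, U+0E40> splits into two collating elements, while <U+0E40, U+0E01> is a single one, so how the two strings compare depends on the weights assigned to that contraction and not just on where it sorts among the collating elements.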

I can produce a similar example with the Roman numerals U+2170 SMALL
ROMAN NUMERAL ONE to U+2172 SMALL ROMAN NUMERAL THREE.

In the comparison I'm planning to make, there will not be a 1-1
correspondence of weights.

An algorithm (or approach) will be useful to me even if it requires that
in the two collations the same collating element expands to the same
number of non-null collation elements. (By null collation element, I
mean elements such as [0000.0000.0000].)
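The precondition in the last paragraph can be tested before attempting any comparison. This is a rough sketch under the same assumed table shape as before (collating element mapped to a list of weight tuples); the names non_null_count and expansion_mismatches are hypothetical, and a collation element is treated as null when all of its weights are zero.

```python
def non_null_count(ces):
    """Number of collation elements with at least one non-zero weight."""
    return sum(1 for ce in ces if any(ce))

def expansion_mismatches(table_a, table_b):
    """Collating elements present in both tables whose expansions differ
    in their number of non-null collation elements."""
    return sorted(
        s for s in table_a.keys() & table_b.keys()
        if non_null_count(table_a[s]) != non_null_count(table_b[s])
    )

# Made-up weights, for illustration only.
table_a = {"x": [(1, 1, 1), (0, 0, 0)], "y": [(2, 1, 1), (3, 1, 1)]}
table_b = {"x": [(9, 1, 1)], "y": [(8, 1, 1)]}

print(expansion_mismatches(table_a, table_b))
```

An empty result means the restricted case applies; any collating element listed is one where the two collations expand to different numbers of non-null collation elements.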

Richard.
Received on Wed Aug 22 2012 - 17:05:58 CDT
