Re: Size of Weights in Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Fri, 15 Mar 2013 19:50:13 +0000

On Thu, 14 Mar 2013 19:13:43 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Thu, Mar 14, 2013 at 4:09 PM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:
>
> > On Thu, 14 Mar 2013 14:49:18 -0700
> > Markus Scherer <markus.icu_at_gmail.com> wrote:

> While variableTop="u2FD5" ...

> ... but a) this is a rarely used option and b) depends on the
> implementation, and c) it makes no practical sense to make letters
> ignorable.

That doesn't stop the LDML specification having the example locales
en-u-vt-0061 and en-u-vt-0061-0065. (I don't see what collating
elements have variable weight in one but not the other.)
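For reference, the two tags differ only in the value of the vt keyword. A minimal sketch of reading that value, assuming it is a hyphen-separated run of 4 to 6 digit hex code points that names a string (the helper name is hypothetical, and this is not a full BCP 47 parser):

```python
def variable_top_from_locale(tag):
    """Return the string named by the -u-vt- keyword, or None (illustrative helper)."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return None
    ext = subtags[subtags.index("u") + 1:]
    if "vt" not in ext:
        return None
    chars = []
    for sub in ext[ext.index("vt") + 1:]:
        # Assumption: each value subtag is 4 to 6 hex digits; anything else ends the value.
        if not (4 <= len(sub) <= 6) or any(c not in "0123456789abcdef" for c in sub):
            break
        chars.append(chr(int(sub, 16)))
    return "".join(chars) or None


print(repr(variable_top_from_locale("en-u-vt-0061")))
print(repr(variable_top_from_locale("en-u-vt-0061-0065")))
```

Under that reading the first tag names "a" and the second names the two-character string "ae"; what primary weight the latter should resolve to is the open question.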

> "Fractional weights" is nothing other than the "large weights"
> mechanism applied to byte-based weights of all levels. The UCA is
> already "fractional" for implicit primaries.

Not quite. The characterisation of variable weights knows nothing of
the concept, and that is the problem. One can envisage a *remapping*
of DUCET such that all non-Han characters get 'large weights',
reserving the single-number primary weights for Han characters. Changing
the range of variable weights parametrically (e.g. from 'up to symbols'
to 'up to punctuation') in such a remapping could be a nightmare.
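To make the contrast concrete: when every primary is a single number, the UCA test for a variable element is one comparison against the variable-top weight, so changing the range is just changing a threshold. A minimal sketch, with made-up weights and thresholds (none of these numbers come from DUCET):

```python
# Hypothetical single-number primaries, ordered spaces < punctuation < symbols < letters.
PRIMARY = {" ": 10, ",": 20, "-": 21, "+": 40, "a": 100}

TOP_PUNCTUATION = 29  # hypothetical highest punctuation primary
TOP_SYMBOL = 49       # hypothetical highest symbol primary


def is_variable(primary, variable_top):
    """An element is variable if its primary is non-zero and not above variable_top."""
    return 0 < primary <= variable_top


def variable_chars(variable_top):
    """List the characters treated as variable, in primary-weight order."""
    chars = [c for c, p in PRIMARY.items() if is_variable(p, variable_top)]
    return sorted(chars, key=PRIMARY.get)


print(variable_chars(TOP_PUNCTUATION))
print(variable_chars(TOP_SYMBOL))
```

Once a primary may be spread over several weight units, the same test would have to compare sequences rather than single numbers, which is where the difficulty described above comes in.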

There appears to be an algebraic characterisation of what sequences of
primaries could be treated as variable. Strings of primary
weights of collating elements can be decomposed into substrings of
primary weights of other collating elements, and an order-preserving
change of the irreducible substrings will preserve the order of the
collating elements. This is a consequence of how humans (or
just Unicode man?) generate primary weights, and does not apply to
collation elements in general. This decomposition is just a
reflection of the sometimes non-standard compatibility decomposition
used when devising weights. The irreducible substrings are the elements
that can be treated as variable. However, the ability to decompose is
fragile - it would not work if the Tamil script had been encoded as a
syllabary but collated as consonants and vowels.
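The order-preservation claim can be checked in its simplest case, where each single primary weight is taken as an irreducible unit. A minimal sketch, assuming made-up weights and a strictly increasing remapping (multi-weight irreducible substrings are not modelled here):

```python
def remap(key, mapping):
    """Replace each primary weight in a key by its remapped value."""
    return tuple(mapping[w] for w in key)


# Hypothetical primary-weight strings for five collating elements.
keys = [(30, 10), (10,), (20, 30), (10, 20), (30,)]

# Strictly increasing remapping: 10 < 20 < 30 becomes 5 < 700 < 701.
mapping = {10: 5, 20: 700, 30: 701}

before = sorted(keys)
after = sorted(keys, key=lambda k: remap(k, mapping))

# Lexicographic order of the keys is unchanged by the remapping.
print(before == after)
```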

> And pinning the variable-top value to the next following end of a
> reordering group, and no higher than the end of the primary-weight
> range for currency symbols.

Is there a CLDR ticket for this change to the meaning of variableTop?
I couldn't find one.
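The pinning rule quoted above can be sketched as follows. This is only an illustration of the rule as stated, with hypothetical group boundaries and function name; it is not taken from ICU or CLDR.

```python
from bisect import bisect_left

# Hypothetical last primary weight of each reordering group, in ascending order:
# space, punctuation, symbol, currency.
GROUP_ENDS = [0x0F, 0x3F, 0x5F, 0x6F]
CURRENCY_END = GROUP_ENDS[-1]


def pin_variable_top(requested):
    """Round a requested variable-top primary up to the next group end,
    but never beyond the end of the currency-symbol range."""
    i = bisect_left(GROUP_ENDS, requested)  # first group end >= requested
    if i == len(GROUP_ENDS):
        return CURRENCY_END
    return GROUP_ENDS[i]


print(hex(pin_variable_top(0x20)))
print(hex(pin_variable_top(0x100)))
```

With these made-up boundaries, a request inside the punctuation range is rounded up to the end of punctuation, and a request among the letters is capped at the end of the currency symbols.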

Richard.
Received on Fri Mar 15 2013 - 14:57:39 CDT
