Re: Size of Weights in Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Fri, 15 Mar 2013 21:12:48 -0700

On Fri, Mar 15, 2013 at 6:52 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> > The "fractional" refers to the same kind of mechanism as the "large
> > weight values" in the UCA spec.
>
> Yes. The problem is that formally the UCA clearly treats 'large
> weights' as being in multiple collation elements, whereas, in various
> places, for transforming collation element tables properly, one needs
> them to be treated as being in a single collation element.
>

Correct, that's where the complexities that I mentioned come from. ICU's
code has to check whether a CE is a "continuation CE" before deciding whether
to apply the script-reordering permutation, the uppercase-first permutation,
etc.
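
To illustrate why that check is needed, here is a minimal sketch. The CE
layout, the `continuation` flag and the lead-byte map are all hypothetical and
are not ICU's actual data structures. The idea is only that a permutation of
primary lead bytes must skip continuation CEs, because their "primary" field
holds trailing units of the previous weight rather than a lead byte.

```python
from typing import NamedTuple

class CE(NamedTuple):
    """Hypothetical collation element; not ICU's real layout."""
    primary: int        # one 16-bit primary weight unit
    secondary: int
    tertiary: int
    continuation: bool  # True if this CE only carries trailing weight units

def reorder_primaries(ces, lead_byte_map):
    """Apply a lead-byte permutation (e.g. script reordering) to primaries.

    Continuation CEs and CEs with a zero primary are passed through unchanged.
    """
    out = []
    for ce in ces:
        if ce.continuation or ce.primary == 0:
            out.append(ce)
            continue
        lead = ce.primary >> 8
        new_lead = lead_byte_map.get(lead, lead)
        out.append(ce._replace(primary=(new_lead << 8) | (ce.primary & 0xFF)))
    return out

# Hypothetical permutation that swaps lead bytes 0x30 and 0x50.
swap = {0x30: 0x50, 0x50: 0x30}
ces = [CE(0x3012, 0x20, 0x02, False), CE(0x5099, 0, 0, True)]
reordered = reorder_primaries(ces, swap)
```

Without the `continuation` test, the second CE above would have its unit
rewritten from 0x5099 to 0x3099, which would corrupt the large weight it
belongs to.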

> > The point is that no sequence of
> > units (8-bit, 16-bit or whatever the implementation uses) can be an
> > exact prefix of another sequence.
>
> That's only for efficiency.

No, it's critical for correctness.
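
A small sketch of what goes wrong, with made-up weights (the unit values
below are assumptions chosen for illustration, not taken from DUCET or CLDR).
Each weight is a tuple of units. If one weight is an exact prefix of another,
comparing the concatenated units can disagree with comparing weight by weight.

```python
def flatten(weights):
    """Concatenate the units of all weights, as a sort key would."""
    return tuple(unit for weight in weights for unit in weight)

def is_prefix(a, b):
    """True if unit sequence a is a proper prefix of unit sequence b."""
    return len(a) < len(b) and b[:len(a)] == a

def prefix_free(table):
    """True if no weight in the table is a proper prefix of another."""
    return not any(is_prefix(a, b) for a in table for b in table)

# Illustrative weights only.
A_BAD = (0x10,)        # exact prefix of B
A_OK = (0x10, 0x02)    # same intended position, but not a prefix of B
B = (0x10, 0x05)
D = (0x20,)

x_bad, x_ok, y = [A_BAD, D], [A_OK, D], [B]

# Intended order: compare weight by weight, so A < B decides x < y.
intended_bad = x_bad < y
intended_ok = x_ok < y

# Order seen by a comparison of the flattened unit sequences.
key_bad = flatten(x_bad) < flatten(y)
key_ok = flatten(x_ok) < flatten(y)
```

With `A_BAD`, the flattened comparison ends up comparing D's unit 0x20 with
B's second unit 0x05 and puts x after y, although A sorts before B. With
`A_OK` both comparisons agree.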

> One could allocate low unit values to the
> start units and high unit values to continuation units. By using high
> values for continuation units, DUCET simplifies the identification.
>

One could pick nearly any range for the trailing units. With the UCA spec
using 16-bit units and only 21 bits to encode in a pair, there is nearly
free choice for the range of trail units.
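
As a rough sketch of that freedom: 21 bits can be split into 6 bits for the
lead unit and 15 bits for the trail unit, and the 15-bit payload can be
placed at almost any offset inside a 16-bit unit. The function names and the
`trail_base` parameter are assumptions for illustration. The default
constants follow the shape of the UCA implicit-weight derivation
(base + (CP >> 15), then (CP & 0x7FFF) | 0x8000), but the base used here is
just an example value.

```python
LEAD_BASE = 0xFBC0           # illustrative base for the lead unit
DEFAULT_TRAIL_BASE = 0x8000  # trail units then fall in 0x8000..0xFFFF

def encode_pair(value, trail_base=DEFAULT_TRAIL_BASE):
    """Split a 21-bit value into a (lead, trail) pair of 16-bit units."""
    if not 0 <= value <= 0x1FFFFF:
        raise ValueError("value must fit in 21 bits")
    if not 1 <= trail_base <= 0x8000:
        raise ValueError("trail range must be nonzero and fit in 16 bits")
    lead = LEAD_BASE + (value >> 15)          # top 6 bits
    trail = trail_base + (value & 0x7FFF)     # low 15 bits
    return lead, trail

def decode_pair(lead, trail, trail_base=DEFAULT_TRAIL_BASE):
    """Inverse of encode_pair for the same trail_base."""
    return ((lead - LEAD_BASE) << 15) | (trail - trail_base)

pair = encode_pair(0x10FFFF)
low_pair = encode_pair(0x10FFFF, trail_base=0x0100)
```

Whatever `trail_base` is chosen, the pairs sort in the same order as the
values they encode, since the lead unit carries the high bits and the trail
unit is a fixed offset plus the low bits.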

markus
Received on Fri Mar 15 2013 - 23:18:02 CDT