Re: Size of Weights in Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Fri, 15 Mar 2013 16:03:57 -0700

On Fri, Mar 15, 2013 at 3:05 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> > In CLDR/ICU's FractionalUCA.txt, all but 40 or so of the primary
> > weights (and many of the secondary weights) use the "large weights"
> > mechanism.
>
> No, they're 32-bit weights expressed by omitting trailing zero bytes.
> More precisely, are they not defined to be fractional weights?
>

You can look at it either way. In string comparison, it's easier to deal
with 32-bit weights, but the current ICU code works with one or two 16-bit
primaries and one to four secondary/tertiary bytes. In sort keys, the
trailing zero bytes are omitted, and the weights become clearly "fractional".
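
A minimal sketch of the two views, in Python. The helper name and the
sample weight values are made up for illustration; they are not taken from
FractionalUCA.txt or from the ICU code. It shows that a 32-bit weight with
its trailing zero bytes dropped sorts the same way as the full integer.

```python
def fractional_bytes(weight32: int) -> bytes:
    """Big-endian bytes of a 32-bit weight, trailing zero bytes dropped."""
    return weight32.to_bytes(4, "big").rstrip(b"\x00")


# Hypothetical 32-bit weights, chosen only to show the byte layout.
weights = [0x2A0C0000, 0x29000000, 0x2A0C0400]

# Integer order and byte-wise order of the shortened forms agree,
# because a dropped trailing zero is the smallest possible byte.
by_int = sorted(weights)
by_bytes = sorted(weights, key=fractional_bytes)
```

Here `fractional_bytes(0x29000000)` is the single byte 0x29, which is what
ends up in a sort key.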

The "fractional" refers to the same kind of mechanism as the "large weight
values" in the UCA spec. The point is that no weight's sequence of units
(8-bit, 16-bit or whatever the implementation uses) can be an exact prefix
of another weight's sequence.
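
To illustrate why the prefix rule matters, here is a small Python sketch.
The tables, byte values and function names are hypothetical, and the "sort
key" is just the concatenated primary bytes, which ignores the level
separators and the secondary/tertiary weights of a real sort key.

```python
from itertools import permutations


def is_prefix_free(weights) -> bool:
    """True if no unit sequence is a prefix of a longer one in the set."""
    return not any(
        len(a) < len(b) and b[: len(a)] == a
        for a, b in permutations(weights, 2)
    )


def sort_key(text: str, table: dict) -> bytes:
    """Concatenate the weight bytes of each character (primary level only)."""
    return b"".join(table[ch] for ch in text)


# In both tables the intended order is x < y < z.
BAD = {"x": b"\x05", "y": b"\x05\x10", "z": b"\x20"}   # x is a prefix of y
GOOD = {"x": b"\x05", "y": b"\x06\x10", "z": b"\x20"}  # no prefixes

# With BAD, "xz" -> 05 20 and "yz" -> 05 10 20, so "xz" sorts after "yz"
# even though x < y. With GOOD, "xz" -> 05 20 and "yz" -> 06 10 20.
bad_order_ok = sort_key("xz", BAD) < sort_key("yz", BAD)
good_order_ok = sort_key("xz", GOOD) < sort_key("yz", GOOD)
```

In the BAD table the second byte of y's weight ends up being compared with
the first byte of z's weight, which is exactly what the prefix rule
prevents.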

markus
Received on Fri Mar 15 2013 - 18:06:14 CDT
