Re: Size of Weights in Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 16 Mar 2013 01:52:35 +0000

On Fri, 15 Mar 2013 16:03:57 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Fri, Mar 15, 2013 at 3:05 PM, Richard Wordingham
> <richard.wordingham_at_ntlworld.com> wrote:
>
> > > In CLDR/ICU's FractionalUCA.txt, all but 40 or so of the primary
> > > weights (and many of the secondary weights) use the "large
> > > weights" mechanism.
> >
> > No, they're 32-bit weights expressed by omitting trailing zero
> > bytes. More precisely, are they not defined to be fractional
> > weights?
> >
>
> You can look at it either way. In string comparison, it's easier to
> deal with 32-bit weights, but the current ICU code works with
> one-or-two 16-bit primaries and one-to-four secondary/tertiary bytes.
> In sort keys, the trailing zeros are omitted, and it becomes clearly
> "fractional".
>
> The "fractional" refers to the same kind of mechanism as the "large
> weight values" in the UCA spec.
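
The two views quoted above can be reconciled in a short sketch. Assuming
weights are held as 32-bit integers, dropping the trailing zero bytes of
the big-endian form gives byte strings that compare in the same order as
the integers. The weight values and function names are made up for
illustration; they are not taken from FractionalUCA.txt or from ICU.

```python
def to_fractional(weight32: int) -> bytes:
    """Big-endian bytes of a 32-bit weight with trailing zero bytes dropped."""
    return weight32.to_bytes(4, "big").rstrip(b"\x00")


def from_fractional(frac: bytes) -> int:
    """Inverse of to_fractional: pad with zero bytes back to 32 bits."""
    return int.from_bytes(frac.ljust(4, b"\x00"), "big")


# Hypothetical weights, chosen only to show one-, two- and three-byte forms.
weights = [0x7B408000, 0x29000000, 0x7B3C0000, 0x2A050000]

as_ints = sorted(weights)
as_fractions = sorted(to_fractional(w) for w in weights)

# Comparing the shortened byte strings gives the same order as comparing
# the full 32-bit integers.
assert as_fractions == [to_fractional(w) for w in as_ints]
assert all(from_fractional(to_fractional(w)) == w for w in weights)
```

This only covers the comparison of individual weights; it says nothing
about how ICU actually stores or compares them.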

Yes. The problem is that formally the UCA clearly treats 'large
weights' as being spread over multiple collation elements, whereas
transforming collation element tables properly requires, in various
places, that they be treated as being in a single collation element.

> The point is that no sequence of
> units (8-bit, 16-bit or whatever the implementation uses) can be an
> exact prefix of another sequence.
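
The prefix condition quoted above can be illustrated with a small
sketch. It assumes 8-bit units, sort keys formed by simply concatenating
weights, and invented weight values; none of the names below come from
ICU or the UCA.

```python
from itertools import permutations


def is_prefix_free(weights: list[bytes]) -> bool:
    """True if no weight is a prefix of a different weight."""
    return not any(b.startswith(a) for a, b in permutations(set(weights), 2))


def sort_key(weights: list[bytes]) -> bytes:
    """Hypothetical sort key: the weights concatenated in order."""
    return b"".join(weights)


# A is a prefix of B, so this set of weights is not prefix-free.
A, B, X = b"\x29", b"\x29\x05", b"\x06"
assert not is_prefix_free([A, B, X])

# A sorts before B, so the sequence [A, X] should sort before [B],
# but the concatenated keys compare the other way round.
assert A < B
assert not sort_key([A, X]) < sort_key([B])

# With a prefix-free assignment the keys agree with the weight order.
B2 = b"\x2a\x05"
assert is_prefix_free([A, B2, X])
assert sort_key([A, X]) < sort_key([B2])
```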

That's only for efficiency. One could allocate low unit values to the
start units and high unit values to continuation units. By using high
values for continuation units, DUCET simplifies the identification
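
The allocation of low values to start units and high values to
continuation units can be sketched as follows. The sketch assumes 8-bit
units and a made-up threshold CONT_MIN separating the two ranges; the
weights are invented and the helper names are hypothetical.

```python
CONT_MIN = 0x80  # hypothetical: units >= CONT_MIN are continuation units


def is_well_formed(weight: bytes) -> bool:
    """One low start unit followed by zero or more high continuation units."""
    return (len(weight) >= 1 and weight[0] < CONT_MIN
            and all(u >= CONT_MIN for u in weight[1:]))


def sort_key(weights: list[bytes]) -> bytes:
    """Hypothetical sort key: the weights concatenated in order."""
    return b"".join(weights)


def split_weights(key: bytes) -> list[bytes]:
    """Recover the weights: every low unit starts a new weight."""
    out: list[bytes] = []
    for u in key:
        if u < CONT_MIN or not out:
            out.append(bytes([u]))
        else:
            out[-1] += bytes([u])
    return out


# A is a prefix of B, yet the keys still sort correctly, because the
# unit that follows A in any key is a start unit and therefore lower
# than B's continuation unit.
A, B, X = b"\x29", b"\x29\xf5", b"\x06"
assert all(is_well_formed(w) for w in (A, B, X))
assert A < B
assert sort_key([A, X]) < sort_key([B])
assert split_weights(sort_key([A, X])) == [A, X]
assert split_weights(sort_key([B])) == [B]
```

In this sketch the weight boundaries in a key can be read off from the
unit values alone, which is why a prefix relation between weights does
no harm here.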

Richard.
Received on Fri Mar 15 2013 - 20:57:06 CDT
