Re: Size of Weights in Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Wed, 13 Mar 2013 13:22:01 -0700

On Wed, Mar 13, 2013 at 11:38 AM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> One of the changes from Version 6.1.0 to 6.2.0 of the UCA (UTS#10)
> was to change weights from being 16 bits to just being general
> non-negative integers. Was this just to accommodate the 4th weight in
> DUCET (scheduled for deletion in Version 6.3.0), or is it intended to do
> away with the inconvenient concept of 'large weights'?
>

Neither. It's because the algorithm has very little to do with how exactly
the weights are stored. For example, ICU logically stores weights as
sequences of 1, 2, 3 or 4 bytes, with collation elements encoded in
interesting ways so that most CEs fit into 32-bit integers.
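
To illustrate the storage point only, here is a minimal Python sketch of
packing one collation element into a 32-bit integer. The 16/8/8 bit split
and the names pack_ce and unpack_ce are assumptions made up for this
example; this is not ICU's actual encoding, which is more involved.

```python
from typing import Optional, Tuple

# Hypothetical layout: 16 bits primary, 8 bits secondary, 8 bits tertiary.
# This split is an assumption for illustration, not ICU's real format.
PRIMARY_BITS, SECONDARY_BITS, TERTIARY_BITS = 16, 8, 8


def pack_ce(primary: int, secondary: int, tertiary: int) -> Optional[int]:
    """Pack three non-negative weights into one 32-bit integer.

    Returns None when any weight does not fit its field, so that the
    caller can fall back to some other (wider) representation.
    """
    if min(primary, secondary, tertiary) < 0:
        raise ValueError("weights must be non-negative")
    if (primary >= 1 << PRIMARY_BITS
            or secondary >= 1 << SECONDARY_BITS
            or tertiary >= 1 << TERTIARY_BITS):
        return None
    return (primary << 16) | (secondary << 8) | tertiary


def unpack_ce(packed: int) -> Tuple[int, int, int]:
    """Recover (primary, secondary, tertiary) from a packed 32-bit value."""
    return (packed >> 16) & 0xFFFF, (packed >> 8) & 0xFF, packed & 0xFF
```

A CE whose weights fit round-trips through one integer; one that does not
fit returns None and needs a different representation, which is the kind
of choice an implementation makes for itself.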

> Previously, each of the four weights could be accommodated in 16, 16,
> 16 and 24 bits. How many bits may be needed for a DUCET collation
> element now?

There is no plan to change how the DUCET is expressed, nor how the weight
examples are written in the UCA spec.

While the algorithm does not depend on the particular weight size, nor on
the particular weight values, it would be hard and confusing to fully write
the spec without ever using concrete numeric examples.
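
As a rough sketch of that independence, the Python below forms a
simplified three-level sort key from weights held as plain non-negative
integers and compares keys lexicographically. The weight values and the
names sort_key and rescale are invented for the example; they are not
DUCET data, and the key formation omits details of the full algorithm
(for instance variable weighting and backward levels).

```python
from typing import List, Tuple

# A collation element as (primary, secondary, tertiary) weights.
# Weights are arbitrary non-negative integers; 0 means "ignorable"
# at that level. All values used with this sketch are made up.
CE = Tuple[int, int, int]


def sort_key(ces: List[CE]) -> List[int]:
    """Build a simplified sort key: nonzero weights level by level,
    with a 0 separator between levels (0 sorts below any real weight)."""
    key: List[int] = []
    for level in range(3):
        if level > 0:
            key.append(0)  # level separator
        key.extend(ce[level] for ce in ces if ce[level] != 0)
    return key


def rescale(ces: List[CE], factor: int) -> List[CE]:
    """Multiply every weight by a positive factor. Zero stays zero and
    relative order is preserved, but values may now exceed 16 bits."""
    return [(p * factor, s * factor, t * factor) for (p, s, t) in ces]
```

Two CE sequences that differ only in a secondary weight compare the same
way before and after rescale(..., 1 << 20), although the rescaled weights
no longer fit in 16 bits. Only the relative order of the weights matters
to the comparison.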

> Are we threatened with having to accommodate 36 bit
> weights?
>

Data structure design is up to each implementation.

markus
Received on Wed Mar 13 2013 - 15:27:05 CDT
