RE: Size of Weights in Unicode Collation Algorithm

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Wed, 13 Mar 2013 21:07:06 +0000

Richard Wordingham wrote:

> One of the changes from Version 6.1.0 to 6.2.0 of the UCA (UTS#10)
> was to change weights from being 16 bits to just being general
> non-negative integers. Was this just to accommodate the 4th weight in
> DUCET (scheduled for deletion in Version 6.3.0), or is it intended to do
> away with the inconvenient concept of 'large weights'?

Amplifying somewhat on Markus' response to these questions...

In UCA 6.1.0, the wording was:

"...where a collation element is an ordered list of three or more 16-bit weights."

In UCA 6.2.0, the wording is:

"...where a collation element is an ordered list of three or more weights (non-negative integers)."

This change had nothing to do with accommodating the 4th weight in DUCET.

It has nothing to do with any putatively inconvenient concept of large weights.

It loosened up the spec, so that the spec itself didn't seem to be requiring that each of the first 3 levels had to be expressed with a full 16 bits in any collation element table.

>
> Previously, each of the four weights could be accommodated in 16, 16,
> 16 and 24 bits. How many bits may be needed for a DUCET collation
> element now? Are we threatened with having to accommodate 36 bit
> weights?

As a matter of convenience in generation and display, the DUCET has always been generated using a 4-digit hex notation for the first 3 levels. So each could be conceived of as a 16-bit number, as the original description of collation elements implied.

But in practice (and by design), the range of secondary and tertiary weights was constrained. You only need 9 bits to express the secondary weights in the table and only 5 bits to express the tertiary weights.

So it is rather straightforward to pack DUCET primary, secondary, and tertiary weights into a single 32-bit collation element, with 2 bits left over for flag bits (or whatever). I've been doing that for years in my implementation. You don't need 48 bits to do that -- 32 works just fine.
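To make the arithmetic concrete (16 + 9 + 5 = 30 bits, leaving 2 of 32), here is a minimal sketch of such a packing. The field order, the position of the flag bits, and the names pack_ce and unpack_ce are assumptions chosen for illustration; the post does not say how the actual implementation lays out its bits.

```python
# Sketch only: the layout below is an assumption, not the layout of any
# particular implementation. Primary occupies the high 16 bits so that
# comparing packed values as unsigned integers compares primaries first.
PRIMARY_BITS = 16
SECONDARY_BITS = 9
TERTIARY_BITS = 5
FLAG_BITS = 2  # the "2 bits left over"


def pack_ce(primary, secondary, tertiary, flags=0):
    """Pack three weights and two flag bits into one 32-bit integer."""
    fields = (
        ("primary", primary, PRIMARY_BITS),
        ("secondary", secondary, SECONDARY_BITS),
        ("tertiary", tertiary, TERTIARY_BITS),
        ("flags", flags, FLAG_BITS),
    )
    packed = 0
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} weight {value:#x} does not fit in {bits} bits")
        packed = (packed << bits) | value
    return packed


def unpack_ce(ce):
    """Return (primary, secondary, tertiary, flags) from a packed value."""
    flags = ce & ((1 << FLAG_BITS) - 1)
    ce >>= FLAG_BITS
    tertiary = ce & ((1 << TERTIARY_BITS) - 1)
    ce >>= TERTIARY_BITS
    secondary = ce & ((1 << SECONDARY_BITS) - 1)
    ce >>= SECONDARY_BITS
    primary = ce & ((1 << PRIMARY_BITS) - 1)
    return primary, secondary, tertiary, flags
```

With this layout, a collation element written in the table as [.15EF.0020.0002] round-trips through pack_ce(0x15EF, 0x0020, 0x0002) and unpack_ce, and any secondary weight above 0x1FF or tertiary weight above 0x1F is rejected rather than silently truncated.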

And no, nobody is "threatening" you or anybody else with "having to accommodate 36 bit weights".

>
> If it is not intended to do away with the 16-bit limit, then the
> introduction to Section 3.0 should revert to describing the weights as
> 16 bits. Otherwise, there is a good deal of text that is wrong or in
> need of overhaul. For example, a value FFFF will not function as
> intended if the smallest explicit positive primary weight is 100,000.

It might make sense to include a note somewhere to indicate that some aspects of the algorithm do implicitly assume that weights cannot exceed 16-bit values without requiring other adjustments to the algorithm. Section 6.2 Large Weight Values already addresses the approach one would take if one needs to deal with more than 64K primary weight values, in a way which does not break the rest of the algorithm.
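As a rough illustration of how a primary weight above 64K can be carried in 16-bit fields, the sketch below splits one large value into a lead and a trail weight, each of which fits in 16 bits. It is patterned on the way UCA derives implicit weights from code points (a base plus the high bits, then the low 15 bits with the top bit set); it is not a transcription of Section 6.2. The constant LEAD_BASE and the function name split_large_primary are assumptions for the example, and a real table would have to reserve the lead range so that it does not collide with ordinary primaries.

```python
# Sketch only: LEAD_BASE is a hypothetical start of a reserved lead range.
LEAD_BASE = 0xFB40


def split_large_primary(value):
    """Split a large non-negative primary into (lead, trail) 16-bit weights.

    Comparing the (lead, trail) pairs in order gives the same result as
    comparing the original large values.
    """
    if value < 0:
        raise ValueError("weights are non-negative integers")
    lead = LEAD_BASE + (value >> 15)       # high bits select the lead weight
    if lead > 0xFFFF:
        raise ValueError("value too large for the reserved lead range")
    trail = 0x8000 | (value & 0x7FFF)      # low 15 bits, top bit always set
    return lead, trail
```

Under these assumptions, a primary weight of 100,000 becomes the pair (0xFB43, 0x86A0), and both halves stay below the 16-bit ceiling that values such as FFFF rely on.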

--Ken

>
> I've not submitted this through formal feedback yet, as my feedback will
> depend on what is intended.
>
> Richard.
Received on Wed Mar 13 2013 - 16:10:30 CDT
