RE: Size of Weights in Unicode Collation Algorithm

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Thu, 14 Mar 2013 00:19:15 +0000

Richard Wordingham wrote:

> > It loosened up the spec, so that the spec itself didn't seem to be
> > requiring that each of the first 3 levels had to be expressed with a
> > full 16 bits in any collation element table.
>
> I don't read it that way. But it did allow the 4th weight to go up to
> 10FFFF! (Last explicit weight in DUCET 6.2.0 is 2A600.)

Actually it didn't "allow" the 4th weight to do anything. The last explicit
weight in DUCET 6.1.0 was already 2A600 for the 4th level. The table was
basically just tacking on the code point, and when the code point was
> 0xFFFF, the value for the fourth weight was also > 0xFFFF. There was
never any intention to constrain it strictly to 16 bits.
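
To make that concrete, here is a minimal sketch of "tacking on the code
point". The function names are made up for illustration and are not part of
UTS #10 or of any DUCET tooling; the only values taken from this thread are
2A600 and the 0xFFFF boundary.

```python
def fourth_weight(code_point: int) -> int:
    """Illustrative only: the 4th-level value is simply the code point."""
    return code_point


def fits_in_16_bits(weight: int) -> bool:
    """True if the weight can be stored in an unsigned 16-bit field."""
    return 0 <= weight <= 0xFFFF


# A BMP code point stays within 16 bits; a supplementary one cannot.
for cp in (0x0041, 0x2A600):
    w = fourth_weight(cp)
    print(f"U+{cp:04X}: fourth weight {w:04X}, fits in 16 bits: {fits_in_16_bits(w)}")
```

Any code point above U+FFFF yields a fourth weight above 0xFFFF, without the
table doing anything special.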

In fact if you go all the way back in the document history to the version
of UCA that was more or less correlated with Unicode 3.0 (which didn't
have *any* supplementary characters defined), the relevant wording in
UTS #10 was just:

"A collation element is an ordered list of three 16-bit weights. (Implementations
can produce the same result without using 16-bit weights...)"

>
> > As a matter of convenience in generation and display, the DUCET has
> > always been generated using a 4 digit hex notation for the first 3
> > levels. So each could be conceived as a 16-bit number, as the
> > original description of collation elements implied.
> >
> > But in practice (and by design), the ranges of secondary and tertiary
> > weights were constrained. You only need 9 bits to express the
> > secondary weights in the table and only 5 bits to express the
> > tertiary weights.
>
> DUCET and the CLDR root are not the only UCETs. I recall nothing that
> stops a tailoring needing more bits for the secondary and tertiary
> weights.

Of course not. And nothing in the wording of UCA prevents that.
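
As a rough sketch of what those bit counts mean for an implementation: the
packing layout below (16 + 9 + 5 bits in one integer) is an assumption chosen
for illustration, not something UCA prescribes, and `pack_weights` and the
sample weights are hypothetical. It shows how a tailoring with a larger
secondary weight would simply not fit such a layout.

```python
def pack_weights(primary: int, secondary: int, tertiary: int,
                 widths: tuple = (16, 9, 5)) -> int:
    """Pack three weights into one integer using the given bit widths.

    Raises ValueError if any weight needs more bits than its field has.
    """
    packed = 0
    for name, weight, width in zip(("primary", "secondary", "tertiary"),
                                   (primary, secondary, tertiary), widths):
        if weight < 0 or weight.bit_length() > width:
            raise ValueError(f"{name} weight {weight:#x} does not fit in {width} bits")
        packed = (packed << width) | weight
    return packed


# Illustrative weights that fit the 16/9/5 layout (30 bits in total).
print(hex(pack_weights(0x1234, 0x20, 0x02)))

# A hypothetical tailored secondary weight that needs 10 bits is rejected.
try:
    pack_weights(0x1234, 0x200, 0x02)
except ValueError as err:
    print(err)
```

An implementation that chose such a layout would have to widen its fields for
that tailoring; the logical algorithm itself is unaffected.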

>
> > And no, nobody is "threatening" you or anybody else with "having to
> > accommodate 36 bit weights".
>
> But I can no longer turn round and say that a 36 bit weight is illegal.

The standard never said that anyway.

The correction in the current text of the standard separates the
description of the format of DUCET, which *does* use three 16-bit fields to
record the three weights for each entry, from the logical description of
tables and the algorithm, which does not depend on any particular bit size
for the weight values.
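
A simplified sketch of that logical view, assuming collation elements are
plain tuples of non-negative integers and a sort key is a list of integers
with 0 as the level separator. This is an illustration of the idea only (no
variable weighting, no backwards levels), and the names and sample weights are
hypothetical.

```python
from typing import List, Sequence, Tuple

CollationElement = Tuple[int, ...]  # (primary, secondary, tertiary), any magnitude


def sort_key(elements: Sequence[CollationElement], levels: int = 3) -> List[int]:
    """Form a sort key level by level, skipping zero weights.

    A 0 is appended between levels as a separator. Nothing here depends on
    how many bits a weight occupies.
    """
    key: List[int] = []
    for level in range(levels):
        if level > 0:
            key.append(0)  # level separator
        for ce in elements:
            weight = ce[level]
            if weight != 0:
                key.append(weight)
    return key


# Ordinary small weights.
print(sort_key([(0x1234, 0x20, 0x02)]))

# Weights needing 36 bits compare just as well in this logical model.
a = sort_key([(0x800000001, 0x20, 0x02)])
b = sort_key([(0x800000002, 0x20, 0x02)])
print(a < b)
```

Whether a given implementation can store such weights is a separate question
from whether the logical description permits them.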

>
> > It might make sense to include a note somewhere to indicate that some
> > aspects of the algorithm do implicitly assume that weights cannot
> > exceed 16-bit values without requiring other adjustments to the
> > algorithm.
>
> I'm listing them at the moment.

O.k.

--Ken
Received on Wed Mar 13 2013 - 19:23:35 CDT