Re: Size of Weights in Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 13 Mar 2013 22:30:20 +0000

On Wed, 13 Mar 2013 21:07:06 +0000
"Whistler, Ken" <ken.whistler_at_sap.com> wrote:

> Richard Wordingham wrote:
>
> > One of the changes from Version 6.1.0 to 6.2.0 of the UCA
> > (UTS#10) was to change weights from being 16 bits to just being
> > general non-negative integers. Was this just to accommodate the
> > 4th weight in DUCET (scheduled for deletion in Version 6.3.0), or
> > is it intended to do away with the inconvenient concept of 'large
> > weights'?

> It has nothing to do with any putatively inconvenient concept of
> large weights.

'Large weights' make it difficult (I don't say impossible) to check
UCETs for well-formedness.
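
As a rough illustration of the difficulty, here is a minimal Python sketch of a naive check for one well-formedness condition, paraphrased from memory as "no collation element has a zero weight at level N and a non-zero weight at level N-1". The function name and the sample collation elements are hypothetical, and the second element only imitates the shape of a large-weight continuation element (non-zero primary, zero secondary and tertiary); none of the values come from DUCET.

```python
def violates_zero_after_nonzero(ce):
    """Return True if some level is zero while the level before it is non-zero.

    `ce` is a tuple of weights (primary, secondary, tertiary).
    """
    return any(prev != 0 and cur == 0 for prev, cur in zip(ce, ce[1:]))


# Hypothetical values, for illustration only.
ordinary = (0x15EF, 0x0020, 0x0002)
continuation = (0x8000, 0x0000, 0x0000)  # shaped like the 2nd half of a large weight
secondary_only = (0x0000, 0x0021, 0x0002)

print(violates_zero_after_nonzero(ordinary))        # False
print(violates_zero_after_nonzero(continuation))    # True
print(violates_zero_after_nonzero(secondary_only))  # False
```

A checker this simple flags the continuation element, so a real one would have to recognise large-weight pairs and exempt them, which is where the awkwardness comes in.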

> It loosened up the spec, so that the spec itself didn't seem to be
> requiring that each of the first 3 levels had to be expressed with a
> full 16 bits in any collation element table.

I don't read it that way. But it did allow the 4th weight to go up to
10FFFF! (Last explicit weight in DUCET 6.2.0 is 2A600.)
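
For scale, a quick Python check of how many bits the two values quoted above occupy. The variable names are just for illustration; only the two hex values come from the text.

```python
# Values quoted in the message above.
fourth_weight_max = 0x10FFFF  # highest code point, the ceiling for a 4th weight
last_explicit = 0x2A600       # last explicit 4th weight cited for DUCET 6.2.0

print(last_explicit.bit_length())      # 18
print(fourth_weight_max.bit_length())  # 21

# Neither value fits in 16 bits.
print(last_explicit > 0xFFFF, fourth_weight_max > 0xFFFF)  # True True
```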

> As a matter of convenience in generation and display, the DUCET has
> always been generated using a 4 digit hex notation for the first 3
> levels. So each could be conceived as a 16-bit number, as the
> original description of collation elements implied.
>
> But in practice (and by design), the ranges of secondary and tertiary
> weights were constrained. You only need 9 bits to express the
> secondary weights in the table and only 5 bits to express the
> tertiary weights.

DUCET and the CLDR root are not the only UCETs. I recall nothing that
stops a tailoring needing more bits for the secondary and tertiary
weights.
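
To make the point concrete, here is a small Python sketch that measures how many bits each level needs for a given set of collation elements and tests them against the 16/9/5-bit budget mentioned above. The helper names and all the sample weights are made up for illustration; they are not taken from DUCET or from any real tailoring.

```python
def bits_needed(elements):
    """Return, per level, the bits required to hold the largest weight."""
    levels = len(elements[0])
    return [max(ce[i] for ce in elements).bit_length() for i in range(levels)]


def fits(elements, budget):
    """True if every level's largest weight fits in the budgeted bit width."""
    return all(n <= b for n, b in zip(bits_needed(elements), budget))


BUDGET = (16, 9, 5)  # primary, secondary, tertiary widths discussed above

# Hypothetical collation elements (primary, secondary, tertiary).
untailored = [
    (0x15EF, 0x0020, 0x0008),
    (0x0000, 0x01FF, 0x0002),  # secondary just fits in 9 bits
    (0x15EF, 0x0020, 0x001F),  # tertiary just fits in 5 bits
]
# A hypothetical tailoring that adds one element with larger weights.
tailored = untailored + [(0x15EF, 0x0200, 0x0020)]

print(bits_needed(untailored), fits(untailored, BUDGET))  # [13, 9, 5] True
print(bits_needed(tailored), fits(tailored, BUDGET))      # [13, 10, 6] False
```

An implementation that packs secondary and tertiary weights into 9 and 5 bits would therefore need to verify each tailored table rather than assume the DUCET ranges.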

> And no, nobody is "threatening" you or anybody else with "having to
> accommodate 36 bit weights".

But I can no longer turn round and say that a 36-bit weight is illegal.

> It might make sense to include a note somewhere to indicate that some
> aspects of the algorithm do implicitly assume that weights cannot
> exceed 16-bit values without requiring other adjustments to the
> algorithm.

I'm listing them at the moment.

> Section 6.2 Large Weight Values already addresses the
> approach one would take if one needs to deal with more than 64K
> primary weight values, in a way which does not break the rest of the
> algorithm.

You've just reminded me that the 'escape hatch' is broken for secondary
weights. It seems a shame to me that one can't parametrically tailor
DUCET to give a rhyming dictionary sort.

Richard.
Received on Wed Mar 13 2013 - 17:37:37 CDT