Re: Size of Weights in Unicode Collation Algorithm

From: Richard Wordingham
Date: Fri, 15 Mar 2013 22:05:06 +0000

On Fri, 15 Mar 2013 13:52:39 -0700
Markus Scherer wrote:

> On Fri, Mar 15, 2013 at 12:50 PM, Richard Wordingham wrote:

> > Not quite. The characterisation of variable weights knows nothing
> > of the concept, and that is the problem.
>
> That's a problem in some implementations, but not a problem with the
> concept. Nothing prevents you from defining a variableTop that
> contains a string of "large weights", and comparing that lexically
> with the string of "large weights" that you look up for a character
> or substring. In fact, that's really what ICU does, except the
> current code is limited to one-or-two units (bytes).

I would say that UCA Section 6.2 stops me. It clearly says that
the generic example '[(X+1).zzzz.wwww], [yyyy.0000.0000]' is two
collation elements, not one.

Now, if I used allkeys_CLDR.txt as a convenient expression of
FractionalUCA.txt rather than in its own right, I might now be able to
argue that the large weights were just a convenient internal
representation of a 32-bit weight.

A possible argument is that although a tailoring has to be defined
by a 'well-defined syntax' (What syntax defines FractionalUCA.txt?
Is it 'Use this instead'?), there is no requirement that this syntax
has to have well-defined semantics. So, if the string specified by
variableTop has a primary starting with a 'large weight', I could
interpret that to mean that the 2-element large weights are to be
converted to 32-bit weights. Does anyone accept this argument?
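
As a rough illustration of that reading, here is a minimal Python
sketch. It assumes 16-bit primaries and a reserved range of lead
primaries marking "large weights"; the range bounds and the names
LARGE_LEAD_MIN, LARGE_LEAD_MAX and merge_large_weights are
hypothetical, chosen only for the example, and are not taken from the
UCA text or from ICU.

```python
from typing import List, Tuple

# A collation element as (primary, secondary, tertiary).
CE = Tuple[int, int, int]

# Hypothetical reserved range of lead primaries (an assumption for this sketch).
LARGE_LEAD_MIN = 0xFB00
LARGE_LEAD_MAX = 0xFFFD


def is_large_lead(primary: int) -> bool:
    """True if the 16-bit primary falls in the assumed reserved lead range."""
    return LARGE_LEAD_MIN <= primary <= LARGE_LEAD_MAX


def merge_large_weights(ces: List[CE]) -> List[CE]:
    """Reinterpret each pair [(X+1).zzzz.wwww], [yyyy.0000.0000] as one
    collation element with a 32-bit primary; widen ordinary 16-bit
    primaries to the same scale by shifting them into the high half."""
    out: List[CE] = []
    i = 0
    while i < len(ces):
        p, s, t = ces[i]
        has_trail = (
            i + 1 < len(ces)
            and ces[i + 1][1] == 0
            and ces[i + 1][2] == 0
        )
        if is_large_lead(p) and has_trail:
            trail = ces[i + 1][0]
            out.append(((p << 16) | trail, s, t))
            i += 2  # the trail element has been absorbed into the lead
        else:
            out.append((p << 16, s, t))
            i += 1
    return out
```

With this reading a variableTop test becomes a single integer
comparison on the merged primary, which is exactly the step that
Section 6.2, taken literally, does not license.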

> In CLDR/ICU's FractionalUCA.txt, all but 40 or so of the primary
> weights (and many of the secondary weights) use the "large weights"
> mechanism.

No, they're 32-bit weights expressed by omitting trailing zero bytes.
More precisely, are they not defined to be fractional weights?
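
A minimal Python sketch of that reading, assuming a fractional primary
is a sequence of one to four non-zero bytes. The helper name to_u32
and the sample byte values are made up for the example. Padding with
trailing zero bytes gives a 32-bit integer whose numeric order agrees
with the byte-wise order of the original sequences.

```python
def to_u32(weight: bytes) -> int:
    """Pad a 1-4 byte fractional weight with trailing zero bytes and
    read it as a big-endian 32-bit integer."""
    if not 1 <= len(weight) <= 4:
        raise ValueError("expected 1 to 4 bytes")
    if 0 in weight:
        # Assumption of this sketch: zero never occurs inside a weight,
        # so the padding cannot collide with a real byte.
        raise ValueError("zero byte inside a fractional weight")
    return int.from_bytes(weight.ljust(4, b"\x00"), "big")


# Made-up sample weights of different lengths.
samples = [bytes.fromhex(h) for h in ("7C", "2A0204", "29", "2A03", "2A02")]

# Byte-wise (lexicographic) order and padded-integer order coincide.
by_bytes = sorted(samples)
by_value = sorted(samples, key=to_u32)
```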

Received on Fri Mar 15 2013 - 17:09:37 CDT
