Re: Size of Weights in Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 14 Mar 2013 20:09:33 +0000

On Thu, 14 Mar 2013 00:19:15 +0000
"Whistler, Ken" <ken.whistler_at_sap.com> wrote:

> What is being corrected in the current text of the standard is
> separating the description of the format of DUCET, which *does* use 3
> 16-bit fields to record the 3 weights for each entry, from the
> logical description of tables and the algorithm, which does not
> depend on any particular bit size for the weight values.

Actually, there is a subtle and nasty difference, but probably one that
will very rarely strike practical use. Its most obvious manifestation
is in the application of the UCA parametric tailoring
topVariable="u2FD5". U+2FD5 KANGXI RADICAL FLUTE is the last symbol in
UnicodeData.txt by collating order and has a compatibility
decomposition to U+9FA0 and therefore the same primary weights.
Although I can't find a clear official definition of the semantics of
'topVariable', I do remember being told that it simply uses the first
positive primary in the collation key as the maximum variable weight.
Now in allkeys.txt, U+2FD5 expands to two collation elements. However,
in FractionalUCA.txt, which specifies 32-bit (fractional) weights, it
has a single collation element. Consequently, the effect of this
tailoring will be different depending on how the collation elements are
expressed!
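
The discrepancy can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the behaviour of any real
implementation: the two 16-bit primaries follow the implicit-weight
derivation in UTS #10 for a core Han ideograph (assuming base 0xFB40
applies to U+9FA0), the 32-bit value is a hypothetical stand-in made by
concatenating them (it is NOT the actual FractionalUCA.txt weight), and
the helper names and the simplified lead-primary test are invented for
the example.

```python
def implicit_primaries(cp, base=0xFB40):
    """Two 16-bit primaries for a code point, following the UTS #10
    implicit-weight derivation (base 0xFB40 assumed for core Han)."""
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return [aaaa, bbbb]


def top_variable_threshold(primaries):
    """First positive primary in the key: the reading of topVariable
    recalled in the message above."""
    return next(p for p in primaries if p > 0)


def is_variable(lead_primary, threshold):
    """Simplified test: a non-zero lead primary at or below the threshold."""
    return 0 < lead_primary <= threshold


# U+2FD5 has the primaries of U+9FA0 (compatibility decomposition).
two_element = implicit_primaries(0x9FA0)
# Hypothetical single "large" weight: the two primaries concatenated.
single_element = [(two_element[0] << 16) | two_element[1]]

threshold_16 = top_variable_threshold(two_element)
threshold_32 = top_variable_threshold(single_element)

# A character sorting just after U+9FA0, here U+9FA1.
next_16 = implicit_primaries(0x9FA1)
next_32 = (next_16[0] << 16) | next_16[1]

variable_in_16_bit_form = is_variable(next_16[0], threshold_16)
variable_in_32_bit_form = is_variable(next_32, threshold_32)

print(hex(threshold_16), hex(threshold_32))
print(variable_in_16_bit_form, variable_in_32_bit_form)
```

With the two-element form the threshold is only the leading 16-bit
primary, so every ideograph sharing that lead weight falls at or below
it; with the single large weight the threshold is the full value and
U+9FA1 lies above it.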

For what it is worth, I think the interpretation based on 32-bit
weights is more natural. The natural solution is to treat 'large
weights' as being composed of an integer part and a fractional part for
the purposes of variable weighting.

Richard.
Received on Thu Mar 14 2013 - 15:12:28 CDT