Re: Size of Weights in Unicode Collation Algorithm

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 16 Mar 2013 21:58:02 +0100

2013/3/16 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
> On Sat, 16 Mar 2013 09:29:07 -0700
> Markus Scherer <markus.icu_at_gmail.com> wrote:
>
>> On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham <
>> richard.wordingham_at_ntlworld.com> wrote:
>>
>> > Please give an example of how the low/high split would fail. With
>> > the primary collation weights 20, 21, 21 80 and 22 I get the
>> > following primary collation weight sequences for one and two
>> > collating elements, marking boundaries of collating elements with
>> > commas:
>> >
>>
>> The problem is that if you have 21 and 21 80, and another primary
>> starts with 80, you can't distinguish the sequence 21 | 80 from the
>> one weight 21 80.
>
> But with the low/high split scheme, start units have to have low values
> (e.g. 20, 21 & 22) and continuation units have high values (e.g. 80)
> just to stop this very problem.

Actually no, this is not enough. The scheme cannot just be start vs.
continuation, but non-final vs final. The encoding of weights must be
done so that any encoded weight MUST NOT be a prefix of another
encoded weight.

So if two distinct weights are encoded as 21 vs. 21 80, this encoding
is broken. But the encoding will be valid if they are encoded as 80
vs. 21 80.

There are an infinite number of ways to binary-encode the weights,
they do not even need to be based on sequences of bytes. In fact you
could as well use variable-length sequences of bits such as Huffman
encoding or even better with arithmetic coding, based on a static
dictionnary.

Huffman encoding of weights is simpler to implement, but arithmetic
encoding is not significantly more complex to implement but will
generate collation keys that have the shortest lengths (in both cases,
this only depends on the static predictive dictionary containing
frequencies of use of distinct weights).
Received on Sat Mar 16 2013 - 16:01:54 CDT

This archive was generated by hypermail 2.2.0 : Sat Mar 16 2013 - 16:01:55 CDT