From: Philippe Verdy <verdy_p_at_wanadoo.fr>

Date: Sat, 16 Mar 2013 21:58:02 +0100

Date: Sat, 16 Mar 2013 21:58:02 +0100

2013/3/16 Richard Wordingham <richard.wordingham_at_ntlworld.com>:

*> On Sat, 16 Mar 2013 09:29:07 -0700
*

*> Markus Scherer <markus.icu_at_gmail.com> wrote:
*

*>
*

*>> On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham <
*

*>> richard.wordingham_at_ntlworld.com> wrote:
*

*>>
*

*>> > Please give an example of how the low/high split would fail. With
*

*>> > the primary collation weights 20, 21, 21 80 and 22 I get the
*

*>> > following primary collation weight sequences for one and two
*

*>> > collating elements, marking boundaries of collating elements with
*

*>> > commas:
*

*>> >
*

*>>
*

*>> The problem is that if you have 21 and 21 80, and another primary
*

*>> starts with 80, you can't distinguish the sequence 21 | 80 from the
*

*>> one weight 21 80.
*

*>
*

*> But with the low/high split scheme, start units have to have low values
*

*> (e.g. 20, 21 & 22) and continuation units have high values (e.g. 80)
*

*> just to stop this very problem.
*

Actually no, this is not enough. The scheme cannot just be start vs.

continuation, but non-final vs final. The encoding of weights must be

done so that any encoded weight MUST NOT be a prefix of another

encoded weight.

So if two distinct weights are encoded as 21 vs. 21 80, this encoding

is broken. But the encoding will be valid if they are encoded as 80

vs. 21 80.

There are an infinite number of ways to binary-encode the weights,

they do not even need to be based on sequences of bytes. In fact you

could as well use variable-length sequences of bits such as Huffman

encoding or even better with arithmetic coding, based on a static

dictionnary.

Huffman encoding of weights is simpler to implement, but arithmetic

encoding is not significantly more complex to implement but will

generate collation keys that have the shortest lengths (in both cases,

this only depends on the static predictive dictionary containing

frequencies of use of distinct weights).

Received on Sat Mar 16 2013 - 16:01:54 CDT

*
This archive was generated by hypermail 2.2.0
: Sat Mar 16 2013 - 16:01:55 CDT
*