Re: Size of Weights in Unicode Collation Algorithm

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 16 Mar 2013 22:29:34 +0100

> 2013/3/16 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
>> But with the low/high split scheme, start units have to have low values
>> (e.g. 20, 21 & 22) and continuation units have high values (e.g. 80)
>> just to stop this very problem.

Note also that all techniques used for data compression can be applied here as well.

For example, frequent sequences of weights could also use a predictive
encoding model, notably when creating collation keys for strengths 2
or higher, because there will be very frequent sequences of identical
secondary weights.

So instead of encoding the same secondary weights independently, you
could represent a sequence of two identical secondaries by a single
code. Predictive models include dictionary-based compression schemes
such as LZW, which can also be combined with Huffman or arithmetic
coding to compress these sequences into the smallest number of bits
in the final collation key.
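To make the idea concrete, here is a minimal sketch of classic LZW applied to a sequence of secondary weights. The weight values (0x20 as a common "no accent" secondary, 0x21/0x22 as accent weights) and the alphabet are illustrative assumptions, not taken from any real collation table:

```python
def lzw_compress(weights, alphabet):
    # Seed the dictionary with every single-weight sequence.
    dictionary = {(w,): i for i, w in enumerate(alphabet)}
    result = []
    current = ()
    for w in weights:
        candidate = current + (w,)
        if candidate in dictionary:
            # Keep extending the current match.
            current = candidate
        else:
            # Emit the longest known match, learn the new sequence.
            result.append(dictionary[current])
            dictionary[candidate] = len(dictionary)
            current = (w,)
    if current:
        result.append(dictionary[current])
    return result

# A long run of identical secondary weights collapses into far
# fewer output codes than input weights, because the dictionary
# quickly learns longer and longer runs.
weights = [0x20] * 12 + [0x21, 0x20, 0x20]
codes = lzw_compress(weights, alphabet=[0x20, 0x21, 0x22])
assert len(codes) < len(weights)
```

The emitted codes could then be fed to a Huffman or arithmetic coder, as suggested above, to squeeze them into a near-minimal number of bits.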

The result of these optimization steps is that collation keys could be
the same size as the original plain text, or even shorter, with the
same average size as the result of data-compressing the plain text (if
you have first trained your encoding on a significant corpus of text,
from which you have precomputed the predictive model that will finally
be used to encode all collation keys).

In such a scheme you don't even need to make a distinction between
weights for different collation levels. For example, if two distinct
collation elements have only primary differences, and no other
collation element can sort between them, then their secondary
differences do not matter, and zeroes no longer need to be encoded
(the compression scheme will then generate zero bits for these
secondaries): the LZW algorithm would detect this implicitly.
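There is a simple information-theoretic way to see why such secondaries cost nothing: a level whose weights never vary has zero entropy, so an entropy coder (arithmetic coding in the ideal case) spends essentially zero bits on it. A small illustration, with hypothetical weight values:

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(seq):
    # Shannon entropy of the empirical symbol distribution:
    # the lower bound (bits/symbol) for any entropy coder.
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

# A constant run of secondary weights carries no information at all.
assert entropy_bits_per_symbol([0x20] * 100) == 0.0

# Two equally frequent weights cost one bit each, as expected.
assert entropy_bits_per_symbol([0x20, 0x21] * 50) == 1.0
```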

The only complexity is in how you precompute the dictionary needed for
the predictive model, and in the number of entries it will contain
(because this dictionary will need to be static, not dynamic, as
collation keys must still be computable separately from separate
occurrences of plain text).
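One way this static-dictionary constraint could be met (a sketch with hypothetical function names, not any standard API) is to run the LZW learning pass once over a training corpus, freeze the resulting dictionary, and then compress every collation key by greedy longest-match against that frozen table, never adding entries at compression time:

```python
def train_dictionary(corpus, alphabet, max_entries=4096):
    # One-time learning pass: grow the dictionary as plain LZW
    # would, but over a training corpus of weight sequences.
    dictionary = {(w,): i for i, w in enumerate(alphabet)}
    for weights in corpus:
        current = ()
        for w in weights:
            candidate = current + (w,)
            if candidate in dictionary:
                current = candidate
            else:
                if len(dictionary) < max_entries:
                    dictionary[candidate] = len(dictionary)
                current = (w,)
    return dictionary

def compress_static(weights, dictionary):
    # Greedy longest-match against the frozen dictionary; no new
    # entries are added, so each key compresses reproducibly on
    # its own, independently of any other text.
    result = []
    i = 0
    while i < len(weights):
        j = i + 1
        while j < len(weights) and tuple(weights[i:j + 1]) in dictionary:
            j += 1
        result.append(dictionary[tuple(weights[i:j])])
        i = j
    return result

# Train once on a corpus dominated by the common secondary weight,
# then compress a fresh key with the frozen dictionary.
frozen = train_dictionary([[0x20] * 8], alphabet=[0x20, 0x21])
key = compress_static([0x20] * 8, frozen)
assert len(key) < 8
```

Because the dictionary is fixed ahead of time, two keys produced on different machines (or at different times) from the same weights come out identical, which is exactly the separability requirement described above.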
Received on Sat Mar 16 2013 - 16:32:19 CDT

This archive was generated by hypermail 2.2.0 : Sat Mar 16 2013 - 16:32:20 CDT