From: Philippe Verdy <verdy_p_at_wanadoo.fr>

Date: Wed, 7 Sep 2011 05:14:51 +0200

Date: Wed, 7 Sep 2011 05:14:51 +0200

I've just tried this idea, and this is working magically as I wanted.

It generates extremely compact collation keys, notably if I no longer

restrict collation levels so that they use a limited range: instead I

explicitly use a level separator, to which I assign the final range of

rationals [7/16 .. 16/16], that I can represent as 4 bits (1111).

Then each level can use the collation weights in the half-open range

[0/16 .. 7/16[, which can be splitted in as many elements as needed. I

have also used the [0/16 .. 1/16[ half-open range for the weight

representing ignorable elements: this will use 4 bits(0000) in the

collation keys. This leaves the half-open range [1/16 .. 7/16[ for

representing all non-ignorable weights for this level.

For the primary level, I can predefined fixed ranges of rationals for

representing variable elements, whitespaces, symbols, punctuations,

numbers, and then letters for each script (also with fixed ranges for

each script).

Loading tailorings becomes extremely easy when using rationals, the

algorithms become MUCH simpler.

And finally, once you have assigned rationals to all collation

elements for a specific tailoring (including for the DUCET itself), it

is an extremely easy transform to infer integer weights. This means

that you no longer need to use gaps in the sequences of default

weights, or to given them special meanings. But my opinion is that

this representation using integers is much more difficult to compress

in collation keys.

In fact, in a specific locale, where collation elements have some good

estimation of the frequency of use of collation elements, it is easy

to assign them an associated fixed rational, and to place all other

collation elements by spliting the intervals as needed to fit. And you

get extremely compact collation keys represented eigher by Huffman

encoding (or by arithmetic codding if you have a licence for it from

IBM, but this does not reduces much the length of collation keys).

Philippe.

2011/9/7 Philippe Verdy <verdy_p_at_wanadoo.fr>:

*> 2011/8/31 <announcements_at_unicode.org>:
*

*>> PRI #203 Proposed Update for UTS #10: Unicode Collation Algorithm
*

*>> http://www.unicode.org/review/pri203/
*

*>
*

*> I just have had an idea, which may be interesting. Currently, the
*

*> collation weights defined in the DUCET for each level are maintained
*

*> as a list of integers only.
*

*>
*

*> The main problem is that at each revision of the UCS (when characters
*

*> get added), or for the need to correct the relative weights of
*

*> characters, the collation weights need to be renumbered (notably for
*

*> the primary level, but in fact as well for the subsequent levels, if
*

*> there's a need to make more numeric space available for higher
*

*> collation levels that normally have lower collation weight values).
*

*>
*

*> Why not using a representation not based on integers, but instead on
*

*> *rationals* ? We can indefinitely split a rational interval without
*

*> having to renumber all other nearby collation elements at the same or
*

*> lower collation levels.
*

*>
*

*> And when designing a tailoring, instead of complex renumbering
*

*> strategies, we can as well subdivide intervals into as many rational
*

*> subintervals as we want (just choose the denominator of the inserted
*

*> rationals as the product of the denominator D for the existing level
*

*> you want to split, by the number N of (element to insert, plus one),
*

*> and then you can insert between the rational weights (x)/(D) and
*

*> (X+1)/(D), whose value does not have to be changed, the (N-1) new
*

*> distinct rational collation weights (XN+1)/(ND) to (XN+N-1)/(ND).
*

*>
*

*> Let's just remember that the absolute value of collation weights does
*

*> not matter, only their relative order in the sequence for any specific
*

*> collation level (as well, if collation weights are allocated so that
*

*> you don't need to insert a level separator in the generated collation
*

*> keys, the set of collation weights for level N+1 just has to be fully
*

*> lower than the set of collation weights for level N. And rational
*

*> numbers perfectly fit this need.
*

*>
*

*> Once you have computed all rationals for all levels needed for a
*

*> specific tailoring (or even for the DUCET) a very basic integer
*

*> counting of the ordered sequence of rationals can convert them into
*

*> compact integers that are then easy to insert in collation keys.
*

*>
*

*> Using rationals would also have some benefits: it allows compression
*

*> according to statistics of use: you can start by definining an
*

*> interval of integers (i.e. rationals with denominator 1) from 0 to
*

*> 255: you'll split the sequence of all collation weights according to
*

*> their relative frequency. For more collation weights between two
*

*> elements, count the number you need to insert, compute the new
*

*> denominator, and insert them there.
*

*>
*

*> More formally, you can just start by the rational inclusive interval
*

*> [0/1,1/1] which contains the full sequence of all possible collation
*

*> weights. The position at which you'll split this interval is
*

*> arbitrary, so you can use technics similar to the Arithmetic coding
*

*> (used for data compression), or to the Huffmann coding, to split it
*

*> conveniently with a varying number of bits for each subdivision, based
*

*> on the estimated statistics of use.
*

*>
*

*> With this concept, you'll be able to build extremely compact collation
*

*> keys that will use variable numbers of bits): just append those
*

*> sequences of bits, level by level for all collation elements, then add
*

*> arbitrary padding null bits to fill the collation key into an integer
*

*> number of bytes if you want.
*

*>
*

*> In addition, you'll be able to define *very stable* collation weights
*

*> in the DUCET, now represented as rationals instead of just integers.
*

*>
*

*> Notably, you'll be able to assign *permanent* collation weights for
*

*> the pseudo-collation elements that represent the script separation.
*

*>
*

*> What do you think of this marvelous idea ?
*

*>
*

*> --Philippe.
*

*>
*

Received on Tue Sep 06 2011 - 22:17:59 CDT

*
This archive was generated by hypermail 2.2.0
: Tue Sep 06 2011 - 22:17:59 CDT
*