Re: PRI #203: Proposed Update UTS #10: Unicode Collation Algorithm

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 7 Sep 2011 04:39:21 +0200

2011/8/31 <announcements_at_unicode.org>:
> PRI #203    Proposed Update for UTS #10: Unicode Collation Algorithm
> http://www.unicode.org/review/pri203/

I just have had an idea, which may be interesting. Currently, the
collation weights defined in the DUCET for each level are maintained
as a list of integers only.

The main problem is that at each revision of the UCS (when characters
get added), or for the need to correct the relative weights of
characters, the collation weights need to be renumbered (notably for
the primary level, but in fact as well for the subsequent levels, if
there's a need to make more numeric space available for higher
collation levels that normally have lower collation weight values).

Why not using a representation not based on integers, but instead on
*rationals* ? We can indefinitely split a rational interval without
having to renumber all other nearby collation elements at the same or
lower collation levels.

And when designing a tailoring, instead of complex renumbering
strategies, we can as well subdivide intervals into as many rational
subintervals as we want (just choose the denominator of the inserted
rationals as the product of the denominator D for the existing level
you want to split, by the number N of (element to insert, plus one),
and then you can insert between the rational weights (x)/(D) and
(X+1)/(D), whose value does not have to be changed, the (N-1) new
distinct rational collation weights (XN+1)/(ND) to (XN+N-1)/(ND).

Let's just remember that the absolute value of collation weights does
not matter, only their relative order in the sequence for any specific
collation level (as well, if collation weights are allocated so that
you don't need to insert a level separator in the generated collation
keys, the set of collation weights for level N+1 just has to be fully
lower than the set of collation weights for level N. And rational
numbers perfectly fit this need.

Once you have computed all rationals for all levels needed for a
specific tailoring (or even for the DUCET) a very basic integer
counting of the ordered sequence of rationals can convert them into
compact integers that are then easy to insert in collation keys.

Using rationals would also have some benefits: it allows compression
according to statistics of use: you can start by definining an
interval of integers (i.e. rationals with denominator 1) from 0 to
255: you'll split the sequence of all collation weights according to
their relative frequency. For more collation weights between two
elements, count the number you need to insert, compute the new
denominator, and insert them there.

More formally, you can just start by the rational inclusive interval
[0/1,1/1] which contains the full sequence of all possible collation
weights. The position at which you'll split this interval is
arbitrary, so you can use technics similar to the Arithmetic coding
(used for data compression), or to the Huffmann coding, to split it
conveniently with a varying number of bits for each subdivision, based
on the estimated statistics of use.

With this concept, you'll be able to build extremely compact collation
keys that will use variable numbers of bits): just append those
sequences of bits, level by level for all collation elements, then add
arbitrary padding null bits to fill the collation key into an integer
number of bytes if you want.

In addition, you'll be able to define *very stable* collation weights
in the DUCET, now represented as rationals instead of just integers.

Notably, you'll be able to assign *permanent* collation weights for
the pseudo-collation elements that represent the script separation.

What do you think of this marvelous idea ?

--Philippe.
Received on Tue Sep 06 2011 - 21:44:55 CDT

This archive was generated by hypermail 2.2.0 : Tue Sep 06 2011 - 21:44:57 CDT