Re: PRI #203: Proposed Update UTS #10: Unicode Collation Algorithm

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 7 Sep 2011 06:52:09 +0200

I expect to publish soon a completely different way to represent the
DUCET in a way compatible with the rationals representation. In fact
this new dataset will be FULLY compatible with it (or also with the
CLDR root derived-DUCET), without even needing to specify in the data
*any* numeric value (say good bye to *all* numeric weights in this
datafile, and as well to the existing very complex representation in
CLDR, which needlessly tries to assign fixed collation levels).

My representation will in fact ONLY contain ordered lists of collation
elements. The primary level will contain only 10 pseudo-elements:
<ignorables>,<variables>,<whitespaces>,<punctuations>,<symbols>,<currencyunits>,<numbers>,<letters><privateuses>,<levelterminator>

The secondary level will list only the "leaders" in the sequences of
collation elements, listed after the 11 pseudo-elements above. But New
pseudo elements will be added for scripts in the subsequence between
<letters> and <privateuses>.

The ternary level will list only the existing "leaders" for currently
differentiated at the primary level in the existing DUCET.

The format will allow specifying arbitrary rational values to some
collation elements (in ascending order listed at each level), based on
estimated statistics of use of the collation elements in the sequence
(it will be in fact given as a percentage). All other collation
elements will be assigned rational values algorithmically. Typically,
a tailoring could change those statistics to compute themselves a good
rational approximation of the statistics for the fixed elements, and
then will assign the other unspecified weights, using a linear
distribution in the rational range delimited by collation elements
with assigned statistics. Those statistics will not need to be
specified for all collation elements, only for a few significant
subranges of collation elements in the sequence.

Pseudo-elements in the sequences collation elements will be
automatically assigned an unmodifiable statistics of "EPSILON", i.e.
the smallest rational representable =1/N, where N is any integer
between the number of collation elements and pseudo-elements in the
subsequence, and the largest acceptable integer supported by the
implementation). If a tailoring assigns a statistics lower than this
EPSILON value, it will be automatically increased to be at least this
value, and the sum of statistics for all collation elements in the
subsequence will then be checked to make sure that it does not exceed
100%, after temporarily assigning this epsilon value to all collation
elements with unspecified statistics; if this 100% is exceeded, the
EPSILON value will be reduced and all other statistics will be
proportionally adjusted.

This makes all subsequences conforming so that each one can get a well
defined proportional represetnation as a subsegment of rationals over
the hald-open [0/1..1/1[ range of rationals.

I just decided to remove the last boundary 1/1, it is not necessary
and later complicates the conversion to bitstrings in the Huffman or
arithmetic conversion used when generating collation keys.

But when only comparing pairs of strings, where no collation key is
used or needed, we can just compare the rational weights without
generating Huffman or arithmetic bitstrings, which is MUCH faster, and
there's no need in this case to use the statistics. The rationals
don't have to be tuned per locale to match their frequencies of use
for just comparing texts.That's why their specification will be
unspecified for the use in the DUCET or CLDR root (all the rational
weights will be generated by a very basic algorithmic counting the
number of collation elements in each subsequence to compute their
common denominator, and then assigning numerators as an integer
sequence).

It will be implied that the first element in the sequence for a given
elevel, will be used to represent all ignorable collation elements at
this level, and the last will be used for the level terminator (the
only pseudo-element to keep in this sequence, and that you won't even
need to specify in the data set).

So the data set will consist only in lists of collation elements (for
example as a comma-separated list, the separation could as well be
newlines, or "<" or just spaces or tabs to allow commenting the list
after a "#" sharp sign up to the newline), and nothing else. Using a
syntax that does need to specify the level number between each
element, it will be MUCH more compact than even the existing
"abbreviated" syntax used in LDML. No complex parser except for the
isolated collation elements themselves.

There will be no "reset" at all, a single reset will be implied at the
begining of each sequence for a given level N, or if a collation
element matches <levelterminator>. It will be implicit that levels
will need to be specified separately in ascending order, and it will
not be necessary to assign numeric values to these levels, only
possibly symbolic names (such as "accents" or "cases"), if one want to
swap levels in a subsequent tailoring.

With some help, I could be able to demonstrate a basic implementation
in Java with almost no use of object programming, about how to infer
the rational weights (for the simplest case where you don't want to
tune the statistics to make generated collation keys optimally
compressed with Huffmann or arithmetic coding), from the simpler data
file (optionally but later, the code would demonstrate how to use the
existing DUCET format (using the integer numeric values exposed there,
as a way to define the numerator of a rational whose denominator would
be infered ; but I am now convinced that this DUCET format currently
used in the UCD is creating more troubles than it helps).

-- Philippe.

2011/9/7 Philippe Verdy <verdy_p_at_wanadoo.fr>:
> I've just tried this idea, and this is working magically as I wanted.
> It generates extremely compact collation keys, notably if I no longer
> restrict collation levels so that they use a limited range: instead I
> explicitly use a level separator, to which I assign the final range of
> rationals [7/16 .. 16/16], that I can represent as 4 bits (1111).
>
> Then each level can use the collation weights in the half-open range
> [0/16 .. 7/16[, which can be splitted in as many elements as needed. I
> have also used the [0/16 .. 1/16[ half-open range for the weight
> representing ignorable elements: this will use 4 bits(0000) in the
> collation keys. This leaves the half-open range [1/16 .. 7/16[ for
> representing all non-ignorable weights for this level.
>
> For the primary level, I can predefined fixed ranges of rationals for
> representing variable elements, whitespaces, symbols, punctuations,
> numbers, and then letters for each script (also with fixed ranges for
> each script).
>
> Loading tailorings becomes extremely easy when using rationals, the
> algorithms become MUCH simpler.
>
> And finally, once you have assigned rationals to all collation
> elements for a specific tailoring (including for the DUCET itself), it
> is an extremely easy transform to infer integer weights. This means
> that you no longer need to use gaps in the sequences of default
> weights, or to given them special meanings. But my opinion is that
> this representation using integers is much more difficult to compress
> in collation keys.
>
> In fact, in a specific locale, where collation elements have some good
> estimation of the frequency of use of collation elements, it is easy
> to assign them an associated fixed rational, and to place all other
> collation elements by spliting the intervals as needed to fit. And you
> get extremely compact collation keys represented eigher by Huffman
> encoding (or by arithmetic codding if you have a licence for it from
> IBM, but this does not reduces much the length of collation keys).
Received on Tue Sep 06 2011 - 23:57:59 CDT

This archive was generated by hypermail 2.2.0 : Tue Sep 06 2011 - 23:58:01 CDT