From: Philippe Verdy <verdy_p_at_wanadoo.fr>

Date: Wed, 7 Sep 2011 06:52:09 +0200

Date: Wed, 7 Sep 2011 06:52:09 +0200

I expect to publish soon a completely different way to represent the

DUCET in a way compatible with the rationals representation. In fact

this new dataset will be FULLY compatible with it (or also with the

CLDR root derived-DUCET), without even needing to specify in the data

*any* numeric value (say good bye to *all* numeric weights in this

datafile, and as well to the existing very complex representation in

CLDR, which needlessly tries to assign fixed collation levels).

My representation will in fact ONLY contain ordered lists of collation

elements. The primary level will contain only 10 pseudo-elements:

<ignorables>,<variables>,<whitespaces>,<punctuations>,<symbols>,<currencyunits>,<numbers>,<letters><privateuses>,<levelterminator>

The secondary level will list only the "leaders" in the sequences of

collation elements, listed after the 11 pseudo-elements above. But New

pseudo elements will be added for scripts in the subsequence between

<letters> and <privateuses>.

The ternary level will list only the existing "leaders" for currently

differentiated at the primary level in the existing DUCET.

The format will allow specifying arbitrary rational values to some

collation elements (in ascending order listed at each level), based on

estimated statistics of use of the collation elements in the sequence

(it will be in fact given as a percentage). All other collation

elements will be assigned rational values algorithmically. Typically,

a tailoring could change those statistics to compute themselves a good

rational approximation of the statistics for the fixed elements, and

then will assign the other unspecified weights, using a linear

distribution in the rational range delimited by collation elements

with assigned statistics. Those statistics will not need to be

specified for all collation elements, only for a few significant

subranges of collation elements in the sequence.

Pseudo-elements in the sequences collation elements will be

automatically assigned an unmodifiable statistics of "EPSILON", i.e.

the smallest rational representable =1/N, where N is any integer

between the number of collation elements and pseudo-elements in the

subsequence, and the largest acceptable integer supported by the

implementation). If a tailoring assigns a statistics lower than this

EPSILON value, it will be automatically increased to be at least this

value, and the sum of statistics for all collation elements in the

subsequence will then be checked to make sure that it does not exceed

100%, after temporarily assigning this epsilon value to all collation

elements with unspecified statistics; if this 100% is exceeded, the

EPSILON value will be reduced and all other statistics will be

proportionally adjusted.

This makes all subsequences conforming so that each one can get a well

defined proportional represetnation as a subsegment of rationals over

the hald-open [0/1..1/1[ range of rationals.

I just decided to remove the last boundary 1/1, it is not necessary

and later complicates the conversion to bitstrings in the Huffman or

arithmetic conversion used when generating collation keys.

But when only comparing pairs of strings, where no collation key is

used or needed, we can just compare the rational weights without

generating Huffman or arithmetic bitstrings, which is MUCH faster, and

there's no need in this case to use the statistics. The rationals

don't have to be tuned per locale to match their frequencies of use

for just comparing texts.That's why their specification will be

unspecified for the use in the DUCET or CLDR root (all the rational

weights will be generated by a very basic algorithmic counting the

number of collation elements in each subsequence to compute their

common denominator, and then assigning numerators as an integer

sequence).

It will be implied that the first element in the sequence for a given

elevel, will be used to represent all ignorable collation elements at

this level, and the last will be used for the level terminator (the

only pseudo-element to keep in this sequence, and that you won't even

need to specify in the data set).

So the data set will consist only in lists of collation elements (for

example as a comma-separated list, the separation could as well be

newlines, or "<" or just spaces or tabs to allow commenting the list

after a "#" sharp sign up to the newline), and nothing else. Using a

syntax that does need to specify the level number between each

element, it will be MUCH more compact than even the existing

"abbreviated" syntax used in LDML. No complex parser except for the

isolated collation elements themselves.

There will be no "reset" at all, a single reset will be implied at the

begining of each sequence for a given level N, or if a collation

element matches <levelterminator>. It will be implicit that levels

will need to be specified separately in ascending order, and it will

not be necessary to assign numeric values to these levels, only

possibly symbolic names (such as "accents" or "cases"), if one want to

swap levels in a subsequent tailoring.

With some help, I could be able to demonstrate a basic implementation

in Java with almost no use of object programming, about how to infer

the rational weights (for the simplest case where you don't want to

tune the statistics to make generated collation keys optimally

compressed with Huffmann or arithmetic coding), from the simpler data

file (optionally but later, the code would demonstrate how to use the

existing DUCET format (using the integer numeric values exposed there,

as a way to define the numerator of a rational whose denominator would

be infered ; but I am now convinced that this DUCET format currently

used in the UCD is creating more troubles than it helps).

-- Philippe.

2011/9/7 Philippe Verdy <verdy_p_at_wanadoo.fr>:

*> I've just tried this idea, and this is working magically as I wanted.
*

*> It generates extremely compact collation keys, notably if I no longer
*

*> restrict collation levels so that they use a limited range: instead I
*

*> explicitly use a level separator, to which I assign the final range of
*

*> rationals [7/16 .. 16/16], that I can represent as 4 bits (1111).
*

*>
*

*> Then each level can use the collation weights in the half-open range
*

*> [0/16 .. 7/16[, which can be splitted in as many elements as needed. I
*

*> have also used the [0/16 .. 1/16[ half-open range for the weight
*

*> representing ignorable elements: this will use 4 bits(0000) in the
*

*> collation keys. This leaves the half-open range [1/16 .. 7/16[ for
*

*> representing all non-ignorable weights for this level.
*

*>
*

*> For the primary level, I can predefined fixed ranges of rationals for
*

*> representing variable elements, whitespaces, symbols, punctuations,
*

*> numbers, and then letters for each script (also with fixed ranges for
*

*> each script).
*

*>
*

*> Loading tailorings becomes extremely easy when using rationals, the
*

*> algorithms become MUCH simpler.
*

*>
*

*> And finally, once you have assigned rationals to all collation
*

*> elements for a specific tailoring (including for the DUCET itself), it
*

*> is an extremely easy transform to infer integer weights. This means
*

*> that you no longer need to use gaps in the sequences of default
*

*> weights, or to given them special meanings. But my opinion is that
*

*> this representation using integers is much more difficult to compress
*

*> in collation keys.
*

*>
*

*> In fact, in a specific locale, where collation elements have some good
*

*> estimation of the frequency of use of collation elements, it is easy
*

*> to assign them an associated fixed rational, and to place all other
*

*> collation elements by spliting the intervals as needed to fit. And you
*

*> get extremely compact collation keys represented eigher by Huffman
*

*> encoding (or by arithmetic codding if you have a licence for it from
*

*> IBM, but this does not reduces much the length of collation keys).
*

Received on Tue Sep 06 2011 - 23:57:59 CDT

*
This archive was generated by hypermail 2.2.0
: Tue Sep 06 2011 - 23:58:01 CDT
*