Re: Size of Weights in Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 16 Mar 2013 18:25:23 +0000

On Sat, 16 Mar 2013 09:29:07 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:
>
> > Please give an example of how the low/high split would fail. With
> > the primary collation weights 20, 21, 21 80 and 22 I get the
> > following primary collation weight sequences for one and two
> > collating elements, marking boundaries of collating elements with
> > commas:
> >
>
> The problem is that if you have 21 and 21 80, and another primary
> starts with 80, you can't distinguish the sequence 21 | 80 from the
> one weight 21 80.

But with the low/high split scheme, start units must have low values
(e.g. 20, 21 & 22) and continuation units must have high values
(e.g. 80), precisely to prevent this problem: no primary can start
with 80.
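
To make the point concrete, here is a minimal sketch in Python. The
boundary value 0x80 and all function names are my own assumptions for
illustration; only the weights 20, 21, 21 80 and 22 come from the
thread. It checks that, for every sequence of one or two collating
elements, the flat unit sequence splits back into the original weights
and sorts in the same order as the element-wise comparison.

```python
from itertools import product

# Assumption: units below 0x80 start a weight; units at or above 0x80 continue one.
CONTINUATION_MIN = 0x80

def flatten(elements):
    """Concatenate the weight units of a sequence of collating elements."""
    return tuple(unit for weight in elements for unit in weight)

def split_weights(units):
    """Recover weight boundaries from a flat unit sequence under the low/high split."""
    weights = []
    for unit in units:
        if unit < CONTINUATION_MIN:
            weights.append((unit,))      # low unit: start of a new weight
        elif weights:
            weights[-1] += (unit,)       # high unit: continues the current weight
        else:
            raise ValueError("sequence starts with a continuation unit")
    return weights

# The primaries from the thread, each written as a tuple of units.
WEIGHTS = [(0x20,), (0x21,), (0x21, 0x80), (0x22,)]

def sequences(weights, max_len=2):
    """All sequences of 1..max_len collating elements drawn from weights."""
    return [seq for n in range(1, max_len + 1)
            for seq in product(weights, repeat=n)]

def split_is_consistent(weights):
    """True if flattening is reversible and preserves element-wise order."""
    seqs = sequences(weights)
    reversible = all(split_weights(flatten(s)) == list(s) for s in seqs)
    same_order = sorted(seqs) == sorted(seqs, key=flatten)
    return reversible and same_order
```

Under these assumptions split_is_consistent(WEIGHTS) holds. By
contrast, if a primary 80 were allowed as a start unit,
flatten(((0x21,), (0x80,))) and flatten(((0x21, 0x80),)) would be the
same unit sequence, which is the collision described in the quoted
text.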

> > For most uses, in particular, those in DUCET, the trailing units
> > must not be mistakable for variable primary collation elements.

> You have to know which one is a trailing unit. I suppose you could do
> it via ranges like in UTF-8, but that means you can use fewer byte
> values per position and thus yields longer weights, and longer sort
> keys.

With allkeys-type definitions and no more tailoring than strengths and
variable weight schemes (with untailorable variable weight ranges), the
implementation doesn't need to know which units are trailing units,
unless it is checking well-formedness. Should it need to know, all it
has to check for is a zero level 3 weight.
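
As a rough sketch of that check in Python: the CollationElement type
and the function name below are hypothetical, and the extra test for a
non-zero primary is my own assumption, added so that completely
ignorable elements (all weights zero) are not counted as trailing
units.

```python
from typing import NamedTuple

class CollationElement(NamedTuple):
    """Hypothetical three-level collation element, as in an allkeys-style table."""
    primary: int
    secondary: int
    tertiary: int

def is_trailing_unit(ce):
    """Treat an element with a zero level 3 weight as a trailing unit.

    Assumption: the primary must be non-zero, which excludes completely
    ignorable elements whose weights are all zero.
    """
    return ce.primary != 0 and ce.tertiary == 0

# Example values for a weight split across two collation elements.
lead = CollationElement(0xFB40, 0x0020, 0x0002)
trail = CollationElement(0xCE00, 0x0000, 0x0000)
```

Here is_trailing_unit(trail) is true and is_trailing_unit(lead) is
false; an implementation that only builds sort keys would never need
to call it.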

If the variableTop parametric tailoring parameter is effectively
removed, then a very well-formed table would be one in which each of
the four possible sets of variable primaries selectable by standard
UCA parametric tailoring has a well-formed collection of variable
weights. DUCET achieves this by ensuring that there are no large
weights in the region of interest, which keeps sorting implementations
simple once one has split a string (and its characters!) into
collating elements.

The only size-related issue left is specifying how to mimic the odd
behaviour of some ICU rules defining ordering. Perhaps that is not a
UCA issue - the standard UCA parametric tailorings do not call up such
definitions.

Richard.
Received on Sat Mar 16 2013 - 13:30:09 CDT
