Re: Size of Weights in Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 16 Mar 2013 22:24:15 +0000

On Sat, 16 Mar 2013 21:58:02 +0100
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2013/3/16 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
> > On Sat, 16 Mar 2013 09:29:07 -0700
> > Markus Scherer <markus.icu_at_gmail.com> wrote:

> >> On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham <
> >> richard.wordingham_at_ntlworld.com> wrote:

> >> > Please give an example of how the low/high split would fail. With
> >> > the primary collation weights 20, 21, 21 80 and 22 I get the
> >> > following primary collation weight sequences for one and two
> >> > collating elements, marking boundaries of collating elements with
> >> > commas:

> >> The problem is that if you have 21 and 21 80, and another primary
> >> starts with 80, you can't distinguish the sequence 21 | 80 from the
> >> one weight 21 80.

> > But with the low/high split scheme, start units have to have low
> > values (e.g. 20, 21 & 22) and continuation units have high values
> > (e.g. 80) just to stop this very problem.

> Actually no, this is not enough. The scheme cannot just be start vs.
> continuation, but non-final vs final. The encoding of weights must be
> done so that any encoded weight MUST NOT be a prefix of another
> encoded weight.

If you start with my start = low, continuation = high scheme, you can
convert it in an order-preserving manner to a no-prefix scheme by
the following simple transform:

   If a simple weight precedes a continuation weight, add 0.8 ('.' is
   serving as the hexadecimal point) to it.

Thus 21, 21 81 and 21 81 81 become 21, 21.8 81 and 21.8 81.8 81. If
you don't like semi-integral values, discard the high bit and double,
yielding 42, 43 02 and 43 03 02. You may recognise your non-final v.
final scheme! (I replaced '80' by '81' to avoid confusing zeroes.)
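
For anyone who wants to check the arithmetic, here is a minimal sketch of
the transform in Python. It assumes each weight is a byte string whose
first unit is below 0x80 and whose continuation units are 0x80 or above.
The names to_no_prefix and is_prefix_free are mine, purely illustrative,
and not taken from any implementation.

```python
def to_no_prefix(weight: bytes) -> bytes:
    """Map a low-start/high-continuation weight to non-final/final form.

    Each unit has its high bit discarded and is doubled; 1 is added when
    another unit of the same weight follows (the doubled 'half').
    """
    out = []
    for i, unit in enumerate(weight):
        followed = i + 1 < len(weight)  # a continuation unit comes next
        out.append(2 * (unit & 0x7F) + (1 if followed else 0))
    return bytes(out)


def is_prefix_free(codes) -> bool:
    """True if no code in the collection is a proper prefix of another."""
    return not any(
        a != b and b.startswith(a) for a in codes for b in codes
    )


# Assumed sample weights: 20, 21, 21 81, 21 81 81 and 22.
weights = [bytes.fromhex(h) for h in ("20", "21", "2181", "218181", "22")]
encoded = [to_no_prefix(w) for w in weights]

for w, e in zip(weights, encoded):
    print(w.hex(), "->", e.hex())
print("prefix-free:", is_prefix_free(encoded))
```

In this small sample the non-final units come out odd and the final units
even, so no encoded weight can be a prefix of another, and the encoded
weights sort in the same order as the originals.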

If you're still not convinced, please show me what goes wrong.

Richard.
Received on Sat Mar 16 2013 - 17:29:57 CDT
