Re: Size of Weights in Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 16 Mar 2013 11:09:37 +0000

On Fri, 15 Mar 2013 21:12:48 -0700, Markus Scherer wrote:

> On Fri, Mar 15, 2013 at 6:52 PM, Richard Wordingham wrote:
(Well, actually the send button was pressed at 01.52 GMT on Saturday.)

> > > The point is that no sequence of
> > > units (8-bit, 16-bit or whatever the implementation uses) can be
> > > an exact prefix of another sequence.

> > That's only for efficiency.
 
> No, it's critical for correctness.

> > One could allocate low unit values to the
> > start units and high unit values to continuation units.
(Paragraph split in this post, for greatly improved clarity.)

Please give an example of how the low/high split would fail. With the
primary collation weights 20, 21, 21 80 (a two-unit weight) and 22, I
get the following primary collation weight sequences for one and two
collating elements, marking boundaries of collating elements with commas:

20
20, 20
20, 21
20, 21 80
20, 22
21
21, 20
21, 21
21, 21 80
21, 22
21 80
21 80, 20
21 80, 21
21 80, 21 80
21 80, 22
22
22, 20
22, 21
22, 21 80
22, 22

They seem to be in perfect order to me.
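
The same check can be done mechanically. What follows is only a minimal
Python sketch, not anything taken from the UCA or an implementation: the
names WEIGHTS, flatten, all_sequences, order_preserved and show are
made up for this illustration, and the weights are the four used above,
read as hexadecimal units. It compares the order obtained by comparing
collating elements one at a time with the order obtained by comparing
the concatenated units.

```python
from itertools import product

# The four primary weights from the list above, as tuples of units.
# Start units (0x20-0x22) are all lower than the continuation unit (0x80).
WEIGHTS = [(0x20,), (0x21,), (0x21, 0x80), (0x22,)]


def flatten(elements):
    """Concatenate the units of a sequence of collating elements."""
    return tuple(unit for weight in elements for unit in weight)


def all_sequences(max_len):
    """Yield every sequence of 1..max_len collating elements."""
    for n in range(1, max_len + 1):
        yield from product(WEIGHTS, repeat=n)


def order_preserved(max_len=2):
    """True if element-wise order and flat unit order agree."""
    seqs = list(all_sequences(max_len))
    by_element = sorted(seqs)            # compare element by element
    by_unit = sorted(seqs, key=flatten)  # compare concatenated units
    return by_element == by_unit


def show(elements):
    """Format a sequence the way the list above does, e.g. '20, 21 80'."""
    return ", ".join(" ".join(f"{u:02X}" for u in w) for w in elements)


if __name__ == "__main__":
    for seq in sorted(all_sequences(2), key=flatten):
        print(show(seq))
    print("order preserved:", order_preserved(2))
```

Raising max_len extends the check to longer sequences. It covers only
this one set of weights, so it is a spot check and not a proof.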

> > By using
> > high values for continuation units, DUCET simplifies the
> > identification

> One could pick nearly any range for the trailing units. With the UCA
> spec using 16-bit units and only 21 bits to encode in a pair, there
> is nearly free choice for the range of trail units.

For most uses, and in particular for DUCET, the trailing units must not
be mistakable for the primary weights of variable collation elements.
Before positive non-variable primary weights lower than the variable
primary weights were allowed, it was very easy to check for such a
problem as one read in an allkeys-style UCET. (It's still very easy if
the first positive weight is variable, as in allkeys.txt itself.)

Richard.
Received on Sat Mar 16 2013 - 06:15:16 CDT
