Re: Size of Weights in Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Thu, 14 Mar 2013 19:13:43 -0700

On Thu, Mar 14, 2013 at 4:09 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> On Thu, 14 Mar 2013 14:49:18 -0700
> Markus Scherer <markus.icu_at_gmail.com> wrote:
>
> > However, it does not make a lot of sense to set the variable top to
> > something above the currency symbols range -- it's basically an
> > option for an "ignore punctuation" mode, and you wouldn't want to
> > ignore nearly every assigned character in Unicode.
>
> There are a lot of characters in the SIP!

Richard, we are talking about collation here, and "variable top" works by
comparing primaries with a threshold.
With "above the currency symbols range" I of course meant "above" in
collation order.
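
To illustrate the threshold comparison, here is a minimal Python sketch. The primary weights in the table are made up for the example (they are not real UCA or CLDR values), and the helper names are hypothetical. It only models the primary level of a "shifted"-style comparison, not the fourth-level weights.

```python
# Hypothetical primary weights, listed in collation order.
# These numbers are illustrative only, not taken from allkeys.txt or CLDR.
PRIMARY = {
    " ": 0x0209,  # space
    "-": 0x020D,  # punctuation
    "$": 0x0F00,  # currency symbol
    "1": 0x1000,  # digit
    "a": 0x2000,  # letter
}


def is_variable(primary, variable_top):
    """Treat an element as variable if its primary is nonzero and <= the threshold."""
    return 0 < primary <= variable_top


def shifted_primary_key(text, variable_top):
    """Primary-level key in which variable elements are dropped (made ignorable)."""
    return [PRIMARY[ch] for ch in text if not is_variable(PRIMARY[ch], variable_top)]


# Threshold at punctuation: "a-a" compares like "aa" on the primary level.
key_punct = shifted_primary_key("a-a", PRIMARY["-"])

# Threshold at a letter: everything at or below it becomes ignorable.
key_letters = shifted_primary_key("a-1", PRIMARY["a"])
```

With the threshold set at a letter primary, the key for "a-1" comes out empty, which is the "ignore nearly every character" effect described above.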

> While variableTop="u2FD5"
> would probably be a mistake or a mischievous experiment, some might be
> tempted to blot out all non-Han characters! I don't think there is a
> real problem yet, but it is an annoying fact that there can be a
> difference depending on whether one uses 16- or 32-bit weights.

Yes, but (a) this is a rarely used option, (b) the difference depends on
the implementation, and (c) it makes no practical sense to make letters
ignorable.

> The
> good news is that there is a solution, namely to introduce fractional
> weights to the allkeys format under the headings of 'large weights' and
> 'escape hatch'.
>

"Fractional weights" is nothing other than the "large weights" mechanism
applied to byte-based weights of all levels. The UCA is already
"fractional" for implicit primaries.
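
As a concrete case of a primary that does not fit in one 16-bit unit, here is a small Python sketch of the implicit-weight derivation as I recall it from UTS #10: the code point is split across two 16-bit primary weights. The function name is hypothetical, and picking the base from the character's properties is left to the caller.

```python
# Bases as given in UTS #10 for implicit weights (stated from memory, please
# check against the spec): 0xFB40 for core CJK unified ideographs, 0xFB80 for
# other unified ideographs, 0xFBC0 for everything else without an explicit weight.
BASE_CORE_HAN = 0xFB40
BASE_OTHER_HAN = 0xFB80
BASE_OTHER = 0xFBC0


def implicit_primary(cp, base):
    """Return the two 16-bit primary weights (AAAA, BBBB) derived from a code point."""
    aaaa = base + (cp >> 15)           # high bits of the code point, offset by the base
    bbbb = (cp & 0x7FFF) | 0x8000      # low 15 bits, with the top bit set so BBBB is nonzero
    return aaaa, bbbb


# U+4E00 with the core Han base, U+20000 with the other-Han base.
pair_4e00 = implicit_primary(0x4E00, BASE_CORE_HAN)
pair_20000 = implicit_primary(0x20000, BASE_OTHER_HAN)
```

The two values together act as one long primary spread over two collation elements, which is the sense in which the primary is "fractional" here.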

> > However, we have agreed to replace the
> > hard-to-use variableTop attribute with an easy-to-use maxVariable
> > attribute, so this whole discussion will become moot at that point:
> > http://unicode.org/cldr/trac/ticket/5016
>
> Actually, you've only proposed deprecating it.
>

And pinning the variable-top value to the next following end of a
reordering group, but no higher than the end of the primary-weight range
for currency symbols.
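
A rough Python sketch of that pinning, under my reading of it: round the requested value up to the nearest end of a reordering group, and cap it at the end of the currency group. The group-end primaries below are invented for the example, and the function name is hypothetical.

```python
import bisect

# Hypothetical last primary weight of each reordering group, in collation order.
# The values are illustrative only.
GROUP_ENDS = [
    ("space", 0x020A),
    ("punct", 0x03FF),
    ("symbol", 0x0EFF),
    ("currency", 0x0FFF),
]


def pin_variable_top(requested):
    """Round a requested variable top up to a group end, capped at the currency end."""
    ends = [end for _, end in GROUP_ENDS]
    # Index of the first group end that is >= the requested primary.
    i = bisect.bisect_left(ends, requested)
    # Anything above the currency range is clamped to the end of that range.
    i = min(i, len(ends) - 1)
    return ends[i]


pinned_punct = pin_variable_top(0x020D)    # inside the punctuation group
pinned_letter = pin_variable_top(0x2000)   # a letter primary, above currency
```

With these made-up values, a request inside the punctuation range lands on the end of punctuation, and a request at a letter primary is pulled back down to the end of the currency range.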

markus
Received on Thu Mar 14 2013 - 21:18:48 CDT
