Re: Size of Weights in Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Fri, 15 Mar 2013 13:52:39 -0700

On Fri, Mar 15, 2013 at 12:50 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> On Thu, 14 Mar 2013 19:13:43 -0700
> Markus Scherer <markus.icu_at_gmail.com> wrote:
>
> > On Thu, Mar 14, 2013 at 4:09 PM, Richard Wordingham <
> > richard.wordingham_at_ntlworld.com> wrote:
> >
> > > On Thu, 14 Mar 2013 14:49:18 -0700
> > > Markus Scherer <markus.icu_at_gmail.com> wrote:
>
> > While variableTop="u2FD5" ...
>
> > ... but a) this is a rarely used option and b) depends on the
> > implementation, and c) it makes no practical sense to make letters
> > ignorable.
>
> That doesn't stop the LDML specification having the example locales
> en-u-vt-0061 and en-u-vt-0061-0065. (I don't see what collating
> elements have variable weight in one but not the other.)
>

Richard, if something doesn't make sense, then let's say so (
http://unicode.org/cldr/trac/ticket/5811) rather than trying to divine what
it means. Someone clearly wrote an example of the -u- syntax with a couple
of code points without thinking about whether those code points made any
sense for that attribute.

> > "Fractional weights" is nothing other than the "large weights"
> > mechanism applied to byte-based weights of all levels. The UCA is
> > already "fractional" for implicit primaries.
>
> Not quite. The characterisation of variable weights knows nothing of
> the concept, and that is the problem.

That's a problem in some implementations, but not a problem with the
concept. Nothing prevents you from defining a variableTop that contains a
string of "large weights", and comparing that lexically with the string of
"large weights" that you look up for a character or substring. In fact,
that's really what ICU does, except the current code is limited to
one-or-two units (bytes).
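
As a rough illustration of that lexical comparison (a sketch only, not ICU's
actual code; the function name and the byte values below are made up for the
example), treating each primary as a byte string and comparing it against a
multi-byte variableTop could look like this:

```python
def is_variable(primary: bytes, variable_top: bytes) -> bool:
    """Return True if a primary weight falls in the variable range.

    Both arguments are byte strings of "large weights". Python compares
    bytes objects lexicographically, which is the comparison wanted here.
    An empty primary stands for a primary-ignorable element and is
    never treated as variable.
    """
    return len(primary) > 0 and primary <= variable_top


# Hypothetical three-byte variableTop, longer than the one-or-two-unit limit.
VARIABLE_TOP = bytes([0x0C, 0x8A, 0x04])

print(is_variable(bytes([0x05, 0x10]), VARIABLE_TOP))        # below the top
print(is_variable(bytes([0x0C, 0x8A, 0x05]), VARIABLE_TOP))  # just above it
```

Nothing in the comparison depends on how many bytes the boundary has, which is
the point: the length limit is an implementation detail.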

In CLDR/ICU's FractionalUCA.txt, all but 40 or so of the primary weights
(and many of the secondary weights) use the "large weights" mechanism.

The current ICU code then turns that into a data structure which
effectively re-chunks it to 16-bit primary weights (but keeping 8-bit
secondary weights), so that it turns into "large weights" only for
FractionalUCA primaries that are longer than 2 bytes -- but that's still
the majority of entries in the file.
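
A simplified sketch of what such re-chunking amounts to (this is not the real
ICU data structure; the function name is hypothetical, and it assumes that a
zero byte never occurs inside a fractional weight, so zero padding keeps the
sort order intact):

```python
from typing import List


def rechunk_primary(frac: bytes) -> List[int]:
    """Pack a byte-based fractional primary into 16-bit units.

    Bytes are paired big-endian; an odd trailing byte is padded with 0x00.
    A primary of one or two bytes fits in a single unit, while anything
    longer needs a continuation unit, i.e. the "large weights" case.
    """
    units = []
    for i in range(0, len(frac), 2):
        hi = frac[i]
        lo = frac[i + 1] if i + 1 < len(frac) else 0
        units.append((hi << 8) | lo)
    return units


# Illustrative values only, not taken from FractionalUCA.txt.
print([hex(u) for u in rechunk_primary(bytes([0x29, 0x04]))])
print([hex(u) for u in rechunk_primary(bytes([0x7E, 0x12, 0x34]))])
```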

(My new code will not internally use the "large weights" mechanism because
this mechanism makes some of the code dealing with parametric tailorings
fairly complex.)

> > And pinning the variable-top value to the next following end of a
> > reordering group, and no higher than the end of the primary-weight
> > range for currency symbols.
>
> Is there a CLDR ticket for this change to the meaning of variableTop?
> I couldn't find one.
>

It looks like this detail was only discussed in writing last July on the
icu-design mailing list when I proposed to deprecate and replace
variableTop. I know that we talked about it in the CLDR meeting as well.

I recently made a note here:
http://bugs.icu-project.org/trac/ticket/9958#comment:2

Otherwise it's
http://unicode.org/cldr/trac/ticket/5016 and
http://bugs.icu-project.org/trac/ticket/8032#comment:7

markus
Received on Fri Mar 15 2013 - 15:55:35 CDT
