Re: Size of Weights in Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Thu, 14 Mar 2013 14:49:18 -0700

In ICU, setVariableTop() has a documented limitation: It requires that the
primary weight has only 1 or 2 bytes. Until a few years ago, this was true
for most characters. Since then, Unicode added many more characters and we
ran out of space for 2-byte weights, given our constraints. So we use
3-byte primaries for the majority of characters now. See this doc from a
few years ago:
http://site.icu-project.org/design/collation/uca-weight-allocation
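
To illustrate the 1-or-2-byte restriction, here is a minimal Python sketch. It is not ICU code. It assumes a primary weight is held as a left-justified 32-bit integer with unused trailing bytes set to zero, and the function names primary_byte_length and fits_legacy_variable_top are made up for this example.

```python
def primary_byte_length(primary: int) -> int:
    """Count the significant bytes of a left-justified 32-bit primary weight.

    Assumption (for illustration only): unused trailing bytes are zero,
    so 0x12340000 is a 2-byte primary and 0x12345600 is a 3-byte primary.
    """
    if not 0 <= primary <= 0xFFFFFFFF:
        raise ValueError("primary must fit in 32 bits")
    length = 4
    # Strip zero bytes from the low end until a non-zero byte is found.
    while length > 0 and primary & 0xFF == 0:
        primary >>= 8
        length -= 1
    return length


def fits_legacy_variable_top(primary: int) -> bool:
    """True if the primary would satisfy the documented 1-or-2-byte limit."""
    return 1 <= primary_byte_length(primary) <= 2
```

Under these assumptions, fits_legacy_variable_top(0x12340000) is true, while a 3-byte primary such as 0x12345600 is rejected, which is the situation most characters are in now.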

Unfortunately, this makes setVariableTop() not work with most characters
(http://bugs.icu-project.org/trac/ticket/8103).
I believe we have not had any real bug reports about this. I think that
means that very few people care to change the variable top.

As Richard discovered, in UCA+DUCET with 16-bit weights, the same sort of
limitation could be applied, requiring that the primary weight not use
the Large Weight Values mechanism
(http://www.unicode.org/draft/reports/tr10/tr10.html#Large_Weight_Values),
that is, that it fits into a single 16-bit weight.

However, it does not make a lot of sense to set the variable top to
something above the currency symbols range -- it's basically an option for
an "ignore punctuation" mode, and you wouldn't want to ignore nearly every
assigned character in Unicode. I am not even sure it makes sense to set it
to a currency symbol. In UCA+DUCET, that means that any sensible
variable-top value does not use the large-weight-values mechanism anyway.
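
A toy Python sketch of why the variable top acts as an "ignore punctuation" switch. The weights in TOY_PRIMARY are invented for this example and are not DUCET values. The key builder is a simplification of the "shifted" option: it models only the primary and fourth levels and skips secondary and tertiary weights.

```python
# Hypothetical primary weights, for illustration only (not DUCET values).
TOY_PRIMARY = {" ": 1, "-": 2, "$": 10, "a": 20, "b": 21, "c": 22}


def shifted_key(text: str, variable_top: int) -> tuple:
    """Build a simplified (primary level, fourth level) sort key.

    Characters whose primary is <= variable_top are treated as variable:
    they are dropped from the primary level and their old primary moves
    to the fourth level. All other characters keep their primary and get
    the high filler 0xFFFF on the fourth level.
    """
    primary_level = []
    fourth_level = []
    for ch in text:
        p = TOY_PRIMARY[ch]
        if p <= variable_top:
            fourth_level.append(p)
        else:
            primary_level.append(p)
            fourth_level.append(0xFFFF)
    return (tuple(primary_level), tuple(fourth_level))


words = ["a-c", "ab"]
# Variable top at "-": the hyphen is ignored on the primary level.
print(sorted(words, key=lambda w: shifted_key(w, TOY_PRIMARY["-"])))
# Variable top of 0: nothing is variable, the hyphen sorts before letters.
print(sorted(words, key=lambda w: shifted_key(w, 0)))
```

With the variable top at the hyphen, "ab" sorts before "a-c"; with nothing variable, the order flips. Raising the variable top to the weight of "c" empties the primary level of every key in this toy table, which is the "ignore nearly every character" effect described above.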

Also, the highest assigned primary in the UCA spec is FFFD, not the
last-explicitly-mentioned mapping's primary. Remember that Han and
unassigned code points get implicit weights, and we have special weights.
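
For reference, a short Python sketch of how implicit primary weights are derived, following my reading of the implicit-weights section of UTS #10. The function name is made up, and the caller must supply the base, since choosing it requires character property data that this sketch does not include.

```python
# Bases as I understand them from UTS #10 (an assumption of this sketch):
BASE_CORE_HAN = 0xFB40      # Unified ideographs in the core CJK blocks
BASE_OTHER_HAN = 0xFB80     # Other unified ideographs (extension blocks)
BASE_UNASSIGNED = 0xFBC0    # Everything else, including unassigned code points


def implicit_primary_pair(code_point: int, base: int) -> tuple:
    """Return the two 16-bit primary weights (AAAA, BBBB) for a code point.

    AAAA = base + (code_point >> 15)
    BBBB = (code_point & 0x7FFF) | 0x8000
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    aaaa = base + (code_point >> 15)
    bbbb = (code_point & 0x7FFF) | 0x8000
    return aaaa, bbbb


# U+4E00 with the core Han base
print([hex(w) for w in implicit_primary_pair(0x4E00, BASE_CORE_HAN)])
```

Each such code point gets two collation elements, so its primary does not fit into a single 16-bit weight. Even for U+10FFFF with the highest base, the leading weight in this sketch is 0xFBE1, which stays below FFFD.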

In my pending ICU collation rewrite, I do not use the large-weight-values
mechanism any more, so the new code would permit a 4-byte variableTop.
However, we have agreed to replace the hard-to-use variableTop attribute
with an easy-to-use maxVariable attribute, so this whole discussion will
become moot at that point: http://unicode.org/cldr/trac/ticket/5016

Best regards,
markus

-- 
Google Internationalization Engineering
Received on Thu Mar 14 2013 - 16:53:25 CDT
