Re: CaseFirst and CaseLevel Tailorings of UCA and LDML from Markus Scherer on 2012-05-23 (Unicode Mail List Archive)

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Wed, 23 May 2012 10:35:46 -0700

On Tue, May 22, 2012 at 2:22 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> > > > I can dig up the ICU code that computes the
> > > > collation case bits for a string.
>
> It would be helpful. I can't see well enough how the data gets in.
>

I found the code that computes the case bits (2 bits for lower/mixed/upper)
for building ICU tailorings. Search for "getCaseBits" in

Java: main/classes/collate/src/com/ibm/icu/text/CollationParsedRuleBuilder.java
<http://bugs.icu-project.org/trac/browser/icu4j/trunk/main/classes/collate/src/com/ibm/icu/text/CollationParsedRuleBuilder.java>

C++: source/i18n/ucol_bld.cpp
<http://bugs.icu-project.org/trac/browser/icu/trunk/source/i18n/ucol_bld.cpp>

Sadly, this code looks fishy. I just submitted
http://bugs.icu-project.org/trac/ticket/9337

It is also clear that the CLDR UCA/DUCET table used in ICU
(FractionalUCA.txt) is built with different code for the case bits that
works for supplementary characters. For example, the Deseret small and
capital letters long I have the correct case bits in our version of the
DUCET, but both get "lower case" bits when they are tailored.
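
To illustrate how supplementary characters can lose their case, here is a
minimal Java sketch. It is not the ICU code, and I am only assuming that
per-code-unit inspection is the kind of problem involved; the ticket has the
details. The class and method names are made up for this example. It compares
looking at UTF-16 code units with looking at whole code points for U+10400
DESERET CAPITAL LETTER LONG I.

```java
final class SupplementaryCaseSketch {
    /** Looks at UTF-16 code units one by one; surrogates are never "upper case". */
    static boolean hasUpperByCodeUnit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isUpperCase(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** Looks at whole code points, so supplementary letters keep their case. */
    static boolean hasUpperByCodePoint(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isUpperCase(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I, stored as a surrogate pair
        String deseretCapital = new String(Character.toChars(0x10400));
        System.out.println("by code unit:  " + hasUpperByCodeUnit(deseretCapital));
        System.out.println("by code point: " + hasUpperByCodePoint(deseretCapital));
    }
}
```

A check that only ever sees the two surrogates has nothing to classify, so a
builder written that way would fall back to whatever its default case is.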

> &™=™ seems to change U+2122 TRADE MARK SIGN from <compat> lowercase
> tertiary weight tagged as lower case to <compat> lowercase
> tertiary weight tagged as upper case! As a consequence, when
> CaseFirst=uppercase is selected, it suddenly sorts before the 2-letter
> string 'TM'! This seems to be because its decomposition mapping as
> <TM> is examined.
>

Yes, the first step in getCaseBits() is to normalize to NFKD. However, this
sets the case bits, not the tertiary weight. They are separate in our
implementation. With default collation options, the case bits get ignored
(masked away). With "case level on", they get moved into a separate level
between secondary & tertiary. With "case first" they are retained in the
tertiary-weight byte so that case differences trump other tertiary
differences.
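
A rough Java sketch of those three modes follows. The byte layout (case in
the top 2 bits, tertiary weight in the low 6 bits), the numeric case values
and all names are assumptions for illustration only, not a copy of the ICU
sources.

```java
final class CaseBitHandlingSketch {
    // Assumed layout of one tertiary byte: CC TTTTTT
    static final int CASE_MASK = 0xC0;      // top 2 bits: case
    static final int TERTIARY_MASK = 0x3F;  // low 6 bits: tertiary weight

    // Hypothetical case values: 0 = lower, 1 = mixed, 2 = upper
    static int pack(int caseBits, int tertiary) {
        return ((caseBits << 6) & CASE_MASK) | (tertiary & TERTIARY_MASK);
    }

    /** Default options: the case bits are masked away. */
    static int tertiaryDefault(int b) {
        return b & TERTIARY_MASK;
    }

    /** Case level on: the case bits become the weight of a separate level. */
    static int caseLevelWeight(int b) {
        return (b & CASE_MASK) >>> 6;
    }

    /** Case first: the case bits stay in the byte, above the tertiary bits. */
    static int tertiaryCaseFirst(int b) {
        return b & (CASE_MASK | TERTIARY_MASK);
    }

    public static void main(String[] args) {
        int lowerHeavy = pack(0, 0x05); // lower case, larger tertiary weight
        int upperLight = pack(2, 0x03); // upper case, smaller tertiary weight
        // Default: only the tertiary weights are compared.
        System.out.println(tertiaryDefault(lowerHeavy) > tertiaryDefault(upperLight));
        // Case first: the case bits outrank the tertiary weights.
        System.out.println(tertiaryCaseFirst(lowerHeavy) < tertiaryCaseFirst(upperLight));
    }
}
```

With the case bits in the high-order position, any case difference decides
the byte comparison before the remaining tertiary bits are looked at.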

> On the other hand, &\ua7f8=\ua7f8 has no effect on the sorting of
> U+A7F8 MODIFIER LETTER CAPITAL H WITH STROKE, which continues to be
> sorted as lower case.

Apparently true, but I don't understand why. I would have to try this in
the debugger. The current getCaseBits() should get the "upper case" bits
from the NFKD version U+0126.
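
For reference, the "normalize to NFKD, then look at the case" idea can be
tried with the JDK alone. This is a simplified stand-in, not getCaseBits()
itself: the class, the enum and the rule that uncased text counts as lower
case are my assumptions.

```java
import java.text.Normalizer;

final class NfkdCaseSketch {
    enum CaseClass { LOWER, MIXED, UPPER }

    /** Classifies a string by the case of the letters in its NFKD form. */
    static CaseClass classify(String s) {
        String nfkd = Normalizer.normalize(s, Normalizer.Form.NFKD);
        boolean sawUpper = false;
        boolean sawLower = false;
        for (int i = 0; i < nfkd.length(); ) {
            int cp = nfkd.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isUpperCase(cp)) {
                sawUpper = true;
            } else if (Character.isLowerCase(cp)) {
                sawLower = true;
            }
        }
        if (sawUpper && sawLower) {
            return CaseClass.MIXED;
        }
        // Assumption: text without any upper-case letter counts as lower case.
        return sawUpper ? CaseClass.UPPER : CaseClass.LOWER;
    }

    public static void main(String[] args) {
        System.out.println(classify("\u2122")); // TRADE MARK SIGN, NFKD is "TM"
        System.out.println(classify("\uA7F8")); // NFKD is U+0126
        System.out.println(classify("Tm"));
    }
}
```

By this simplified rule both U+2122 and U+A7F8 come out as upper case, which
is what makes the observed behavior for U+A7F8 surprising.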

> I am beginning to believe that it is impossible for ICU users to tailor
> U+A7F8 MODIFIER LETTER CAPITAL H WITH STROKE to be upper case!
>

You cannot explicitly determine the case bits, only the relative tertiary
weights. The case bits are computed.

markus

-- 
Google Internationalization Engineering

Received on Wed May 23 2012 - 12:38:06 CDT