Re: CaseFirst and CaseLevel Tailorings of UCA and LDML

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 23 May 2012 22:01:31 +0100

On Wed, 23 May 2012 10:35:46 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Tue, May 22, 2012 at 2:22 PM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:

> I found the code that computes the case bits (2 bits for
> lower/mixed/upper) for building ICU tailorings. Search for
> "getCaseBits" in
>
> Java:
> main/classes/collate/src/com/ibm/icu/text/CollationParsedRuleBuilder.java
> <http://bugs.icu-project.org/trac/browser/icu4j/trunk/main/classes/collate/src/com/ibm/icu/text/CollationParsedRuleBuilder.java>
>
> C++:
> source/i18n/ucol_bld.cpp
> <http://bugs.icu-project.org/trac/browser/icu/trunk/source/i18n/ucol_bld.cpp>
>
> Sadly, this code looks fishy. I just submitted
> http://bugs.icu-project.org/trac/ticket/9337

While we're picking on that poor routine - it looks as though it could
come unstuck with kana in the supplementary planes - the Kana
Supplement, and possibly also the Enclosed Ideographic Supplement. Do
you want a comment on that added to the ticket, or does that issue
deserve a whole ticket to itself?
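
To illustrate the kind of hazard meant here, a minimal sketch in Python follows. It is not the ICU routine: the helper names utf16_units and bmp_only_sees_kana are made up for this illustration, and the BMP-only range test is an assumption about the sort of check that could go wrong, not a statement of what getCaseBits actually does. It only shows that a test applied to individual UTF-16 code units sees surrogates, never kana, for characters in the supplementary planes.

```python
import struct
import unicodedata

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def bmp_only_sees_kana(text):
    """Hypothetical per-code-unit test against the BMP Hiragana and
    Katakana blocks (U+3040..U+30FF). An assumption for illustration."""
    return any(0x3040 <= unit <= 0x30FF for unit in utf16_units(text))

samples = [
    "\u30A2",      # KATAKANA LETTER A, in the BMP
    "\U0001B000",  # KATAKANA LETTER ARCHAIC E, Kana Supplement
    "\U0001F200",  # SQUARE HIRAGANA HOKA, Enclosed Ideographic Supplement
]

for ch in samples:
    units = ["%04X" % u for u in utf16_units(ch)]
    print("U+%04X" % ord(ch), unicodedata.name(ch), units,
          bmp_only_sees_kana(ch))
```

A routine that works on whole code points rather than code units would not have this particular blind spot, though it would still need the supplementary ranges listed explicitly.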

Comment 2 in http://bugs.icu-project.org/trac/ticket/9337 seems to be
the answer to my opening question - the case for caseFirst and
caseLevel tailorings is defined, in the absence of non-parametric
tailorings, by FractionalUCA.txt. Is there a definition of the precise
relationship between DUCET and FractionalUCA.txt, or does
FractionalUCA.txt define the relationship? I presume FractionalUCA.txt
takes precedence over UCA_Rules.txt. They do differ - the file
FractionalUCA.txt assigns <U+0FB2, U+034F, U+0F71> and <U+0FB2, U+0F71>
the same 3-level weights, but UCA_Rules.txt assigns them a tertiary
difference. I've reported that in formal Unicode feedback.
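
For reference, the two sequences can be spelled out with a short Python sketch. It uses only the standard unicodedata module and says nothing about the weights themselves, which have to be read from FractionalUCA.txt and UCA_Rules.txt. The variable names are chosen here for illustration.

```python
import unicodedata

with_cgj = "\u0FB2\u034F\u0F71"     # <U+0FB2, U+034F, U+0F71>
without_cgj = "\u0FB2\u0F71"        # <U+0FB2, U+0F71>

for label, seq in (("with U+034F", with_cgj), ("without U+034F", without_cgj)):
    print(label)
    for ch in seq:
        # Code point, canonical combining class, and character name.
        print("  U+%04X ccc=%3d %s"
              % (ord(ch), unicodedata.combining(ch), unicodedata.name(ch)))

# Normalization leaves each sequence unchanged, so the two strings
# are not canonically equivalent to one another.
print(unicodedata.normalize("NFD", with_cgj) == with_cgj)
print(unicodedata.normalize("NFD", without_cgj) == without_cgj)
```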

> It is also clear that the CLDR UCA/DUCET table used in ICU
> (FractionalUCA.txt) is built with different code for the case bits
> that works for supplementary characters.

A further wrinkle is that case seems more a property of collation
elements than of characters. I haven't checked that one can read back
from case assignments in FractionalUCA.txt to DUCET. (In general,
there need not be an element-to-element mapping between collation
*elements* for equivalent UCA-compliant collations.) At present, the
primarily non-ignorable collation elements of a character of general
category Lt are an uppercase collation element followed by a lowercase
collation element. As you've said, no mixed case in the root locale.
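
The upper-then-lower pattern can be made plausible with a small Python sketch for the four Latin titlecase digraphs. Note the assumption: it inspects compatibility decompositions from the standard unicodedata module, which is only a stand-in for looking at the collation elements in DUCET or FractionalUCA.txt. The helper name letter_categories is invented for this illustration.

```python
import unicodedata

def letter_categories(ch):
    """General categories of the letters in the NFKD form of ch,
    ignoring combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return [unicodedata.category(c) for c in decomposed
            if unicodedata.category(c).startswith("L")]

# The four Latin titlecase digraphs; each has general category Lt.
latin_titlecase = ["\u01C5", "\u01C8", "\u01CB", "\u01F2"]

for ch in latin_titlecase:
    print("U+%04X" % ord(ch), unicodedata.category(ch),
          unicodedata.name(ch), letter_categories(ch))
```

Each of these decomposes to an uppercase letter (Lu) followed by a lowercase letter (Ll), so neither element on its own is "mixed".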

Richard.
Received on Wed May 23 2012 - 16:04:13 CDT
