Re: CaseFirst and CaseLevel Tailorings of UCA and LDML

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 22 May 2012 22:22:28 +0100

On Tue, 22 May 2012 08:33:43 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Tue, May 22, 2012 at 1:09 AM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:
>
> > On Mon, 21 May 2012 17:07:33 -0700
> > Markus Scherer <markus.icu_at_gmail.com> wrote:

> > > I can dig up the ICU code that computes the
> > > collation case bits for a string.

That would be helpful. I can't yet see clearly how the data gets in.

> > Is this code in ICU 4.4.2 (the version for the Linux I run), or
> > should I be looking at ICU 49?
 
> That code is in every version of ICU since we implemented the current
> collation implementation. I bet that part of the collation builder
> code has not changed significantly since ICU 1.8 in 2001... I will
> try to look for it today or tomorrow.

I can't see any relevant difference between versions 3.6 and 49. Some
of the details I'll have to clarify by running it in the debugger. There
are some odd effects that I half understand from reading the function
ucol_doCE() in ucol_bld.cpp.

The rule &™=™ seems to change U+2122 TRADE MARK SIGN from a <compat>
lowercase tertiary weight tagged as lower case to a <compat> lowercase
tertiary weight tagged as upper case! As a consequence, when
CaseFirst=uppercase is selected, it suddenly sorts before the 2-letter
string 'TM'! This seems to be because its decomposition mapping to
<TM> is examined.
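
To make the effect concrete, here is a toy model in Python. It is only a
sketch, not ICU's code or data: the primary weights, the LOWER/UPPER tags,
the tertiary values (0x04 for <compat>, 0x08 for an ordinary capital) and
the function sort_key are all assumptions made up for the illustration.
It assumes that with CaseFirst=uppercase the case tag is compared ahead
of the rest of the tertiary weight, with upper case ranking first.

```python
# Toy model only: weights and case handling are illustrative assumptions,
# not ICU's actual collation data or algorithm.
LOWER, UPPER = 0, 1

P_T, P_M = 0x10, 0x11   # hypothetical primary weights for T and M
TER_COMPAT = 0x04       # assumed <compat> tertiary weight
TER_CAPITAL = 0x08      # assumed tertiary weight of a plain capital


def sort_key(elements, upper_first):
    """Build a comparable key from (primary, case_tag, tertiary) triples.

    All primaries are compared first. The case tag is then compared
    ahead of the tertiary weight; upper_first flips the tag so that
    UPPER ranks before LOWER. Secondary weights are left out because
    they are equal for every string in this example.
    """
    primaries = tuple(p for p, _, _ in elements)
    tertiaries = tuple(
        ((case ^ 1) if upper_first else case, ter)
        for _, case, ter in elements
    )
    return (primaries, tertiaries)


# The two-letter string 'TM': ordinary capitals, tagged upper case.
tm_string = [(P_T, UPPER, TER_CAPITAL), (P_M, UPPER, TER_CAPITAL)]

# U+2122 before the tailoring: expansion to T, M tagged lower case.
tm_sign_untailored = [(P_T, LOWER, TER_COMPAT), (P_M, LOWER, TER_COMPAT)]

# U+2122 after the tailoring: same weights, but now tagged upper case.
tm_sign_tailored = [(P_T, UPPER, TER_COMPAT), (P_M, UPPER, TER_COMPAT)]

key_string = sort_key(tm_string, upper_first=True)
key_before = sort_key(tm_sign_untailored, upper_first=True)
key_after = sort_key(tm_sign_tailored, upper_first=True)

print("untailored sign sorts after 'TM': ", key_before > key_string)
print("tailored sign sorts before 'TM':  ", key_after < key_string)
```

Under these assumptions, the untailored sign sorts after 'TM' because its
case tag says lower case, while the retagged sign ties with 'TM' on the
case tag and then wins on the smaller tertiary weight.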

On the other hand, &\ua7f8=\ua7f8 has no effect on the sorting of
U+A7F8 MODIFIER LETTER CAPITAL H WITH STROKE, which continues to be
sorted as lower case. This seems to be because it is simple enough
that the CLDR root locale or DUCET is consulted for its casing, and so
it remains stuck at lower case!

I am beginning to believe that it is impossible for ICU users to tailor
U+A7F8 MODIFIER LETTER CAPITAL H WITH STROKE to be upper case!

> CLDR does not publish precise conformance tests for attributes and
> tailorings. I think it's fair to say that a particular attribute
> results in "lower case sorting before upper case" or similar without
> spelling out precisely how edge cases might behave. In my opinion, we
> should give a little bit of wriggle room to implementations.

The ICU user guide says,

"Applications that share sorted data but do not agree on how the data
should be ordered fail to perform correctly. By conforming to the
UCA/14651 standard for collation, independently developed applications,
such as those used for e-business, sort data identically and perform
properly."

> Also, the ICU User Guide should document what it does mean in ICU.

Indeed. Having read it, I wanted to read the reference manual.

Actually, for all their ugliness, ISO/IEC 14651 tailorings do seem to be
well-defined, as well as more powerful.

Richard.
Received on Tue May 22 2012 - 16:24:05 CDT
