Re: CaseFirst and CaseLevel Tailorings of UCA and LDML

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Mon, 21 May 2012 17:07:33 -0700

On Mon, May 21, 2012 at 4:37 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> What are the definitions of upper and lower case for the caseFirst
> tailoring for the UCA and for LDML? I can't find any obvious
> definition.
>

I can't find a published definition either. I suggest you submit a CLDR
ticket for this: http://unicode.org/cldr/trac/newticket

In principle, it's straightforward: Lowercase and uppercase follow Unicode
(UCD) case properties. We distinguish an intermediate "mixed case" for
titlecase characters and mixed-case contractions. I believe we also
distinguish small/normal Kana as lowercase/uppercase. I can dig up the ICU
code that computes the collation case bits for a string.
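
Until the real code is at hand, here is a rough Python sketch of the
classification described above. It is not the ICU implementation. The names
char_case and string_case, the numeric values, the name-based Kana test, and
the handling of uncased characters are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical three-way case classes; the numeric values are illustrative only.
LOWER, MIXED, UPPER = 0, 1, 2

def char_case(ch):
    """Classify one character, or return None if it looks uncased."""
    if unicodedata.category(ch) == "Lt":      # titlecase letters count as mixed
        return MIXED
    name = unicodedata.name(ch, "")
    # Assumption: small Kana behave like lowercase, normal Kana like uppercase.
    if name.startswith(("HIRAGANA LETTER", "KATAKANA LETTER")):
        return LOWER if " SMALL " in name else UPPER
    if ch.isupper():
        return UPPER
    if ch.islower():
        return LOWER
    return None

def string_case(s):
    """Classify a string such as a contraction: all lower, all upper, else mixed."""
    classes = {c for c in map(char_case, s) if c is not None}
    if not classes:
        return LOWER  # assumption: treat uncased strings like lowercase
    if len(classes) == 1:
        return classes.pop()
    return MIXED
```

With these assumptions, "ch" comes out as lower, "Ch" as mixed, and "CH" as
upper, and the titlecase digraph U+01C5 is mixed on its own.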

I don't know whether CLDR/LDML should require all of the details, but there
should at least be informative documentation.

When you turn on the case level or use a caseFirst option, these case bits
are used before (or instead of) the tertiary weights. When you use "normal"
3-level sorting, the case bits are ignored and only the tertiary weights
are used.
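
As a toy illustration of that ordering (again, not how ICU builds sort keys),
the following Python sketch compares strings level by level. The function
toy_key, its parameters, and the made-up weights are hypothetical. The
secondary level is left out, and the invented tertiary weight only tells
uppercase from lowercase, whereas real tertiary weights cover more than case.

```python
def toy_key(s, case_level=False, case_first=None):
    """Build a toy multi-level sort key for a string.

    Hypothetical weights: primary is the lowercased letter, the case bit is
    0 for lowercase and 1 for uppercase, and the tertiary weight here
    happens to encode the same distinction.
    """
    primary = tuple(ch.lower() for ch in s)
    tertiary = tuple(1 if ch.isupper() else 0 for ch in s)
    if not case_level and case_first is None:
        # "Normal" sorting: case bits ignored, only tertiary weights used.
        return (primary, tertiary)
    case = tuple(1 if ch.isupper() else 0 for ch in s)
    if case_first == "upper":
        # Invert the bits so that uppercase sorts ahead of lowercase.
        case = tuple(1 - c for c in case)
    # Case bits are compared before the tertiary weights.
    return (primary, case, tertiary)

words = ["b", "A", "B", "a"]
default_order = sorted(words, key=toy_key)
upper_first = sorted(words, key=lambda w: toy_key(w, case_first="upper"))
```

In this toy model the default order is a, A, b, B, and the upper-first order
is A, a, B, b.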

The tertiary weights themselves are separate, and based on a mix of
criteria.

Best regards,
markus

-- 
Google Internationalization Engineering
Received on Mon May 21 2012 - 19:09:31 CDT
