Re: locale-aware string comparisons

From: Richard Wordingham <>
Date: Sat, 19 Jan 2013 12:02:26 +0000

On 29 December 2012, James Cloos asked:
>> Given (just) the data in 10646, Unicode and cldr, are there any
>> locales where a case-insensitive match should be different than a
>> case-preserving match of the results of lower-casing the two
>> strings?

On Mon, 31 Dec 2012 23:29:48, "Whistler, Ken" <> wrote:

> 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR
> may have to jump in here, but while locales clearly *are* in the
> scope of LDML and CLDR, there is currently little if anything they
> have to say about specific case mapping rules.

Mark Davis has answered this in part. However, there is one set of
differences that has not been mentioned at all - digraphs treated as
letters, e.g. in Welsh and Danish. The key problem with these,
especially with "ng" in Welsh (where g < ng < h), is that sometimes the
sequence is a digraph and sometimes not. With camel case words (and a
good case for Welsh is Scottish surnames such as McHenry - 'ch' is a
digraph in Welsh, but obviously not in this name), digraphs do not
(exceptions, anyone?) straddle the case-marked boundaries.
Accordingly, in Welsh we have 'ce' < 'ci' < 'ch', 'Ce' < 'Ci' < 'Ch',
'CE' < 'CI' < 'CH', but 'cE' < 'cH' < 'cI'. A solution, if you care
greatly about correctness (CLDR does not), is to preprocess sequences
of lower case followed by upper case by inserting CGJ, i.e. U+034F
COMBINING GRAPHEME JOINER, between them. As far as I am aware, this only affects
sequences of general category Ll followed by Lu. (I haven't checked
CLDR for special collation rules for any sequences of Ll followed by
Lu - do check before using my proposed solution.)
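
The preprocessing step described above can be sketched as follows. This
is only a minimal illustration, assuming Python and its standard
unicodedata module; the function name insert_cgj_at_case_boundaries is
hypothetical, and the sketch makes no attempt to check CLDR for
tailorings that already involve an Ll + Lu sequence. The output is meant
to be fed to whatever collator you use, not displayed.

```python
import unicodedata

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER


def insert_cgj_at_case_boundaries(text: str) -> str:
    """Insert CGJ between a lowercase letter (general category Ll) and an
    immediately following uppercase letter (general category Lu).

    Hypothetical helper: the idea is that a collation contraction such
    as Welsh 'ch' should then no longer match across a camel-case
    boundary like the one in 'McHenry'.
    """
    out = []
    prev_is_lower = False
    for ch in text:
        category = unicodedata.category(ch)
        if prev_is_lower and category == "Lu":
            out.append(CGJ)
        out.append(ch)
        prev_is_lower = category == "Ll"
    return "".join(out)


if __name__ == "__main__":
    for word in ("McHenry", "ch", "Ch", "CH", "cH"):
        # Show code points so the otherwise invisible CGJ can be seen.
        processed = insert_cgj_at_case_boundaries(word)
        print(word, [f"U+{ord(c):04X}" for c in processed])
```

Only the Ll + Lu boundary is touched, so 'ch', 'Ch' and 'CH' pass
through unchanged and remain candidates for the digraph, while 'cH' and
the 'cH' inside 'McHenry' are split.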

For most languages, there are the problems that CGJ is not provided on
keyboards and that CGJ is misrendered by old rendering systems.

Received on Sat Jan 19 2013 - 06:10:58 CST
