RE: locale-aware string comparisons

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Mon, 31 Dec 2012 23:29:48 +0000

Well, in answering the question which was actually posed here:

1. ISO/IEC 10646 has absolutely nothing to say about this issue, because 10646 does not define case mapping at all.

2. The Unicode Standard *does* define case mapping, of course, as well as case folding. The relevant details are in Section 3.13 of the standard, supported by various data files in the Unicode Character Database. TUS 6.2, Section 3.13, p. 117, does define toUpperCase(X) and toLowerCase(X), but those are string mapping operations, not directly comparable to Linux (and in general Unix) toupper() and tolower(), which are character mapping functions. The closer correlates to Linux toupper() and tolower() are Unicode's definitions of Uppercase_Mapping(C) and Lowercase_Mapping(C). However, there is a significant difference lurking, in that the Unicode case mapping definitions are not locale-sensitive. The full case mappings do include two conditional sets of mappings (from SpecialCasing.txt) for Lithuanian and for Turkish and Azeri, mostly affecting the behavior of the dot on "i", but the use of those conditional mappings depends on the availability of explicit language context.
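To make the string-mapping versus character-mapping distinction concrete, here is a minimal Python sketch. It assumes Python 3's str.lower() and str.upper(), which apply the default (untailored) Unicode full case mappings and ignore the process locale. The helper name default_case_mappings is hypothetical and is used only for illustration.

```python
def default_case_mappings(text):
    """Return (lowercased, uppercased) text using Python's built-in,
    locale-independent Unicode full case mappings."""
    return text.lower(), text.upper()


# Full case mapping is a string operation: one character can map to several.
# U+00DF LATIN SMALL LETTER SHARP S uppercases to the two-letter string "SS".
sharp_s_lower, sharp_s_upper = default_case_mappings("\u00df")

# With no language context, U+0130 lowercases to <U+0069, U+0307>;
# the Turkish/Azeri conditional mappings are never applied here.
dotted_lower, dotted_upper = default_case_mappings("\u0130")

print([f"U+{ord(c):04X}" for c in sharp_s_upper])
print([f"U+{ord(c):04X}" for c in dotted_lower])
```

A character-at-a-time function in the style of toupper() cannot return a two-character result, which is one reason the two families of functions are not directly comparable.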

This contrasts with the Linux (and in general Unix) toupper() and tolower() functions, which in principle, at least, are locale-sensitive, depending on the current locale setting, and in particular on whether the LC_CTYPE category in the locale has a non-null list of mappings for toupper and/or tolower in it.

Perhaps even more importantly, the Unicode Standard does not state anything regarding the details of the behavior of the APIs strcasecmp() or tolower() or toupper() in libc. Those are the concerns of the C and POSIX specs, not the Unicode Standard. Nor could the Unicode Standard really get involved in this, precisely because that behavior involves locales, and locales are outside the scope of the Unicode Standard.

3. Regarding LDML and CLDR, somebody with specific expertise on CLDR may have to jump in here, but while locales clearly *are* in the scope of LDML and CLDR, there is currently little if anything they have to say about specific case mapping rules.

As regards the particulars of the question, I suspect that it would depend in part on how strcasecmp(), str_tolower() and str_toupper() are implemented (I am assuming string conversion APIs here based on the tolower() and toupper() APIs), but there probably *are* instances where the results would diverge. The most likely source of trouble would be Turkish case mapping. In particular, if you compare U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE to a canonically equivalent sequence of <U+0049, U+0307>, there may be conundrums. If strcasecmp() is implemented based on Turkish case folding, then strcasecmp( U+0130, <U+0049, U+0307> ) == 0. If str_tolower() is based on Turkish case mapping, then str_tolower( U+0130 ) == <U+0069, U+0307>, so strcmp( str_tolower( U+0130 ), str_tolower( <U+0049, U+0307> ) ) == 0, *but* str_toupper( U+0130 ) == U+0130 and str_toupper( <U+0049, U+0307> ) == <U+0049, U+0307>, so strcmp( str_toupper( U+0130 ), str_toupper( <U+0049, U+0307> ) ) != 0. The two uppercased versions are *canonically* equivalent, but you wouldn't expect a strcmp() operation to be checking normalization of strings. So unless the implementations of str_tolower() and str_toupper() were doing canonical normalization as well as case mapping, you could indeed find some odd edge cases for Turkish casing, at least.
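The edge case can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not a model of any libc: it uses Python 3's str.lower() and str.upper(), which apply the default Unicode mappings with no Turkish tailoring, and unicodedata.normalize() for canonical normalization. The helper nfc_upper is a hypothetical name.

```python
import unicodedata

precomposed = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
decomposed = "I\u0307"   # <U+0049, U+0307>, canonically equivalent to U+0130

# Lowercasing maps both strings to <U+0069, U+0307>, so a plain
# code point comparison of the lowercased strings succeeds.
lower_equal = precomposed.lower() == decomposed.lower()

# Uppercasing leaves both strings unchanged, so a plain comparison fails
# even though the two results are canonically equivalent.
upper_equal = precomposed.upper() == decomposed.upper()


def nfc_upper(text):
    """Uppercase, then normalize to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", text.upper())


# Adding canonical normalization makes the uppercased forms compare equal.
upper_equal_nfc = nfc_upper(precomposed) == nfc_upper(decomposed)

print(lower_equal, upper_equal, upper_equal_nfc)
```

The lowercased comparison and the uppercased comparison disagree until normalization is added, which is the kind of divergence described above.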

--Ken

> Given (just) the data in 10646, Unicode and cldr, are there any locales
> where a case-insensitive match should be different than a case-preserving
> match of the results of lower-casing the two strings?
>
> Ie, in terms of locale-aware versions of the typical libc functions,
> should strcasecmp(s1,s2) ever generate different results than
> strcmp(tolower(s1),tolower(s2)) or strcmp(toupper(s1),toupper(s2))?
> (By mentioning strcmp() et al, I do not exclude mb or w versions of
> those functions.)
>
> And to be clear, the question isn't about any specific, existing
> implementation but only about what the 10646, unicode and cldr suite
> of standards have to say on the matter.
>
> Thanks,
>
> -JimC
> --
> James Cloos <cloos_at_jhcloos.com> OpenPGP: 1024D/ED7DAEA6
Received on Mon Dec 31 2012 - 17:37:16 CST