Re: locale-aware string comparisons

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Tue, 1 Jan 2013 15:46:45 -0800

> 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR

James,
Even without locale differences, the situation is a bit tricky. Assuming
that str_tolower() and str_toupper() were straightforwardly defined in
terms of the (full) Unicode case mappings, there is still the issue that
the DUCET does not define a caseless compare. It puts case together with
other variants into a set of "Level 3" data. There are 3 approaches one can
take with a strcasecmp() straightforwardly based on LDML. I generated some
numbers for these with a quick test program, but note that they use the
CLDR root locale, which has a few changes from DUCET.

A. Define it to be just comparing after Unicode case folding.

B. Use DUCET and only compare according to Level 1 & 2. That ignores case,
but also some other features.

C. Use the case level as defined in LDML, plus Levels 1 & 2.
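
As a rough illustration of approach A only, here is a minimal Python sketch. It assumes Python's str.casefold(), which applies Unicode full case folding with no locale tailoring, and it orders by code point, not by DUCET or CLDR weights. The name caseless_cmp is made up for this example.

```python
def caseless_cmp(x: str, y: str) -> int:
    """Approach A: compare two strings after Unicode full case folding.

    Returns -1, 0 or 1. Ordering is by code point, so this is not a
    collation; it only illustrates the caseless part.
    """
    fx, fy = x.casefold(), y.casefold()
    return (fx > fy) - (fx < fy)


print(caseless_cmp("Strasse", "STRASSE"))   # 0
print(caseless_cmp("straße", "STRASSE"))    # 0, full folding maps ß to ss
print(caseless_cmp("apple", "Banana"))      # -1
```

Approaches B and C need real collation data (for example from an ICU binding), so they are not sketched here.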

All of these are different, and all of them still have over 200 differences
from either compare(lower(x),lower(y)) or compare(upper(x),upper(y)). These
are mostly because of special weighting of compatibility variants, or of the
Greek iota subscript. Example:

s < ſ, but upper( s ) = upper( ſ ) // LATIN SMALL LETTER S vs LATIN SMALL
LETTER LONG S
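
The long s case can be checked with a few lines of Python. This is only a sketch: it uses Python's built-in, locale-independent case mappings, and plain code point comparison stands in for the collation order, so it says nothing about actual DUCET or CLDR weights.

```python
s, long_s = "s", "\u017f"   # LATIN SMALL LETTER S, LATIN SMALL LETTER LONG S

print(s < long_s)                          # True (code point order: U+0073 < U+017F)
print(s.upper() == long_s.upper())         # True, both map to "S"
print(s.lower() == long_s.lower())         # False, long s is already lowercase
print(s.casefold() == long_s.casefold())   # True, folding maps long s to "s"
```

So comparing the uppercased strings and comparing the lowercased strings already disagree with each other for this pair.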

Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*

On Mon, Dec 31, 2012 at 3:29 PM, Whistler, Ken <ken.whistler_at_sap.com> wrote:

> Well, in answering the question which was actually posed here:
>
> 1. ISO/IEC 10646 has absolutely nothing to say about this issue, because
> 10646 does not define case mapping at all.
>
> 2. The Unicode Standard *does* define case mapping, of course, as well as
> case folding. The relevant details are in Section 3.13 of the standard,
> supported by various data files in the Unicode Character Database. TUS 6.2,
> Section 3.13, p. 117, does define toUpperCase(X) and toLowerCase(X), but
> those are string mapping operations, not directly comparable to Linux (and
> in general Unix) toupper() and tolower(), which are character mapping
> functions. The closer correlates to Linux toupper() and tolower() are
> Unicode's definitions of Uppercase_Mapping(C) and Lowercase_Mapping(C).
> However, there is a significant difference lurking, in that the Unicode
> case mapping definitions are not locale-sensitive. The full case mappings
> do include two conditional sets of mappings (from SpecialCasing.txt) for
> Lithuanian and for Turkish and Azeri, mostly affecting the behavior of the
> dot on "i", but the use of those conditional mappings depends on the
> availability of explicit language context.
>
> This contrasts with the Linux (and in general Unix) toupper() and
> tolower() functions, which in principle, at least, are locale-sensitive,
> depending on the current locale setting, and in particular on whether the
> LC_CTYPE category in the locale has a non-null list of mappings for toupper
> and/or tolower in it.
>
> Perhaps even more importantly, the Unicode Standard does not state
> anything regarding the details of the behavior of the APIs strcasecmp() or
> tolower() or toupper() in libc. Those are the concerns of the C and POSIX
> specs, not the Unicode Standard. Nor could the Unicode Standard really get
> involved in this, precisely because that behavior involves locales, and
> locales are outside the scope of the Unicode Standard.
>
> 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR may
> have to jump in here, but while locales clearly *are* in the scope of LDML
> and CLDR, there is currently little if anything they have to say about
> specific case mapping rules.
>
> As regards the particulars of the question, I suspect that it would depend
> in part on how strcasecmp(), str_tolower() and str_toupper() are
> implemented (I am assuming string conversions APIs here based on the
> tolower() and toupper() APIs), but there probably *are* instances where the
> results would diverge. The most likely source of trouble would be Turkish
> case mapping. In particular, if you compare U+0130 LATIN CAPITAL LETTER I
> WITH DOT ABOVE to a canonically equivalent sequence of <U+0049, U+0307>,
> there may be conundrums. If strcasecmp() is implemented based on Turkish
> case folding, then strcasecmp( U+0130, <U+0049, U+0307> ) == 0. If
> str_tolower() is based on Turkish case mapping, then str_tolower( U+0130 )
> == <U+0069, U+0307>, so strcmp( str_tolower( U+0130 ), str_tolower(
> <U+0049, U+0307> ) ) == 0, *but* str_toupper( U+0130 ) == U+0130 and
> str_toupper( <U+0049, U+0307> ) == <U+0049, U+0307>, so strcmp( str_toupper(
> U+0130 ), str_toupper( <U+0049, U+0307> ) ) != 0. The two uppercased
> versions are *canonically* equivalent, but you wouldn't expect a
> strcmp() operation to be checking normalization of strings. So unless the
> implementations of str_tolower() and str_toupper() were doing canonical
> normalization as well as case mapping, you could indeed find some odd edge
> cases for Turkish casing, at least.
>
> --Ken
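
The U+0130 case Ken describes can be reproduced roughly in Python. Note the assumption: Python's str.lower() and str.upper() use the default, untailored Unicode mappings and not the Turkish ones, but the default lowercase of U+0130 is also <U+0069, U+0307>, so the same asymmetry shows up. The helper name nfc is made up for this sketch.

```python
import unicodedata

precomposed = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
decomposed = "I\u0307"   # U+0049 followed by COMBINING DOT ABOVE


def nfc(text: str) -> str:
    """Return the canonical composed form (NFC) of text."""
    return unicodedata.normalize("NFC", text)


# Lowercasing: both end up as <U+0069, U+0307>.
print(precomposed.lower() == decomposed.lower())             # True

# Uppercasing: both stay unchanged, so a code point compare sees a difference.
print(precomposed.upper() == decomposed.upper())             # False

# After canonical normalization the uppercased strings match.
print(nfc(precomposed.upper()) == nfc(decomposed.upper()))   # True
```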
>
> > Given (just) the data in 10646, Unicode and cldr, are there any locales
> > where a case-insensitive match should be different than a case-preserving
> > match of the results of lower-casing the two strings?
> >
> > Ie, in terms of locale-aware versions of the typical libc functions,
> > should strcasecmp(s1,s2) ever generate different results than
> > strcmp(tolower(s1),tolower(s2)) or strcmp(toupper(s1),toupper(s2))?
> > (By mentioning strcmp() et al, I do not exclude mb or w versions of
> > those functions.)
> >
> > And to be clear, the question isn't about any specific, existing
> > implementation but only about what the 10646, unicode and cldr suite
> > of standards have to say on the matter.
> >
> > Thanks,
> >
> > -JimC
> > --
> > James Cloos <cloos_at_jhcloos.com> OpenPGP: 1024D/ED7DAEA6
Received on Tue Jan 01 2013 - 17:52:15 CST
