Re: locale-aware string comparisons from Philippe Verdy on 2012-12-29 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 29 Dec 2012 17:49:17 +0100

Case-insensitive searches should not use tolower() or toupper() to convert
strings before comparing them. Yes, cases where the results differ do
exist. They arise because case mappings do not always come in simple
pairs, and because converting to lowercase or uppercase can drop
distinctions other than case (e.g. the final sigma in Greek, some rules
for the German Eszett, or the long s and its ligatures).
It is safer to use case folding, which does not enforce conversion to
lowercase and preserves other semantics.
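
To make the difference concrete, here is a minimal sketch in Python, chosen
only because its str type exposes lower(), upper() and casefold() directly.
The helper names and the sample pairs are illustrative assumptions, not
anything defined by Unicode or CLDR, and the sketch ignores normalization
and locale tailoring (e.g. Turkish dotted/dotless i).

```python
# Sketch: compare three ways of doing a "case-insensitive" equality test.
# Helper names are hypothetical; only str.lower/upper/casefold are built in.

def lower_equal(a: str, b: str) -> bool:
    """Compare after lowercasing both strings."""
    return a.lower() == b.lower()


def upper_equal(a: str, b: str) -> bool:
    """Compare after uppercasing both strings."""
    return a.upper() == b.upper()


def fold_equal(a: str, b: str) -> bool:
    """Compare after Unicode default (full) case folding."""
    return a.casefold() == b.casefold()


# Sample pairs taken from the characters mentioned above.
SAMPLES = [
    ("Stra\u00dfe", "STRASSE"),  # German sharp s (U+00DF) vs. "SS"
    ("\u03bf\u03b4\u03bf\u03c2", "\u03bf\u03b4\u03bf\u03c3"),  # final vs. medial sigma
    ("\u017f", "s"),  # long s (U+017F) vs. ordinary s
]

if __name__ == "__main__":
    for a, b in SAMPLES:
        print(repr(a), repr(b),
              "lower:", lower_equal(a, b),
              "upper:", upper_equal(a, b),
              "fold:", fold_equal(a, b))
```

For each of these pairs the lowercase comparison reports a mismatch while
the case-folded comparison reports a match. The uppercase comparison
happens to agree with case folding on these samples, but that is not
guaranteed in general.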

2012/12/29 James Cloos <cloos_at_jhcloos.com>

> Given (just) the data in 10646, Unicode and cldr, are there any locales
> where a case-insensitive match should be different than a case-preserving
> match of the results of lower-casing the two strings?
>
> Ie, in terms of locale-aware versions of the typical libc functions,
> should strcasecmp(s1,s2) ever generate different results than
> strcmp(tolower(s1),tolower(s2)) or strcmp(toupper(s1),toupper(s2))?
> (By mentioning strcmp() et al, I do not exclude mb or w versions of
> those functions.)
>
> And to be clear, the question isn't about any specific, existing
> implementation but only about what the 10646, unicode and cldr suite
> of standards have to say on the matter.
>
> Thanks,
>
> -JimC
> --
> James Cloos <cloos_at_jhcloos.com> OpenPGP: 1024D/ED7DAEA6
>
>
Received on Sat Dec 29 2012 - 10:53:40 CST
