If you really only have the routines toUpper() and toLower(), and you
are trying to do a caseless comparison, then you have to use:
normalForm = toUpper(toLower(source)); // or toLower(toUpper(source));
This takes account of all of the characters that have many-to-one case
mappings, whether they be uppercase or lowercase (it is not just limited
You can also use the information in the Unicode character database to
generate a more efficient version of the above method.
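
As a rough illustration of the above (a sketch only, not taken from any particular library): here Python's built-in str.lower() and str.upper() stand in for toLower() and toUpper(), and the names round_trip_fold, caseless_equal, FOLD_TABLE and table_fold are made up for this example. The precomputed table is one possible reading of "a more efficient version"; it simply caches the round-trip result per code point rather than parsing the database files directly.

```python
import sys

LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S


def round_trip_fold(text: str) -> str:
    """Fold case using only lower() and upper(): toUpper(toLower(source))."""
    return text.lower().upper()


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after the round-trip fold."""
    return round_trip_fold(a) == round_trip_fold(b)


# Lowercasing alone keeps LONG S distinct from S and s.
print("S".lower() == LONG_S.lower())  # False
# The round trip maps all three onto the same form.
print(caseless_equal("S", LONG_S))    # True
print(caseless_equal("s", LONG_S))    # True

# Hypothetical "more efficient version": compute the round trip once per
# code point and keep only the entries that actually change.
FOLD_TABLE = {
    cp: folded
    for cp in range(sys.maxunicode + 1)
    if (folded := round_trip_fold(chr(cp))) != chr(cp)
}


def table_fold(text: str) -> str:
    """Fold via the precomputed table in a single pass."""
    return text.translate(FOLD_TABLE)
```

Building the table walks every code point once, so it is a one-time startup cost; after that each comparison is a single translate() pass per string. Python also ships str.casefold(), which may be the simpler choice when it is available.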
Gary Roberts wrote:
> Thanks for the information, but I don't understand why this is
> important for a `case-folded' `loose comparison'.
> From a user standpoint, they are asking for a case blind comparison.
> What characters do they want to be equal?
> For example, we have:
> U+0053 LATIN CAPITAL LETTER S
> U+0073 LATIN SMALL LETTER S
> U+017F LATIN SMALL LETTER LONG S
> 1. If I map to upper case, then these all map to U+0053, and are
> equal.
> 2. If I map to lower case, then U+0053 is equal to U+0073, but these
> are different from U+017F.
> My intuitive understanding of case blind comparison agrees with 1,
> and would be surprised by 2.
> (One could argue that I should have already mapped U+017F to U+0073
> before ever considering case, but I am rather reluctant to do this,
> because case sensitive comparison is often used for exact matching.)
> So, mapping to upper case seems to provide what I would expect users
> to want (should any of our users ever have U+017F in their database
> ----- Begin Included Message -----
> From: email@example.com (Kenneth Whistler)
> Subject: Re: Case blind comparison
> > In The Unicode Standard, Version 2.0, section 4.1, it states that
> > "Because there are many more lowercase forms than there are
> > uppercase or titlecase, it is recommended that the lowercase form
> > be used for
> > normalization, such as when strings are case-folded for loose
> > comparison or indexing." It appears to me that the uppercase version
> > should be used for exactly that reason. Can anyone explain to
> > me the advantage of mapping to lower case in general?
> The situation is as follows. There are many instances of lowercase
> Latin letters which do not have an uppercase form. There are
> no instances of uppercase Latin letters which do not have a lowercase
> form encoded as a character in Unicode.
> If you normalize to lowercase, then it will in general be true
> that forall c in s islower(c) is TRUE. Whereas, if you normalize
> to uppercase, then the corresponding statement is not true, i.e.
> (forall c in s isupper(c) is TRUE) is FALSE (or TRUE, depending).
> If you look at the history of Latin typography, in the more distant
> past it used to be that the majuscule was the unmarked form.
> ("unmarked" and "marked" here are used in their structuralist sense.)
> The minuscule forms were introduced in calligraphy and formed
> the basis for the lowercase in typography when case distinctions
> started to become a standard part of Western European language
> But in the modern era, the situation has reversed. It is the
> lowercase forms which are the unmarked forms of the letters.
> This has been deeply influenced by IPA, which greatly extended
> the range of lowercase baseform letters, without introducing
> corresponding uppercase letters (in principle, since case
> distinctions are irrelevant in phonetic transcription). When
> IPA was used as the basis for other standard Latin-based
> orthographies, especially in Africa, uppercase letters corresponding
> to some of the lowercase forms started to be invented, so that
> uppercase/lowercase orthographical conventions could be
> carried across to the new orthographies.
> I suspect that the engineering practice of normalizing to
> uppercase derives from the days of 5-bit and 6-bit codes.
> In 5-bit code (remember telegrams printed by teletypes?),
> there were *only* uppercase letters A-Z. If you are normalizing
> between a 5-bit code and a 6-bit code, you must uppercase,
> because that is the only case in common. This practice of
> normalizing to uppercase became a convention that could be
> carried harmlessly into 7-bit and 8-bit codes, since they
> always contained case pairs for the added letters. It basically
> was a harmless choice to go either way, and established
> convention ruled.
> Normalizing to uppercase should, however, be rethought now in
> the context of the Universal Character Set, where case pairs
> for all letters are not automatically available, and where
> normalizing to the unmarked case (lowercase) is the preferable
> --Ken Whistler
> ----- End Included Message -----