Re: Case blind comparison

From: Gary Roberts (gar@sparc.SanDiegoCA.NCR.COM)
Date: Tue Jul 29 1997 - 20:12:06 EDT


Thanks for the information, but I don't understand why this is
important for a `case-folded' `loose comparison'.

From a user standpoint, they are asking for a case blind comparison.
What characters do they want to be equal?

For example, we have:

U+0053 LATIN CAPITAL LETTER S
U+0073 LATIN SMALL LETTER S
U+017F LATIN SMALL LETTER LONG S

1. If I map to upper case, then these all map to U+0053, and are therefore
equal.

2. If I map to lower case, then U+0053 is equal to U+0073, but these
are different from U+017F.

My intuitive understanding of case blind comparison agrees with 1,
and I would be surprised by 2.
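The two outcomes can be checked with a minimal sketch, assuming Python's
built-in str.upper() and str.lower() as the simple case mappings. The helper
name equivalence_classes is hypothetical and only for illustration.

```python
# The three characters from the example above.
LETTERS = ["\u0053", "\u0073", "\u017F"]  # S, s, long s


def equivalence_classes(chars, fold):
    """Group characters that become identical under the given fold."""
    classes = {}
    for ch in chars:
        classes.setdefault(fold(ch), []).append(f"U+{ord(ch):04X}")
    return sorted(classes.values())


# Case 1: map to upper case. All three land on U+0053.
by_upper = equivalence_classes(LETTERS, str.upper)

# Case 2: map to lower case. U+017F stays apart from the other two.
by_lower = equivalence_classes(LETTERS, str.lower)

print("upper:", by_upper)
print("lower:", by_lower)
```

Mapping to upper case yields a single class, while mapping to lower case
yields two, which is the difference between 1 and 2.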

(One could argue that I should have already mapped U+017F to U+0073
 before ever considering case, but I am rather reluctant to do this,
 because case sensitive comparison is often used for exact matching.)

So, mapping to upper case seems to provide what I would expect users
to want (should any of our users ever have U+017F in their database
application).

----- Begin Included Message -----

From: kenw@sybase.com (Kenneth Whistler)
Subject: Re: Case blind comparison

>
> In The Unicode Standard, Version 2.0, section 4.1, it states that
> "Because there are many more lowercase forms than there are uppercase
> or titlecase, it is recommended that the lowercase form be used for
> normalization, such as when strings are case-folded for loose
> comparison or indexing." It appears to me that the uppercase version
> should be used for exactly that reason. Can anyone explain to
> me the advantage of mapping to lower case in general?

The situation is as follows. There are many instances of lowercase
Latin letters which do not have an uppercase form. There are
no instances of uppercase Latin letters which do not have a lowercase
form encoded as a character in Unicode.

If you normalize to lowercase, then it will in general be true
that (forall c in s: islower(c)) is TRUE. Whereas, if you normalize
to uppercase, then the corresponding statement does not hold in
general, i.e. (forall c in s: isupper(c)) may be TRUE or FALSE,
depending on whether s contains a letter with no uppercase form.
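A small sketch of this asymmetry, assuming Python's str methods stand in for
islower/isupper and using U+0138 LATIN SMALL LETTER KRA as one example of a
lowercase Latin letter with no uppercase form. The helper names all_lower and
all_upper are hypothetical.

```python
KRA = "\u0138"  # LATIN SMALL LETTER KRA: lowercase, no uppercase counterpart


def all_lower(s):
    """True if every letter in s is lowercase."""
    return all(c.islower() for c in s if c.isalpha())


def all_upper(s):
    """True if every letter in s is uppercase."""
    return all(c.isupper() for c in s if c.isalpha())


sample = "Ab" + KRA
lowered = sample.lower()   # every letter has a lowercase form
uppered = sample.upper()   # KRA is left unchanged by uppercasing

print(all_lower(lowered))  # the lowercase invariant holds
print(all_upper(uppered))  # the uppercase invariant fails because of KRA
print(all_upper("Ab".upper()))  # it holds only when no such letter occurs
```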

If you look at the history of Latin typography, in the more distant
past it used to be that the majuscule was the unmarked form.
("unmarked" and "marked" here are used in their structuralist sense.)
The minuscule forms were introduced in calligraphy and formed
the basis for the lowercase in typography when case distinctions
started to become a standard part of Western European language
orthography.

But in the modern era, the situation has reversed. It is the
lowercase forms which are the unmarked forms of the letters.
This has been deeply influenced by IPA, which greatly extended
the range of lowercase baseform letters, without introducing
corresponding uppercase letters (in principle, since case
distinctions are irrelevant in phonetic transcription). When
IPA was used as the basis for other standard Latin-based
orthographies, especially in Africa, uppercase letters corresponding
to some of the lowercase forms started to be invented, so that
uppercase/lowercase orthographical conventions could be
carried across to the new orthographies.

I suspect that the engineering practice of normalizing to
uppercase derives from the days of 5-bit and 6-bit codes.
In 5-bit code (remember telegrams printed by teletypes?),
there were *only* uppercase letters A-Z. If you are normalizing
between a 5-bit code and a 6-bit code, you must uppercase,
because that is the only case in common. This practice of
normalizing to uppercase became a convention that could be
carried harmlessly into 7-bit and 8-bit codes, since they
always contained case pairs for the added letters. It basically
was a harmless choice to go either way, and established
convention ruled.

Normalizing to uppercase should, however, be rethought now in
the context of the Universal Character Set, where case pairs
for all letters are not automatically available, and where
normalizing to the unmarked case (lowercase) is the preferable
alternative.

--Ken Whistler

----- End Included Message -----


