You should normalize the full Unicode character range.
Because of the reversibility problems, most comparisons normalize to UPPER
case. For example:
ss = SS
SS = SS
ß = SS
This works for all words and all locales.
However, SS = ss or ß depending on the word when shifting to lower case.
The Greek lowercase final sigma has a different Unicode character.
There was a reason that the Apple II did not support lower case ;-}
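As a rough illustration of the upper-case comparison described above, here is a minimal Python sketch. The helper name same_ignoring_case is made up for this example, and it relies only on Python's built-in, locale-independent str.upper().

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after shifting both to upper case."""
    return a.upper() == b.upper()


# The sharp s has no single-character capital in the default mapping,
# so upper-casing expands it to SS.
print("ß".upper())                              # SS
print(same_ignoring_case("straße", "STRASSE"))  # True

# The medial and the final lowercase sigma share one capital letter.
print("σ".upper() == "ς".upper())               # True
```

Going the other way, nothing in the string "STRASSE" says whether the lower-case form should contain ss or ß, which is the reversibility problem mentioned above.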
Now the hard part: dotted and dotless i. If the locale uses the Turkish
dotless i then
i = İ
ı = I
All other locales shift both i and ı to I, making the shift to
upper case non-reversible. However, if the name is entered in mixed case
then the dotted capital İ will be preserved on capital letters.
İstanbul = İSTANBUL
If you did your compares in lower case then this would not work.
I = i
İ = i
But you would change both to i and not distinguish between the two letters.
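The dotted and dotless i behaviour can be sketched in Python as follows. Python's str.upper() is not locale-aware, so the Turkish rule is applied by hand here; the helper upper_for_compare and its turkic flag are hypothetical names for this sketch, not part of any library.

```python
# Turkish-style upper-casing: i -> İ (U+0130) and ı (U+0131) -> I.
_TURKIC_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})


def upper_for_compare(text: str, turkic: bool = False) -> str:
    """Upper-case text, optionally applying the Turkish i rules first."""
    if turkic:
        text = text.translate(_TURKIC_UPPER)
    return text.upper()


# Default mapping: both lowercase letters collapse onto plain I.
print(upper_for_compare("i") == upper_for_compare("\u0131"))  # True

# Turkish mapping: the two letters stay distinct.
print(upper_for_compare("istanbul", turkic=True))  # İSTANBUL
print(upper_for_compare("\u0131rmak", turkic=True))  # IRMAK
```

A name already typed with a capital İ keeps it under either setting, since upper-casing leaves İ unchanged.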
If you are shifting to lower case just to eliminate upper case characters
because the user is expected to use lowercase only, then this is another
matter. In any case consider implementing it for all Unicode characters
that have case.
From: email@example.com [mailto:firstname.lastname@example.org]
Sent: Thursday, September 28, 2000 8:43 AM
To: Unicode List
Subject: New Name Registry Using Unicode
Readers of this list may be interested in efforts to set up an upper-level
internet name registry (XNS) based on XML 1.0, Unicode 2.0, and Java 1.2,
which intends to allow names composed with a large subset of Unicode 2.0
Info is at
Below are details of the name specification from their "white papers":
In XNS 1.0, XNS personal, business, and general names all follow the same
Names can be up to 64 characters of XML text (Unicode 2.0 characters as
defined by the W3C XML 1.0 specification).
For purposes of name representation, all characters are legal except the
XNS global namespace prefix characters "=", "@", "+", the namespace
delimiter character "/", and the XML markup tag delimiter characters "<"
For purposes of name registration uniqueness, the only significant
characters are numbers and letters as defined by the Java isLetterOrDigit
function returning TRUE. This function determines if a character is a
letter or digit according to the Unicode 2.0 standard (category "Lu", "Ll",
"Lt", "Lm", "Lo", or "Nd" in the Unicode specification data file). For the
full specification, see Gosling, Joy, and Steele, The Java Language
Letters in the ASCII range are normalized to lower case. (In XNS 1.0, case
normalization is not applied to any other Unicode character range.)
To illustrate these rules, the following name representations all normalize
to the same name:
John Doe, Jr.
John Doe Jr
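The uniqueness rules quoted above can be approximated with the short Python sketch below. It is an illustration under assumptions, not the XNS implementation: xns_key is an invented name, and Python's unicodedata module reflects a much newer Unicode version than 2.0, so edge cases may differ from Java's isLetterOrDigit.

```python
import unicodedata

# General categories the quoted text lists for letters and digits.
SIGNIFICANT = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}


def xns_key(name: str) -> str:
    """Keep only letters and digits; lower-case ASCII letters only."""
    kept = []
    for ch in name:
        if unicodedata.category(ch) in SIGNIFICANT:
            # Case normalization is limited to the ASCII range A-Z.
            kept.append(ch.lower() if "A" <= ch <= "Z" else ch)
    return "".join(kept)


print(xns_key("John Doe, Jr."))  # johndoejr
print(xns_key("John Doe Jr"))    # johndoejr

# Outside ASCII no case folding happens, so these remain distinct names.
print(xns_key("Ärger") == xns_key("ärger"))  # False
```

The last line shows the limitation the reply objects to: names that differ only in the case of a non-ASCII letter register as different names.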