(You replied privately; was this intentional? If not, you can resend it to
the list, and I will re-send this one).
> >A better choice, IMHO, would be to normalize by *decomposition*. In this
> >way, the problem above would be addressed by rule 3 below.
> I think you have a very good point. This occurred to me also. The
> I could not answer is what locale do I use? What normalization rules do I
You can use *no* locale. We are not talking about normal text, but about
identifiers of Internet sites. The conversion must therefore be uniform for
all the world.
The normalization should be a multi-step process.
For the first step, I see only one alternative: *compatibility*
*decomposition*, that is part of the Unicode standard and is not bound to
any specific language or locale. *Canonical* decomposition is out of place,
because the goal here is not preserving text (no one will see the result of
normalization, anyway), but maximizing matches.
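A minimal sketch of this first step, assuming Python's standard
unicodedata module (the function name compat_decompose is hypothetical and
only for illustration). It shows why compatibility decomposition matches
more than canonical decomposition does:

```python
import unicodedata

def compat_decompose(label: str) -> str:
    """Step one: locale-independent compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", label)

# Canonical decomposition (NFD) keeps compatibility characters such as the
# "fi" ligature (U+FB01) intact; compatibility decomposition (NFKD) splits it.
ligature = "\ufb01"
canonical = unicodedata.normalize("NFD", ligature)   # still U+FB01
compatible = compat_decompose(ligature)              # plain "f" + "i"

# Precomposed accented letters become base letter + combining mark,
# which lets a later step decide whether to drop the mark.
e_acute = compat_decompose("\u00e9")                 # "e" + U+0301
```

No locale argument is involved anywhere, which is the point made above.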
In the second step, all characters that are not essential should be trimmed
out. This includes spaces, punctuation, and characters not normally read
aloud (e.g. the trademark symbol). This includes all diacritic marks
that can be avoided (and this is where the problems pop in, as you notice,
because the same diacritic may be essential to a language but optional to
The third step should be a further cut-off of differences. The main part of
it would be case- and kana- folding (drop the difference between uppercase
and lowercase, and between katakana and hiragana).
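The trimming and folding steps could be sketched roughly as follows. This
is only an illustration under stated assumptions: the function name
fold_label and the constants are hypothetical, the trimming rule is the
crudest possible one (drop every mark, separator and punctuation
character), and kana folding is done by shifting the katakana block onto
the hiragana block.

```python
import unicodedata

# Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana twins.
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6
KANA_OFFSET = 0x60

def fold_label(label: str) -> str:
    """Decompose, trim non-essential characters, then fold case and kana."""
    decomposed = unicodedata.normalize("NFKD", label)
    # Assumption: drop ALL marks (M*), separators (Z*) and punctuation (P*).
    # This is deliberately naive: it also removes marks that are essential
    # in some languages, which is exactly the difficulty discussed here.
    kept = [ch for ch in decomposed if unicodedata.category(ch)[0] not in "MZP"]
    folded = "".join(kept).casefold()  # locale-independent case folding
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST
        else ch
        for ch in folded
    )
```

With this sketch, "Ünïcode Consortium!" and "unicodeconsortium" fold to the
same string, and half-width katakana, full-width katakana and hiragana
spellings of the same syllables all end up as hiragana.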
But the last step should go a little further than this: all characters
that "look the same" must be unified, for obvious reasons. It would be
suicide, for instance, to allow Cyrillic letters like a, B, c, e, H, i, j,
K, M, n, o, p, s, T, u, x, or y to be distinguished from the Latin letters
of the same shape. People could use this to forge fraudulent web sites
(e.g., www.unicode.org, where one or both of the two "o"s and the "e" are
> If we can't even do case shifting with out a locale. (The Turkish dotless
> and dotted ?) How can we decide what is a letter? If ü = u then is å =
> How about ñ = n?
> The problems is that there is no easy solution. It might be part of the
> Danes inherent good humor to start and end their alphabet with letter a
> they won't think it is funny to change æ to ae, ø to o or å to a. Like
> Vietnamese letter â is a letter where in most languages the circumflex is
I see your point, but you should keep in mind that nobody (apart, maybe,
implementers and administrators of DNS servers) will ever see the result of
this "normalization". So we don't have any display or spell-checking problem
Your example with Danish "æ" being converted to "ae" is one that I wanted to
use to defend the opposite point of view!
How many Danish words contain an "a" followed by an "e"? And, which is more
important, how many *pairs* of Danish words are distinguished only by the
fact that the first one contains an "æ" where the second one contains an "ae"
sequence? And how many of these minimal pairs would create a problem in a
If the answers to all these (rhetorical) questions are what I think, then it is
perfectly OK to convert "æ" to "ae".
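One practical detail: "æ" has no decomposition in the Unicode data, so
compatibility decomposition alone leaves it untouched and the conversion
needs an explicit mapping. A small sketch, assuming Python's unicodedata
module (the table EXTRA_EXPANSIONS and the function expand_ligatures are
hypothetical names, and the table holds only the one pair discussed here):

```python
import unicodedata

# Hypothetical extra table for letters that NFKD does not decompose.
EXTRA_EXPANSIONS = str.maketrans({"\u00e6": "ae", "\u00c6": "AE"})

def expand_ligatures(label: str) -> str:
    """Apply NFKD, then expand letters that have no Unicode decomposition."""
    return unicodedata.normalize("NFKD", label).translate(EXTRA_EXPANSIONS)

# NFKD alone does nothing to U+00E6 ...
unchanged = unicodedata.normalize("NFKD", "\u00e6")
# ... so the explicit table is what turns "Læge" into "Laege".
expanded = expand_ligatures("L\u00e6ge")
```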
However, your other examples are not as straightforward. For most languages,
it would be crazy to maintain the ^ on "â" (in modern French or Portuguese,
for instance, the ^ accent carries almost no phonetic significance, so it is
very likely that people may omit it in informal typing), but in a language
like Vietnamese the presence/absence of "^" makes up hundreds (or
thousands?) of minimal pairs, and it would be very annoying if company
"Tieng Viêt" could not register their domain because company "Tiêng Viet"
already did it!
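The collision is easy to reproduce. A minimal sketch, assuming Python's
unicodedata module and a naive rule that drops every nonspacing mark after
decomposition (strip_marks is a hypothetical helper, not part of any
proposal):

```python
import unicodedata

def strip_marks(name: str) -> str:
    """Decompose, then drop every nonspacing mark (general category Mn)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

first = strip_marks("Tieng Vi\u00eat")   # "Tieng Viêt"
second = strip_marks("Ti\u00eang Viet")  # "Tiêng Viet"
collide = first == second                # both reduce to "Tieng Viet"
```

Under this rule the two company names become indistinguishable, so only one
of them could be registered.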
But using a locale is by no means a solution (how do you tag a domain name
with the proper locale information to drive an ad-hoc normalization?).
So, I am afraid that a compromise has to be sought, and it will have to
sacrifice something (e.g., the distinction between dotted and dotless "i"!).
Whatever its limitations, it will still be better than the proposal by XNS.
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:14 EDT