Re: New Name Registry Using Unicode

From: Mark Davis (mark@macchiato.com)
Date: Mon Oct 02 2000 - 12:40:15 EDT


There are a number of similarities between this XNS and IDN, so
http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-00.txt would be
worth reading.

On locales: using them is dangerous for matching. The only reason to add a
locale is to make a difference in which letters match. But that opens up a
huge can of worms, since non-uniform matches will make the results
unpredictable. For example, having the matching conventions differ depending
on whether the locale is French or not would mean always having to guess the
locale of the server hosting that name, and would put a huge burden on the
resolvers (if XNS works anything like DNS). When I see a name on a
billboard, how do I guess what the locale is?

Moreover, we only have to look at the number of pages that have untagged
languages and character sets to see that incorrect locale tagging will
happen. The safest approach is to have only the characters in the name
itself be significant for matching, and have a uniform "folding" of
characters for matching purposes. (see
http://www.unicode.org/unicode/reports/tr21/charts/)
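
A minimal sketch of such a uniform fold, assuming Python's built-in
str.casefold() and unicodedata.normalize() as stand-ins for the folding and
normalization tables; the function name fold_for_matching is hypothetical.
No locale is passed in anywhere, so the same name always gives the same key:

```python
import unicodedata


def fold_for_matching(name: str) -> str:
    """Return a locale-independent matching key for a name.

    Assumption: NFKC plus Unicode full case folding is an acceptable
    stand-in for the uniform folding discussed above.
    """
    normalized = unicodedata.normalize("NFKC", name)
    folded = normalized.casefold()
    # Case folding can produce text that is no longer normalized,
    # so normalize once more before using the result as a key.
    return unicodedata.normalize("NFKC", folded)
```

Only the characters of the name itself feed into the key; the original
spelling can still be stored alongside it for display.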

On matching: as in nameprep, you need to check before performing
normalization. Since some characters change behavior when normalized, one
would, for example, want to check for the "" before decomposing it. NFKC is
most useful when the input character domain is limited to letters, marks,
punctuation and numbers (general categories L*, M*, Nd, Nl, P*: see
http://www.unicode.org/unicode/reports/tr24/charts/ and
http://www.unicode.org/unicode/reports/tr15/charts/).
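
A rough sketch of "check first, then normalize", assuming Python's
unicodedata module; the helper names are hypothetical, and U+2122 TRADE MARK
SIGN is used purely as an illustration of a character that changes behavior
under NFKC (it is a symbol before normalization and the letters "TM" after):

```python
import unicodedata

# General categories allowed in the input domain: L*, M*, Nd, Nl, P*
ALLOWED_MAJOR = ("L", "M", "P")
ALLOWED_EXACT = {"Nd", "Nl"}


def in_input_domain(ch: str) -> bool:
    """True if the character's general category is in the allowed set."""
    cat = unicodedata.category(ch)
    return cat in ALLOWED_EXACT or cat[0] in ALLOWED_MAJOR


def check_then_normalize(name: str) -> str:
    """Reject out-of-domain characters before applying NFKC."""
    bad = [ch for ch in name if not in_input_domain(ch)]
    if bad:
        codes = ", ".join(f"U+{ord(ch):04X}" for ch in bad)
        raise ValueError(f"characters outside the input domain: {codes}")
    return unicodedata.normalize("NFKC", name)
```

If the check ran after normalization instead, the trademark sign would have
already turned into ordinary letters and slipped through.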

On confusables. The letters with confusable shapes are a bit up in the air:
>a, B, c, e, H, i, j, K, M, n, o, p, s, T, u, x, or y
If the folding is only used for the purpose of matching, and the "canonical"
name on the server retains the original characters, then this could be done.
There is, however, no definitive list of such confusables across all Unicode
characters in the input domain.
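
To illustrate folding for matching only while the registry keeps the
original characters, here is a small sketch. The mapping table is a tiny,
hand-picked assumption covering a few Cyrillic lowercase letters, not a
definitive list, and the names matching_key and register are hypothetical:

```python
# Partial, illustrative table: Cyrillic letters that look like Latin ones.
CONFUSABLE_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def matching_key(name: str) -> str:
    """Case-fold, then map look-alike letters onto one representative."""
    return "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in name.casefold())


def register(registry: dict, name: str) -> bool:
    """Store the original name under its folded key; refuse collisions."""
    key = matching_key(name)
    if key in registry:
        return False
    registry[key] = name  # the "canonical" name keeps its original characters
    return True
```

With this table, a name spelled with a Cyrillic "o" collides with the
all-Latin spelling, while the stored value is still whatever was registered
first.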

On Turkish i. With a uniform case folding, "I" and "i" must fold together.
For the Turkish i's, you only have two choices. (Note: the sample words
below are made up for illustration, apologies if they happen to be
inappropriate words in Turkish.)

Option A. Fold dotted uppercase I (İ) and dotless lowercase i (ı) together
with regular i and I.

Upside: casing works for Turkish
    "BIT" matches "bıt", and "BİT" matches "bit".
Downside: dotless i is not distinguished from dotted i in registration.
    you cannot register two distinct names "bit" and "bıt"

Option B. Don't fold dotted uppercase I (İ) and dotless lowercase i (ı)
together with regular i and I.

Downside: casing doesn't work for Turkish:
    "BIT" does not match "bıt", and "BİT" does not match "bit", even though
    "BET" matches "bet"
Upside: dotless i is distinguished from dotted i in registration.
    you can register two distinct names "bit" and "bıt"
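
The two options can be sketched as follows, assuming Python's str.casefold()
as the uniform case folding (it leaves U+0131 alone and folds U+0130 to "i"
plus a combining dot, which is Option B's behavior). The names fold_a and
fold_b are hypothetical:

```python
# U+0130 = LATIN CAPITAL LETTER I WITH DOT ABOVE
# U+0131 = LATIN SMALL LETTER DOTLESS I
_TURKISH_I_TO_PLAIN = str.maketrans({"\u0130": "i", "\u0131": "i"})


def fold_a(name: str) -> str:
    """Option A: the Turkish i's fold together with regular i and I."""
    return name.translate(_TURKISH_I_TO_PLAIN).casefold()


def fold_b(name: str) -> str:
    """Option B: the Turkish i's stay distinct from regular i and I."""
    return name.casefold()


# Option A: Turkish casing works, but the two names collide.
print(fold_a("BIT") == fold_a("b\u0131t"))   # dotless pair
print(fold_a("B\u0130T") == fold_a("bit"))   # dotted pair
print(fold_a("bit") == fold_a("b\u0131t"))   # collision in registration

# Option B: the names stay distinct, but Turkish casing does not match.
print(fold_b("BIT") == fold_b("b\u0131t"))
print(fold_b("bit") == fold_b("b\u0131t"))
```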

The informal feedback I had gotten was that (given only these options!)
Turks would prefer (A) over (B). If people canvass their Turkish colleagues,
that would give us more information.

----- Original Message -----
From: "Carl W. Brown" <cbrown@xnetinc.com>
To: "Unicode List" <unicode@unicode.org>
Sent: Monday, October 02, 2000 07:55
Subject: RE: New Name Registry Using Unicode

> Marco,
>
> It would certainly seem that the optimal solution would be to carry the
> locale.
>
> Then you normalize according to the rules of the locale. Besides, the
> locale could aid in the search. You would only have to be unique for your
> locale.
>
> The drawback is that every search engine would have to be Unicode smart.
> But how many XNS servers will there be? If it is like DNS then you could
> make the locale the major key and route to the appropriate XNS server to
> process. It would provide a natural way to segment the database. This
> would be efficient since most traffic is intra-locale.
>
> Carl
>
>
> -----Original Message-----
> From: Marco.Cimarosti@icl.com [mailto:Marco.Cimarosti@icl.com]
> Sent: Monday, October 02, 2000 6:53 AM
> To: Unicode List
> Subject: RE: New Name Registry Using Unicode
>
>
> Hi, Carl.
>
> (You replied privately; was this intentional? If not, you can resend it to
> the list, and I will re-send this one).
>
> > >A better choice, IMHO, would be to normalize by *decomposition*. In
> > >this way, the problem above would be addressed by rule 3 below.
>
> > I think you have a very good point. This occurred to me also. The
> > question I could not answer is what locale do I use? What normalization
> > rules do I use?
>
> You can use *no* locale. We are not talking about normal text, but about
> identifiers of Internet sites. The conversion must therefore be uniform
> for all the world.
>
> The normalization should be a multi-step process.
>
> For the first step, I see only one alternative: *compatibility*
> *decomposition*, which is part of the Unicode standard and is not bound to
> any specific language or locale. *Canonical* decomposition is out of
> place, because the goal here is not preserving text (no one will see the
> result of normalization, anyway), but maximizing matches.
>
> In the second step, all characters that are not essential should be
> trimmed out. This includes spaces, punctuation, and characters not
> normally read aloud (e.g. the trademark symbol). This also includes all
> diacritic marks that can be avoided (and this is where the problems pop
> in, as you notice, because the same diacritic may be essential to one
> language but optional to another).
>
> The third step should be a further cut-off of differences. The main part
> of it would be case- and kana-folding (drop the difference between
> uppercase and lowercase, and between katakana and hiragana).
>
> But the last step should go a little further than this: all characters
> that "look the same" must be unified, for obvious reasons. It would be
> suicide, for instance, to allow Cyrillic letters like a, B, c, e, H, i, j,
> K, M, n, o, p, s, T, u, x, or y to be distinguished from the Latin letters
> of the same shape. People could use this to forge fraudulent web sites
> (e.g., www.unicode.org, where one or both of the two "o"'s and the "e" are
> Cyrillic!)
>
> > If we can't even do case shifting without a locale (the Turkish dotless
> > ı and dotted İ), how can we decide what is a letter? If = u then is
> > = a. How about = n?
> >
> > The problem is that there is no easy solution. It might be part of the
> > Danes' inherent good humor to start and end their alphabet with the
> > letter a, but they won't think it is funny to change æ to ae, ø to o or
> > å to a. Likewise, the Vietnamese letter ê is a letter, where in most
> > languages the circumflex is an accent.
>
> I see your point, but you should keep in mind that nobody (apart, maybe,
> implementers and administrators of DNS servers) will ever see the result
> of this "normalization". So we don't have any display or spell-checking
> problem here.
>
> Your example with Danish "æ" being converted to "ae" is one that I wanted
> to use to defend the opposite point of view!
>
> How many Danish words contain an "a" followed by an "e"? And, which is
> more important, how many *pairs* of Danish words are distinguished only by
> the fact that the first one contains an "æ" where the second one contains
> an "ae" sequence? And how many of these minimal pairs would create a
> problem in a server name?
>
> If the answers to all these (rhetorical) questions are what I think, then
> it is perfectly OK to convert "æ" to "ae".
>
> However, your other examples are not as straightforward. For most
> languages, it would be crazy to maintain the ^ on "ê" (in modern French or
> Portuguese, for instance, the ^ accent carries almost no phonetic
> significance, so it is very likely that people may omit it in informal
> typing), but in a language like Vietnamese the presence/absence of "^"
> makes up hundreds (or thousands?) of minimal pairs, and it would be very
> annoying if company "Tieng Viêt" could not register their domain because
> company "Tiêng Viet" already did it!
>
> But using locale is by no means a solution (how do you tag a domain name
> with the proper locale information to drive an ad-hoc normalization?).
>
> So, I am afraid that a compromise has to be sought, and it will have to
> sacrifice something (e.g., the distinction between dotted and dotless
> "i"!). Whatever the limitation, it will however be better than the
> proposal by XNS.
>
> Marco
>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:14 EDT