RE: New Name Registry Using Unicode

From: Marco.Cimarosti@icl.com
Date: Fri Sep 29 2000 - 04:34:51 EDT


tom@bluesky.org wrote:
> In XNS 1.0, XNS personal, business, and general names all
> follow the same normalization rules:

These normalization rules only work for ASCII, so why bother using Unicode?

After all, they could all keep on using ASCII (cf.
http://www.trigeminal.com/samples/provincial.html).

> 1.
> Names can be up to 64 characters of XML text (Unicode 2.0
> characters as
> defined by the W3C XML 1.0 specification).

I think this means that text is normalized by *composition*, right?

This means that letters with diacritics will be handled as completely
different from their base letter. This would be a nightmare for languages
where diacritics have an "optional status". A few examples of these funny
minority languages: English, Arabic, Italian, Hebrew (add also, e.g., French
and Spanish, if you consider the old deprecated usage of removing accents in
uppercase).

It means that, say, "www.coöperate.ut" and "www.cooperate.ut" would be
considered as different names, which is certainly not what most users want.

A better choice, IMHO, would be to normalize by *decomposition*. In this
way, the problem above would be addressed by rule 3 below.
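
To make the difference concrete, here is a rough Python sketch. It is my
own illustration, not anything taken from the XNS spec: the function name
uniqueness_key is made up, and I am assuming that rule 3 is applied
*after* normalization and that "composition" and "decomposition" mean
Unicode forms NFC and NFD.

```python
import unicodedata

# Categories that rule 3 (quoted below) treats as significant.
SIGNIFICANT = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}

def uniqueness_key(name, form):
    """Normalize `name` to `form` ("NFC" or "NFD"), then keep only the
    characters that rule 3 counts as significant (hypothetical helper)."""
    normalized = unicodedata.normalize(form, name)
    return "".join(c for c in normalized
                   if unicodedata.category(c) in SIGNIFICANT)

with_diaeresis = "www.co\u00f6perate.ut"  # precomposed o with diaeresis
plain = "www.cooperate.ut"

# Composed: U+00F6 is a letter ("Ll") in its own right, so the keys differ.
print(uniqueness_key(with_diaeresis, "NFC") == uniqueness_key(plain, "NFC"))  # False

# Decomposed: U+00F6 becomes "o" + U+0308, a mark ("Mn") that rule 3 drops.
print(uniqueness_key(with_diaeresis, "NFD") == uniqueness_key(plain, "NFD"))  # True
```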

> 2.
> For purposes of name representation, all characters are legal
> except the
> XNS global namespace prefix characters "=", "@", "+", the namespace
> delimiter character "/", and the XML markup tag delimiter
> characters "<"
> and ">".

Shouldn't ":" be out as well? It acts as the separator for the port number.
How do you distinguish a name like "www.unicode.org:80" (where ":80" is part
of the name) from "www.unicode.org" with a ":80" suffix?

And how about "?" and "~"?
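
A small Python sketch of the ambiguity, under my own assumptions: the
helper is_legal_name is hypothetical, it only encodes rule 2 as quoted
plus the 64-character limit from rule 1, and I assume the name ends up
somewhere that is parsed with ordinary URL syntax.

```python
from urllib.parse import urlsplit

# Characters that rule 2, as quoted above, forbids in a name.
FORBIDDEN = set("=@+/<>")

def is_legal_name(name):
    """Rule 2 as quoted, plus the 64-character limit (hypothetical helper)."""
    return 0 < len(name) <= 64 and not (set(name) & FORBIDDEN)

name = "www.unicode.org:80"

# The colon is not on the forbidden list, so this is a legal name.
print(is_legal_name(name))          # True

# Ordinary URL syntax reads the very same text as host plus port.
parts = urlsplit("http://" + name)
print(parts.hostname, parts.port)   # www.unicode.org 80
```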

> 3.
> For purposes of name registration uniqueness, the only significant
> characters are numbers and letters as defined by the Java
> isLetterOrDigit
> function returning TRUE. This function determines if a character is a
> letter or digit according to the Unicode 2.0 standard
> (category "Lu", "Ll",
> "Lt", "Lm", "Lo", or "Nd" in the Unicode specification data
> file). For the
> full specification, see Gosling, Joy, and Steele, The Java Language
> Specification.

I think that *much* more careful research should be carried out regarding
which characters are to be considered of "top significance" and which ones
are not.

Some examples of characters that would be excluded by this rule:

- All vowels in Indic and South-East Asian scripts! -- unless they
happen to occur at the beginning of words, in which case they are "Lo".

- Indic viramas! -- Removing viramas in Indic alphabets is like adding
random "a"'s to Western text.

- Tibetan subjoined consonants! -- which are consonants on the same footing
as the Tibetan "Lo"'s; they just happen not to be *preceded* by a vowel.
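
The categories are easy to check. The sketch below is only my
illustration: it uses Python's unicodedata module rather than Java, the
sample characters are my own picks, and the helper significant_only is
hypothetical. The general categories of these particular characters are
the same in the Unicode data, whichever tool you use to look them up.

```python
import unicodedata

# Categories that rule 3 treats as significant.
SIGNIFICANT = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}

samples = [
    "\u0905",  # Devanagari independent vowel A
    "\u093f",  # Devanagari dependent vowel sign I
    "\u094d",  # Devanagari virama
    "\u0f40",  # Tibetan letter KA
    "\u0f90",  # Tibetan subjoined letter KA
]
for ch in samples:
    cat = unicodedata.category(ch)
    verdict = "kept" if cat in SIGNIFICANT else "dropped"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}, {verdict}")

def significant_only(text):
    """Keep only what rule 3 counts as significant (hypothetical helper)."""
    return "".join(c for c in text if unicodedata.category(c) in SIGNIFICANT)

# The word "Hindi": HA, vowel sign I, NA, virama, DA, vowel sign II.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
# Only the three bare consonants survive.
print([f"U+{ord(c):04X}" for c in significant_only(hindi)])
```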

Moreover, why consider only "Nd" characters? All numerical ("N*")
characters represent numbers, and are significant to the same degree. I see
no reason why "www.number-1.ut" and "www.number-2.ut" should be considered
different names, while "www.number-I.ut" and "www.number-II.ut" should be
considered the *same* name (www.number.ut)!
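
Here is a rough Python sketch of that collision. Assumptions of mine:
the "I" and "II" in the example stand for the Roman numeral characters
U+2160 and U+2161 (category "Nl"), and the helper name key and the two
category sets are made up for the illustration.

```python
import unicodedata

# Rule 3 as quoted: letters plus decimal digits only.
LETTERS_AND_ND = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}
# The alternative suggested here: every numeric category ("N*").
LETTERS_AND_ALL_N = LETTERS_AND_ND | {"Nl", "No"}

def key(name, significant):
    """Drop every character whose category is not in `significant`."""
    return "".join(c for c in name
                   if unicodedata.category(c) in significant)

one, two = "www.number-1.ut", "www.number-2.ut"
roman_one = "www.number-\u2160.ut"   # ROMAN NUMERAL ONE
roman_two = "www.number-\u2161.ut"   # ROMAN NUMERAL TWO

# Decimal digits are "Nd", so these two stay distinct.
print(key(one, LETTERS_AND_ND) == key(two, LETTERS_AND_ND))                  # False
# Roman numerals are "Nl", so both collapse to "wwwnumberut".
print(key(roman_one, LETTERS_AND_ND) == key(roman_two, LETTERS_AND_ND))      # True
# With all "N*" categories significant, they are distinct again.
print(key(roman_one, LETTERS_AND_ALL_N) == key(roman_two, LETTERS_AND_ALL_N))  # False
```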

> 4.
> Letters in the ASCII range are normalized to lower case. (In
> XNS 1.0, case
> normalization is not applied to any other Unicode character range.)

This is the nicest one!!

Why should ASCII (a *part* of the Latin alphabet) be any different from
other cased alphabets (the *rest* of the Latin alphabet, Greek, Cyrillic,
Armenian)!?

I don't think I need any further explanation or example about this last
point. Could you please explain the reason behind this last rule, if any?
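
Still, for the record, this is what the rule amounts to in practice. It
is a Python sketch of my own: lower_ascii_only is a hypothetical helper
that does what rule 4 describes, and Python's str.lower() stands in for
a case normalization that covers all cased scripts.

```python
def lower_ascii_only(name):
    """Rule 4 as quoted: lowercase A-Z, leave everything else alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)

samples = [
    "ECOLE",                                  # ASCII only
    "\u00c9COLE",                             # Latin, with E acute
    "\u0391\u0398\u0397\u039d\u0391",         # Greek
    "\u041c\u041e\u0421\u041a\u0412\u0410",   # Cyrillic
]
for upper in samples:
    lower = upper.lower()
    same_under_rule_4 = lower_ascii_only(upper) == lower_ascii_only(lower)
    print(upper, lower, same_under_rule_4)
# Only the pure ASCII pair is treated as the same name.
```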

_ Marco


