RE: New Name Registry Using Unicode

Date: Fri Sep 29 2000 - 04:34:51 EDT wrote:
> In XNS 1.0, XNS personal, business, and general names all
> follow the same normalization rules:

These normalization rules only work for ASCII, so why bother using Unicode?

After all, they can all keep on using ASCII (cmp.

> 1.
> Names can be up to 64 characters of XML text (Unicode 2.0 characters as
> defined by the W3C XML 1.0 specification).

I think this means that text is normalized by *composition*, right?

This means that letters with diacritics will be handled as completely
different from their base letter. This would be a nightmare for languages
where diacritics have an "optional status". A few examples of these funny
minority languages: English, Arabic, Italian, Hebrew (add also, e.g., French
and Spanish, if you consider the old deprecated usage of removing accents in

It means that, say, "www.coöperate.ut" and "www.cooperate.ut" would be
considered as different names, which is certainly not what most users want.

A better choice, IMHO, would be to normalize by *decomposition*. In this
way, the problem above would be addressed by rule 3 below.
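
To make the difference concrete, here is a rough sketch in Python. It assumes
that rule 3 amounts to keeping only characters in the categories it lists, and
it uses Python's unicodedata module (which follows a much newer Unicode version
than 2.0) as a stand-in for Java's isLetterOrDigit. The helper name
uniqueness_key is my own invention, not something from the XNS specification.

```python
import unicodedata

# Categories named in rule 3 of the quoted XNS text.
SIGNIFICANT = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}

def uniqueness_key(name, form):
    """Hypothetical helper: normalize, then keep only rule-3 characters."""
    normalized = unicodedata.normalize(form, name)
    return "".join(c for c in normalized
                   if unicodedata.category(c) in SIGNIFICANT)

with_diaeresis = "www.co\u00f6perate.ut"  # precomposed o with diaeresis
plain = "www.cooperate.ut"

# Composed form: the accented letter is a distinct "Ll" character,
# so the two keys stay different.
composed_differ = (uniqueness_key(with_diaeresis, "NFC")
                   != uniqueness_key(plain, "NFC"))

# Decomposed form: the diaeresis becomes a combining mark ("Mn"),
# which the rule-3 filter drops, so the two keys coincide.
decomposed_equal = (uniqueness_key(with_diaeresis, "NFD")
                    == uniqueness_key(plain, "NFD"))

print(composed_differ, decomposed_equal)
```

Under these assumptions, only the decomposed form lets the rule-3 filter treat
the diacritic as insignificant.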

> 2.
> For purposes of name representation, all characters are legal except the
> XNS global namespace prefix characters "=", "@", "+", the namespace
> delimiter character "/", and the XML markup tag delimiter characters "<"
> and ">".

Shouldn't ":" be out as well? It acts as the separator for the port number.
How do you distinguish a name like "" (where ":80" is part
of the name) from "" with a ":80" suffix?

And how about "?" and "~"?
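
A small Python sketch of the ambiguity, assuming rule 2 is applied literally.
Both function names and the name "example" are placeholders of mine; the port
split is only a guess at how a naive URL-style parser might behave.

```python
# Characters that rule 2 of the quoted text declares illegal in a name.
ILLEGAL = set("=@+/<>")

def is_legal_name(name):
    """Hypothetical check: True if no rule-2 illegal character appears."""
    return not (set(name) & ILLEGAL)

def split_port(text):
    """Naive host:port split on the last colon (an assumption, not XNS)."""
    host, sep, port = text.rpartition(":")
    if sep and port.isdecimal():
        return host, int(port)
    return text, None

candidate = "example:80"
print(is_legal_name(candidate))   # rule 2 does not forbid ":"
print(split_port(candidate))      # yet a parser may read ":80" as a port
print(is_legal_name("a?b~c"))     # "?" and "~" are not forbidden either
```

So the same string is both a legal name in its own right and a shorter name
followed by a port number, and nothing in rule 2 says which reading wins.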

> 3.
> For purposes of name registration uniqueness, the only significant
> characters are numbers and letter as defined by the Java isLetterOrDigit
> function returning TRUE. This function determines if a character is a
> letter or digit according to the Unicode 2.0 standard (category "Lu", "Ll",
> "Lt", "Lm", "Lo", or "Nd" in the Unicode specification data file). For the
> full specification, see Gosling, Joy, and Steele, The Java Language
> Specification.

I think that *much* more careful research should be carried out regarding
which characters are to be considered "top significance", and which ones

Some examples of characters that would be excluded by this rule:

- All vowels in Indian and South-East Asian languages! -- unless they
happen to occur at the beginning of words, in which case they are "Lo".

- Indic viramas! -- Removing viramas in Indic alphabets is like adding
random "a"'s to Western text.

- Tibetan subjoined consonants! -- which are consonants on the same footing
as Tibetan "Lo"'s; they just happen not to be *preceded* by a vowel.

Moreover, why consider only "Nd" characters? All numerical ("N*")
characters represent numbers, and are significant to the same degree. I see
no reason why "www.number-1.ut" and "www.number-2.ut" should be considered
different names, while "www.number-I.ut" and "www.number-II.ut" should be
considered the *same* name (www.number.ut)!
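
The cases above can be checked with a short Python sketch. It assumes the rule
is a plain filter on the six listed categories, uses Python's unicodedata
(current Unicode data, not 2.0), and takes the Roman numerals to be the
dedicated characters U+2160 and U+2161 rather than ASCII letters. The helper
significant_only is hypothetical.

```python
import unicodedata

# Categories named in rule 3 of the quoted XNS text.
SIGNIFICANT = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}

def significant_only(text):
    """Hypothetical helper: drop every character outside rule-3 categories."""
    return "".join(c for c in text
                   if unicodedata.category(c) in SIGNIFICANT)

# An independent vowel, a dependent vowel sign, a virama,
# a Tibetan subjoined consonant, and a Roman numeral.
for char in "\u0905\u093f\u094d\u0f90\u2160":
    print(f"U+{ord(char):04X}", unicodedata.category(char),
          unicodedata.name(char))

# The word "Hindi" in Devanagari: HA, vowel sign I, NA, virama, DA,
# vowel sign II. Only the three bare consonants survive the filter.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
reduced = significant_only(hindi)

# ASCII digits keep the names apart; Roman numeral characters do not.
digits_differ = significant_only("number-1") != significant_only("number-2")
romans_collide = (significant_only("number-\u2160")
                  == significant_only("number-\u2161"))
print(reduced, digits_differ, romans_collide)
```

With current Unicode data, the vowel sign, virama and subjoined letter fall in
the "M*" categories and the Roman numeral in "Nl", so the filter discards all
of them.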

> 4.
> Letters in the ASCII range are normalized to lower case. (In XNS 1.0, case
> normalization is not applied in to any other Unicode character range.)

This is the nicest one!!

Why should ASCII (a *part* of the Latin alphabet) be any different from
other cased alphabets (the *rest* of the Latin alphabet, Greek, Cyrillic,

I don't think I need any further explanation or example about this last
point. Could you please explain the reason behind this last rule, if any?
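
For anyone who does want it spelled out, a minimal Python sketch follows. The
function ascii_lower is my guess at what rule 4 implies, not code from the XNS
specification, and the sample names are made up; str.lower() stands in for
full Unicode case mapping.

```python
def ascii_lower(name):
    """Hypothetical rule-4 folding: lower-case A-Z only, leave the rest."""
    return "".join(c.lower() if "A" <= c <= "Z" else c for c in name)

# (upper-case spelling, lower-case spelling) of the same made-up name
pairs = [
    ("XNS", "xns"),                                    # ASCII Latin
    ("\u00c9COLE", "\u00e9cole"),                      # Latin beyond ASCII
    ("\u0391\u0398\u0397\u039d\u0391",
     "\u03b1\u03b8\u03b7\u03bd\u03b1"),                # Greek
    ("\u041c\u0418\u0420", "\u043c\u0438\u0440"),      # Cyrillic
]

# Does each pair collapse to one name under the two folding schemes?
ascii_only = [ascii_lower(a) == ascii_lower(b) for a, b in pairs]
full_unicode = [a.lower() == b.lower() for a, b in pairs]

print(ascii_only)
print(full_unicode)
```

Only the pure-ASCII pair is treated as one name under the ASCII-only folding,
while full case mapping unifies all four.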

_ Marco
