Re: Rationale wanted for Unicode identifier rules

From: John Cowan (
Date: Fri Mar 10 2000 - 17:47:43 EST

Mark Davis wrote:

> <&#x30CD;&#x30E0;>name</&#x30CD;&#x30E0;>
> is a valid segment of an XML 1.0 file, with English data, and the name of the data field is Japanese Katakana "NEMU".

In fact this is not well-formed XML (in XMLspeak, "valid" means something
different), because &#xNNNN; constructs are not allowed *within* identifiers;
they may appear only in character data and attribute values. So which
identifiers can appear in a well-formed document depends, in this particular
case, on the charset. In UTF-8 and UTF-16, which all XML parsers must
understand, any legal XML identifier can be used, but in ASCII only the
ASCII subset is usable.
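
This is easy to check by handing the variants to a parser. What follows is
only a minimal sketch, assuming Python's standard xml.etree.ElementTree
module (which sits on top of expat); the helper name is_well_formed and the
sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if the parser accepts `data` as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Character references inside the element name: rejected.
char_ref_in_name = b"<&#x30CD;&#x30E0;>name</&#x30CD;&#x30E0;>"

# The same two Katakana letters written literally, encoded as UTF-8.
literal_utf8 = "<\u30cd\u30e0>name</\u30cd\u30e0>".encode("utf-8")

# Identical bytes, but the document declares itself to be ASCII,
# so the non-ASCII bytes in the name are not legal input.
declared_ascii = b'<?xml version="1.0" encoding="US-ASCII"?>' + literal_utf8

# Character references are fine in character data.
char_ref_in_content = b"<field>&#x30CD;&#x30E0;</field>"

if __name__ == "__main__":
    samples = {
        "char ref in name": char_ref_in_name,
        "literal UTF-8 name": literal_utf8,
        "literal name, declared ASCII": declared_ascii,
        "char ref in content": char_ref_in_content,
    }
    for label, doc in samples.items():
        print(label, "->", is_well_formed(doc))
```

Only the second and fourth samples should parse; the first fails regardless of
charset, and the third fails because of the charset alone.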

> 1. Status quo.
> Never accept characters outside of Unicode 2.0 in identifiers. Downside: new
> scripts, and additions (e.g. CJK ideographs) to existing scripts are disadvantaged
> -- forever.

I think this is unacceptable, and others agree.

> 2. Successive upgrades.
> Revise XML with each version of Unicode. This means you will have XML
> 1.0-compliant parsers, XML 1.1-compliant parsers, etc. Disadvantage:
> it takes years for compliant parsers to be fully spread across the world.
> During that time, data interchange between different versions of parsers
> cannot be guaranteed. I believe this will be unacceptable to the XML community.

I think so too.

> 3. Open Season.
> Define identifiers to be *fixed* as of Unicode 3.0, but to also include
> unassigned characters (Cn) as of that version.

I think a modified version of this is the best option, somewhat as follows:

        Processes that accept XML MUST accept characters
                they believe to be unassigned.
        Processes that generate XML MUST NOT generate characters
                they believe to be unassigned.

With this scheme, a Unicode 3.0 acceptor will accept letters and digits
assigned in Unicode 4.0 and generated by a Unicode 4.0 process, but even
though it would accept symbols from Unicode 4.0, a Unicode 4.0 generator
process will never generate any.
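
The asymmetry between the two rules can be sketched in a few lines. This is
an illustration under loud assumptions, not the XML name productions: it uses
Python's unicodedata module as a stand-in for "the Unicode version this
process knows", and IDENT_CATEGORIES, acceptor_allows and generator_allows
are hypothetical names with a deliberately simplified category list.

```python
import unicodedata

# ASSUMPTION: a simplified set of general categories standing in for
# "letters and digits"; the real XML rules are more detailed than this.
IDENT_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc"}


def acceptor_allows(ch: str) -> bool:
    """Acceptor rule: take known identifier characters, and also anything
    this process believes to be unassigned (general category Cn)."""
    cat = unicodedata.category(ch)
    return cat in IDENT_CATEGORIES or cat == "Cn"


def generator_allows(ch: str) -> bool:
    """Generator rule: emit only characters this process knows to be
    assigned identifier characters, never unassigned ones."""
    return unicodedata.category(ch) in IDENT_CATEGORIES


def acceptable_name(name: str) -> bool:
    return bool(name) and all(acceptor_allows(ch) for ch in name)


def generatable_name(name: str) -> bool:
    return bool(name) and all(generator_allows(ch) for ch in name)
```

A code point that this process's tables list as unassigned passes
acceptable_name but not generatable_name, while a symbol the process already
knows about, such as "$", fails both. An older acceptor therefore lets through
whatever a newer generator is permitted to emit.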

> 4. Restricted Open Season.
> The Unicode consortium divides up the unassigned space in more detail,
> and specifies that <excluded> characters and <identifier_extend> characters
> can only be allocated in the future within certain blocks. This has the
> effect of dividing Cn into subcategories: Cni, Cne, and Cnx. While
> characters will change from each of these to other properties over time
> as characters become assigned, the three relevant categories will remain unchanged.

Nice if you can get it, but probably not available, given the tendency to
allocate script-specific puncts and symbols next to the letters and digits.


Schlingt dreifach einen Kreis vom dies! || John Cowan <>
Schliesst euer Aug vor heiliger Schau,  ||
Denn er genoss vom Honig-Tau,           ||
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)
