Re: Rationale wanted for Unicode identifier rules

From: Mark Davis (
Date: Sat Mar 04 2000 - 15:57:12 EST

In general, I agree with the discussion here: identifiers should be chosen on the basis of character properties. As new characters are assigned, they are given appropriate properties, and the class of possible identifiers grows. There are, however, difficulties with this approach in certain contexts.

Take, for example, XML identifiers. The difference in this case is that the identifiers occur in structured data, not program text. This data will live for years. The conformance requirements for XML identifiers are very strict. This is absolutely correct, since it guarantees compatibility around the world. But what this means is that the current, conformant XML parsers cannot accept new Unicode 3.0 letters in identifiers. There are a few main approaches to identifiers in XML, listed below.

[One note that is relevant to all of these: while <identifier_extend> includes Cf, these character should be filtered when composing identifiers, so there are actually 4 relevant categories for parsing identifiers. However, there are reserved blocks (2060-206F and E0000-E1000) now for Cf characters, so they should not present a problem.]

1. Status quo.
Never accept characters outside of Unicode 2.0 in identifiers. Downside: new scripts, and additions (e.g. CJK ideographs) to existing scripts are disadvantaged -- forever.

<identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl -- as of Unicode 2.0
<identifier_extend> := Mn, Mc, Nd, Pc, Cf -- as of Unicode 2.0
<excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So -- as of Unicode 2.0

2. Successive upgrades.
Revise XML with each version of Unicode. This means you will have XML 1.0-compliant parsers, XML 1.1 compliant parsers, etc.. Disadvantage: it takes years for compliant parsers to be fully spread across the world. During that time, data interchange between different versions of parsers cannot be guaranteed. I believe this will be unacceptable to the XML community.

3. Open Season.
Define identifiers to be *fixed* as of Unicode 3.0, but to also include unassigned characters (Cn) as of that version.

Identifiers are thus fixed for all time. They include all new letters that will be defined. Disadvantage: they will also include new punctuation, symbols, etc. defined post-3.0.

<identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl, AND Cn -- as of Unicode 3.0
<identifier_extend> := Mn, Mc, Nd, Pc, Cf, AND Cn -- as of Unicode 3.0
<excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So -- as of Unicode 3.0

4. Restricted Open Season.
The Unicode consortium divides up the unassigned space in more detail, and specifies that <excluded> characters and <identifier_extend> characters can only be allocated in the future within certain blocks. This has the effect of dividing Cn into subcategories: Cni, Cne, and Cnx. While characters will change from each of these to other properties over time as characters become assigned, the three relevant categories will remain unchanged.

Since future allocations will not disturb the identifier syntax, identifiers are thus fixed for all time. Disadvantage: the consortium as a whole has resisted such assignment of blocks for unassigned characters in the past (except Cf).

<identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl, AND Cni
<identifier_extend> := Mn, Mc, Nd, Pc, Cf, AND Cne
<excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So, AND Cnx


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT