L2/01-245

To: W3C i18n committee

From: UTC (ed. Mark Davis)

Date: 2000.06.06

Re: Questions on characters permitted in XML Names

The following describes the UTC discussion at UTC Meeting #87 on name characters, with the W3C text bolded, and the reply unbolded.

The W3C I18N WG requests input from the UTC on how the XML specification should be revised in regard to characters permitted in names. In particular, we would welcome input on:

1. Why a policy of "anything goes" would be good/bad?

We presume that this means a policy that any characters are valid in names except for special syntax characters like space, "<", '"', etc.

Advantages: Parsing can be somewhat simpler and, more importantly, is independent of Unicode version. When XML is viewed simply as a structured protocol for exchange of data, this policy works well.

Disadvantages: If the XML is to be human-readable, there may be more occasion for error if variables are free-form. In particular, this goes for characters of type Z* or C*, which may cause problems because they are invisible. It may also be disconcerting to see XML such as the following.

<abc ☺="3">...

<█↔ ♀="5" ⅓="7" ₣="12">...
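The invisible-character concern can be checked mechanically. The following is a minimal sketch, not part of the UTC discussion: it uses Python's standard unicodedata module to list the characters in a candidate name whose General Category starts with Z or C. The function name invisible_chars is hypothetical, and the categories reported depend on the Unicode version bundled with the Python runtime.

```python
import unicodedata


def invisible_chars(name: str) -> list[str]:
    """Return the characters of `name` whose General Category is Z* or C*."""
    return [ch for ch in name if unicodedata.category(ch)[0] in ("Z", "C")]


# U+200B ZERO WIDTH SPACE is a format character (Cf) and renders as nothing,
# so "ab\u200bc" looks identical to "abc" on screen.
suspicious = invisible_chars("ab\u200bc")
print([f"U+{ord(ch):04X}" for ch in suspicious])
```

A tool that lets people author XML by hand could run such a check on names even under an "anything goes" policy, purely as a warning.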

1a. Why a policy of "anything except what the XML specification has already excluded" would be good/bad?

This is very much like the previous question, except that the disconcerting cases would be limited to characters after a given cut-off point. If the goal is to minimize the disconcerting cases, then the latest existing version of Unicode at the time of the change to the XML specification should be used.

2. Why a policy of automatic linkage to versions of TUS (coupled with clear rules for choosing name characters) would be good/bad?

Automatic linkage could work, given something like the following strategy (after the next question).

3. How a policy of automatic linkage to versions of TUS could work?

[Note: we did not get into this level of detail in the UTC -- it is the product of later work in the editorial committee.]

There are different possibilities. Here is one way to look at the options.

Definitions

XML currently has the definitions:

 

[4]   NameChar   ::=   Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender

[5]   Name   ::=   (Letter | '_' | ':') (NameChar)*

These can be viewed more simply for the following discussion as:

[4]   NameChar   ::=   Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender

[4a]   NameStart   ::=   Letter | '_' | ':'

[5]   Name   ::=   NameStart (NameChar)*
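To make the simplified productions concrete, here is a minimal sketch of a matcher for Name ::= NameStart (NameChar)*. It is illustrative only: the character sets below are ASCII-only toy stand-ins (an assumption), whereas the real Letter, Digit, CombiningChar, and Extender classes are the much larger sets listed in XML 1.0 Appendix B. The names is_name, NAME_START, and NAME_CHAR are hypothetical.

```python
import string

# Toy ASCII-only stand-ins for the XML 1.0 character classes (assumption);
# CombiningChar and Extender are omitted because they have no ASCII members.
LETTER = set(string.ascii_letters)
DIGIT = set(string.digits)
NAME_START = LETTER | {"_", ":"}
NAME_CHAR = LETTER | DIGIT | {".", "-", "_", ":"}


def is_name(s, name_start=NAME_START, name_char=NAME_CHAR):
    """Return True if s matches Name ::= NameStart (NameChar)*."""
    return bool(s) and s[0] in name_start and all(ch in name_char for ch in s[1:])


for candidate in ["abc", "_a.b-1", "1abc", "a b"]:
    print(repr(candidate), is_name(candidate))
```

Because the two sets are parameters, the same matcher can be reused with per-version sets without changing the grammar.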

For each version V of Unicode that is 3.1 or later, generate the sets UX_NameStart[V], UX_NameChar[V], and Unassigned[V].

Now generate the XML name sets ("+" means set-union):

NameStart[V]  =  UX_NameStart[V] + NameStart[previous(V)]   if V >= 3.1
              =  NameStart as of XML 1.0                     otherwise

NameChar[V]   =  UX_NameChar[V] + NameChar[previous(V)]     if V >= 3.1
              =  NameChar as of XML 1.0                      otherwise
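The recursive definition above can be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: the version list, the toy XML 1.0 base set, and the per-version UX sets are all made-up data, and "3.0" simply stands for any version before 3.1. The helper names version_key and name_set are hypothetical.

```python
# Hypothetical ordered list of versions; "3.0" stands for "before 3.1".
VERSIONS = ["3.0", "3.1", "3.2"]

# Toy data (assumption): the real sets would come from published tables.
XML10_NAME_START = frozenset("abc_:")
UX_NAME_START = {"3.1": frozenset("d"), "3.2": frozenset("e")}


def version_key(v):
    """Turn "3.1" into (3, 1) so versions compare numerically."""
    return tuple(int(part) for part in v.split("."))


def name_set(v, ux_sets, xml10_set):
    """Set[V] = UX[V] + Set[previous(V)] if V >= 3.1, else the XML 1.0 set."""
    if version_key(v) < (3, 1):
        return xml10_set
    previous = VERSIONS[VERSIONS.index(v) - 1]
    return ux_sets[v] | name_set(previous, ux_sets, xml10_set)


for v in VERSIONS:
    print(v, sorted(name_set(v, UX_NAME_START, XML10_NAME_START)))
```

The same function serves for NameChar[V] when given the corresponding UX_NameChar sets and the XML 1.0 NameChar set. Since each step only takes a union, the set for a later version always contains the set for an earlier one.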

 

Note: Because the new sets for each version never remove characters, the set of valid identifiers only grows, never contracts.

 

Note: Of course, one would not actually expect parsers to go through this process. To make this easy for parser implementers, explicit data tables containing the complete sets of code points for each version should be publicly available either on the W3C site or the Unicode site.

Options

Call a document well-formed for Unicode Version V if and only if all of its names are in accordance with the above definition.

Using this, there are three options for what to impose on parsers. In all three options, if there is no Unicode declaration, the identifiers are as in XML 1.0. The options differ only if there is a Unicode declaration for a version V, e.g. <?xml version="1.0" unicode_version="3.2" ?>.
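As an illustration of how a parser might read such a declaration, the sketch below extracts the version string with a regular expression and returns None when it is absent, in which case the XML 1.0 name rules would apply. The unicode_version pseudo-attribute is only the example syntax used in this memo, not part of any XML specification, and the function name declared_unicode_version is hypothetical.

```python
import re

# Matches an XML declaration at the very start of the document.
_DECL = re.compile(r'^<\?xml\b([^?]*)\?>')
# Matches the memo's example pseudo-attribute, e.g. unicode_version="3.2".
_UNICODE_VERSION = re.compile(r'\bunicode_version\s*=\s*(["\'])(\d+(?:\.\d+)*)\1')


def declared_unicode_version(document):
    """Return the declared Unicode version string, or None if there is none."""
    decl = _DECL.match(document)
    if decl is None:
        return None
    found = _UNICODE_VERSION.search(decl.group(1))
    return found.group(2) if found else None


v = declared_unicode_version('<?xml version="1.0" unicode_version="3.2" ?><a/>')
print(v if v is not None else "no declaration: use XML 1.0 name rules")
```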

In the following, let PV be the latest version of Unicode that the parser supports.

Option 1: Strict

This option has the following characteristics:

  1. V <= PV.
  2. V > PV.

Option 2: Lenient

This option has the following characteristics:

  1. V <= PV.
  2. V > PV.

Option 3: Very Lenient

This option has the following characteristics:

  1. V <= PV.
  2. V > PV.

Note: For options 1 and 2, implementations must keep track of multiple versions of Unicode: 3.1, 3.2, 4.0, and so on. For option 3, they need to keep track of only one, PV.

4. What the rules for choosing name characters should be?

The current XML definitions (Appendix B Character Classes + [4], [5]) are very close to what is provided in the Unicode guidelines, being based on the same Unicode properties. The main difference is that they remove compatibility decomposables. They are reasonable as they stand.

5. What can be said, if anything, about the likelihood of suitable name characters in future versions of TUS?

It is absolutely 100% certain that there will be future characters that meet these criteria, and should be allowed in names, although their relative importance will certainly diminish.

6. Anything else the UTC can think of that would help resolve this matter.

Please note that this matter is urgent and we would very much like an early response from the UTC.

Many thanks,
Misha Wolf
W3C I18N WG Chair