UTC response to WG2

L2/01-245

To:	W3C i18n committee
From:	UTC (ed. Mark Davis)
Date:	2000.06.06
Re:	Questions on characters permitted in XML Names

The following describes the UTC discussion at UTC Meeting #87 on name characters, with the W3C text bolded, and the reply unbolded.

The W3C I18N WG requests input from the UTC on how the XML specification should be revised in regard to characters permitted in names. In particular, we would welcome input on:

1. Why a policy of "anything goes" would be good/bad?

We presume that this means a policy that any characters are valid in names except for special syntax characters like space, "<", '"', "'", etc.

Advantages: Parsing can be somewhat simpler and, more importantly, is independent of Unicode version. When XML is viewed simply as a structured protocol for exchange of data, this policy works well.

Disadvantages: If the XML is to be human-readable, there may be more occasion for error if variables are free-form. In particular, this goes for characters of type Z* or C*, which may cause problems because they are invisible. It may also be disconcerting to see XML such as the following.

<abc ☺="3">...

<█↔ ♀="5" ⅓="7" ₣="12">...
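The invisible-character concern can be checked mechanically. The following is a minimal Python sketch, not something discussed by the UTC: the helper name `invisible_chars` is hypothetical, and the result depends on the Unicode version of the tables shipped with the Python interpreter's `unicodedata` module.

```python
import unicodedata

def invisible_chars(name):
    """Return the characters in `name` whose general category is Z* or C*."""
    return [ch for ch in name if unicodedata.category(ch)[0] in "ZC"]

# A zero-width space (U+200B) hides inside an otherwise plain-looking name.
print(invisible_chars("ab\u200bc"))
# A symbol such as U+263A is visible, so this check does not flag it.
print(invisible_chars("\u263a"))
```

A check like this addresses only the invisible characters; the "disconcerting" symbol cases would need a separate rule.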

1a. Why a policy of "anything except what the XML specification has already excluded" would be good/bad?

This is very much like the previous question, except that the disconcerting cases would be limited to characters after a given cut-off point. If the goal is to minimize the disconcerting cases, then the latest existing version of Unicode at the time of the change to the XML specification should be used.

2. Why a policy of automatic linkage to versions of TUS (coupled with clear rules for choosing name characters) would be good/bad?

Automatic linkage could work, given something like the following strategy (after the next question).

3. How a policy of automatic linkage to versions of TUS could work?

[Note: we did not get into this level of detail in the UTC -- it is the product of later work in the editorial committee.]

There are different possibilities. Here is one way to look at the options.

Definitions

XML currently has the definitions:


[4]	`NameChar`	::=	`Letter \| Digit \| '.' \| '-' \| '_' \| ':' \| CombiningChar \| Extender`
[5]	`Name`	::=	`(Letter \| '_' \| ':') (NameChar)*`

These can be viewed more simply for the following discussion as:

[4]	`NameChar`	::=	`Letter \| Digit \| '.' \| '-' \| '_' \| ':' \| CombiningChar \| Extender`
[4a]	`NameStart`	::=	`Letter \| '_' \| ':'`
[5]	`Name`	::=	`NameStart (NameChar)*`

For each version V of Unicode that is 3.1 or later, generate the sets: UX_NameStart[V], UX_NameChar[V], and Unassigned[V]

The last is simply the list of code points that are unassigned in V
The generation of UX_NameStart and UX_NameChar is based on Appendix B Character Classes + [4], [4a], [5]

Only characters that are newly assigned in V are included in the first two.
This condition is optional; it allows one to only consider major.minor versions of Unicode, since those are the only ones that introduce new characters.

Now generate the XML name sets ("+" means set-union)

NameStart[V] =	UX_NameStart[V] + NameStart[previous(V)]	if V >= 3.1
	NameStart as of XML 1.0	otherwise
NameChar[V] =	UX_NameChar[V] + NameChar[previous(V)]	if V >= 3.1
	NameChar as of XML 1.0	otherwise

Note:

Because the new sets for each version never remove characters, the set of valid identifiers only grows, never contracts.

Note:

Of course, one would not actually expect parsers to go through this process. To make this easy for parser implementers, explicit data tables containing the complete sets of code points for each version should be publicly available either on the W3C site or the Unicode site.
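The cumulative construction above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not real data: single characters stand in for code points, the contents of every set are invented, versions are written as (major, minor) tuples, and the names `cumulative`, `name_start`, and `name_char` are hypothetical.

```python
# Placeholder data: the contents of each set are invented for illustration.
XML10_NAME_START = set("ab_:")
XML10_NAME_CHAR = XML10_NAME_START | set("01.-")

# Per-version additions (UX_NameStart[V], UX_NameChar[V]), keyed by (major, minor).
UX_NAME_START = {(3, 1): set("c"), (3, 2): set("d"), (4, 0): set()}
UX_NAME_CHAR = {(3, 1): set("c2"), (3, 2): set("d"), (4, 0): set("3")}

def cumulative(base, additions, version):
    """Union the XML 1.0 base set with every addition up to `version`."""
    result = set(base)
    for v in sorted(additions):
        if v <= version:
            result |= additions[v]
    return result

def name_start(version):
    return cumulative(XML10_NAME_START, UX_NAME_START, version)

def name_char(version):
    return cumulative(XML10_NAME_CHAR, UX_NAME_CHAR, version)

# Each later version yields a superset of the earlier one.
print(name_start((3, 1)) <= name_start((3, 2)) <= name_start((4, 0)))
```

Because the sets are built only by union, any version earlier than 3.1 falls back to the XML 1.0 sets, matching the "otherwise" rows of the definition.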

Options

Call a document well-formed for Unicode Version V if and only if all of its names are in accordance with the above definition.

Using this, there are three options for the requirements imposed on parsers. In all three options, if there is no Unicode declaration, the identifiers are as in XML 1.0. The options differ only if there is a Unicode declaration for a version V, e.g. <?xml version="1.0" unicode_version="3.2" ?>.
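As a rough illustration of how a parser might read such a declaration, here is a minimal Python sketch. The `unicode_version` pseudo-attribute is the one proposed in this reply and is not part of XML 1.0; the function name `declared_unicode_version` and the regular expression are assumptions made for the example, not a complete XML declaration parser.

```python
import re

# Matches the proposed unicode_version pseudo-attribute in an XML declaration.
_DECL = re.compile(
    r"""<\?xml\b[^?]*?\bunicode_version\s*=\s*(["'])(\d+)\.(\d+)\1"""
)

def declared_unicode_version(document):
    """Return (major, minor) from the declaration, or None if absent."""
    m = _DECL.match(document)
    if m is None:
        return None  # no declaration: identifiers are as in XML 1.0
    return (int(m.group(2)), int(m.group(3)))

print(declared_unicode_version('<?xml version="1.0" unicode_version="3.2" ?><a/>'))
print(declared_unicode_version('<?xml version="1.0"?><a/>'))
```

Returning the version as a tuple lets it be compared directly with the parser's own latest supported version.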

In the following, let PV be the latest version of Unicode that the parser supports.

Option 1: Strict

Let VM be min(V, PV).
The parser must reject the document if the names do not parse in accordance with the above definitions of NameStart[VM] and NameChar[VM].

This option has the following characteristics:

V <= PV.

All well-formed documents for V are accepted
All ill-formed documents for V are rejected.

V > PV.

Not all well-formed documents for V are accepted.

The well-formed documents that are rejected are those containing characters not yet assigned in PV, but that are well-formed in V.

All ill-formed documents for V are rejected.

Option 2: Lenient

Let VM be min(V, PV).
The parser must reject the document if the names do not parse in accordance with the following definitions of ParserNameStart and ParserNameChar.

ParserNameStart = NameStart[VM] + Unassigned[VM]
ParserNameChar = NameChar[VM] + Unassigned[VM]

This option has the following characteristics:

V <= PV.

All well-formed documents for V are accepted
All ill-formed documents for V are rejected.

V > PV.

All well-formed documents for V are accepted
Not all ill-formed documents for V are rejected.

The ill-formed documents that are accepted are those containing characters not yet assigned in PV, but that are ill-formed in V.
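The difference between Options 1 and 2 when V > PV can be illustrated with toy tables. This is a minimal sketch under stated assumptions: single characters stand in for code points, all table contents are invented, the NameStart/NameChar distinction is ignored for brevity, and the function name `accepts` is hypothetical.

```python
# Placeholder tables: letters and punctuation stand in for code points.
UNIVERSE = set("abcd!?xyz")
ASSIGNED = {(3, 1): set("abc!"), (3, 2): set("abcd!?")}
NAME_CHAR = {(3, 1): set("abc"), (3, 2): set("abcd")}

def unassigned(version):
    """Code points of the toy universe not yet assigned in `version`."""
    return UNIVERSE - ASSIGNED[version]

def accepts(name, declared, parser_latest, lenient):
    """Apply Option 1 (lenient=False) or Option 2 (lenient=True)."""
    vm = min(declared, parser_latest)
    allowed = set(NAME_CHAR[vm])
    if lenient:
        allowed |= unassigned(vm)
    return all(ch in allowed for ch in name)

# Document declares V = 3.2, parser only knows PV = 3.1.
V, PV = (3, 2), (3, 1)
for name in ("abd", "ab?", "ab!"):
    print(name, accepts(name, V, PV, lenient=False), accepts(name, V, PV, lenient=True))
```

In this toy data, "abd" is valid for V but uses a character the parser does not know, so only the lenient check accepts it; "ab?" is invalid for V but is also accepted by the lenient check, which is the trade-off described above.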

Option 3: Very Lenient

Let VM be PV, otherwise as in Option 2.

This option has the following characteristics:

V <= PV.

All well-formed documents for V are accepted
Not all ill-formed documents for V are rejected.

The ill-formed documents that are accepted are those containing characters not yet assigned in V, but that are ill-formed in PV.

V > PV.

All well-formed documents for V are accepted
Not all ill-formed documents for V are rejected.

The ill-formed documents that are accepted are those containing characters not yet assigned in PV, but that are ill-formed in V.

Note:

For options 1 and 2, implementations must keep track of multiple versions of Unicode: 3.1, 3.2, 4.0,...
For option 3, they only need to keep track of one, PV.
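Under Option 3 the declared version plays no part in the check, so a single precomputed table is enough. The sketch below illustrates this with invented placeholder data (single characters standing in for code points); the names `PARSER_NAME_CHAR` and `accepts_very_lenient` are hypothetical, and the NameStart/NameChar distinction is again ignored for brevity.

```python
# Placeholder data for the one version the parser supports (PV).
UNIVERSE = set("abcd!?xyz")
ASSIGNED_PV = set("abcd!?")
NAME_CHAR_PV = set("abcd")

# ParserNameChar = NameChar[PV] + Unassigned[PV], computed once.
PARSER_NAME_CHAR = NAME_CHAR_PV | (UNIVERSE - ASSIGNED_PV)

def accepts_very_lenient(name):
    """Option 3: check against the PV table, ignoring the declared version."""
    return all(ch in PARSER_NAME_CHAR for ch in name)

for name in ("abd", "abx", "ab!"):
    print(name, accepts_very_lenient(name))
```

The cost of the simpler bookkeeping is that a document declaring an older version is checked against the newer table.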

4. What the rules for choosing name characters should be?

The current XML definitions (Appendix B Character Classes + [4], [5]) are very close to what is provided in the Unicode guidelines, being based on the same Unicode properties. The main difference is that they remove compatibility decomposables. They are reasonable as they stand.

5. What can be said, if anything, about the likelihood of suitable name characters in future versions of TUS?

It is absolutely 100% certain that there will be future characters that meet these criteria, and should be allowed in names, although their relative importance will certainly diminish.

6. Anything else the UTC can think of that would help resolve this matter.

Please note that this matter is urgent and we would very much like an early response from the UTC.

Many thanks,
Misha Wolf
W3C I18N WG Chair