L2/01-245
To: W3C i18n committee
From: UTC (ed. Mark Davis)
Date: 2000.06.06
Re: Questions on characters permitted in XML Names
The following describes the UTC discussion at UTC Meeting #87 on name characters, with the W3C text bolded, and the reply unbolded.
The W3C I18N WG requests input from the UTC on how the XML specification should
be revised in regard to characters permitted in names. In particular, we would
welcome input on:
1. Why a policy of "anything goes" would be good/bad?
We presume that this means a policy that any characters are valid in names
except for special syntax characters like space, "<", '"', "'", etc.
Advantages: Parsing can be somewhat simpler, and more importantly, is
independent of Unicode version. When XML is viewed simply as a structured
protocol for exchange of data, this policy works well.
Disadvantages: If the XML is to be human-readable, there may be more
occasion for error if variables are free-form. In particular, this goes for
characters of type Z* or C*, which may cause problems because they are
invisible. It may also be disconcerting to see XML such as the following.
<abc ☺="3">...
<█↔ ♀="5" ⅓="7" ₣="12">...
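To make the concern about invisible characters concrete, here is a minimal sketch that flags characters in a candidate name whose general category is Z* or C*. The function name `invisible_chars` is hypothetical, and the categories come from whatever Unicode version ships with the Python runtime, so results may differ from those for any particular version of TUS.

```python
import unicodedata

def invisible_chars(name):
    """Return (index, code point, category) for each Z* or C* character in name.

    Assumption: categories are those of the Unicode data bundled with the
    running Python interpreter, not of a specific Unicode version.
    """
    hits = []
    for i, ch in enumerate(name):
        cat = unicodedata.category(ch)
        if cat[0] in ("Z", "C"):  # separators, controls, format, unassigned, ...
            hits.append((i, f"U+{ord(ch):04X}", cat))
    return hits

print(invisible_chars("abc"))        # nothing to flag
print(invisible_chars("ab\u200bc"))  # flags the ZERO WIDTH SPACE at index 2
```

A check like this only reports problem characters; whether they should be rejected outright is the policy question under discussion.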
1a. Why a policy of "anything except what the XML specification has already excluded" would be good/bad?
This is very much like the previous question, except that the disconcerting cases
would be limited to characters after a given cut-off point. If the goal is to
minimize the disconcerting cases, then the latest existing version of Unicode
at the time of the change to the XML specification should be used.
2. Why a policy of automatic linkage to versions of TUS (coupled with clear
rules for choosing name characters) would be good/bad?
Automatic linkage could work, given something like the following strategy (after the next question).
3. How a policy of automatic linkage to versions of TUS could work?
[Note: we did not get into this level of detail in the UTC -- it is the product of later work in the editorial committee.]
There are different possibilities. Here is one way to look at the options.
XML currently has the definitions:
[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*
These can be viewed more simply for the following discussion as:
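[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[4a] NameStart ::= Letter | '_' | ':'
[5] Name ::= NameStart (NameChar)*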
For each version V of Unicode that is 3.1 or later, generate the sets UX_NameStart[V], UX_NameChar[V], and Unassigned[V].
Now generate the XML name sets ("+" means set-union):
NameStart[V] = UX_NameStart[V] + NameStart[previous(V)]   if V >= 3.1
             = NameStart as of XML 1.0                    otherwise
NameChar[V]  = UX_NameChar[V] + NameChar[previous(V)]     if V >= 3.1
             = NameChar as of XML 1.0                     otherwise
Note: Because the new sets for each version never remove characters, the set of valid identifiers only grows, never contracts.
Note: Of course, one would not actually expect parsers to go through this process. To make this easy for parser implementers, explicit data tables containing the complete sets of code points for each version should be publicly available either on the W3C site or the Unicode site.
Call a document well-formed for Unicode Version V if and only if all of its names are in accordance with the above definition.
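The cumulative construction and the resulting name check can be sketched as follows. This is an illustration only: the character sets are tiny made-up stand-ins (real sets would come from the published data tables), and the helper names `accumulate` and `is_name` are hypothetical.

```python
# Toy stand-ins (assumptions): the real XML 1.0 and per-version sets are far
# larger and would be loaded from published data tables.
XML10_NAME_START = frozenset("abc_:")
XML10_NAME_CHAR = XML10_NAME_START | frozenset("012.-")
VERSIONS = ["3.1", "3.2"]  # Unicode versions 3.1 or later, in order
UX_NAME_START = {"3.1": frozenset("d"), "3.2": frozenset("e")}
UX_NAME_CHAR = {"3.1": frozenset("d3"), "3.2": frozenset("e4")}

def accumulate(baseline, additions, versions):
    """Return {V: baseline plus the additions of every version up to V}."""
    sets = {}
    current = baseline
    for v in versions:
        current = current | additions[v]  # set union: never removes characters
        sets[v] = current
    return sets

NAME_START = accumulate(XML10_NAME_START, UX_NAME_START, VERSIONS)
NAME_CHAR = accumulate(XML10_NAME_CHAR, UX_NAME_CHAR, VERSIONS)

def is_name(name, version=None):
    """Check a name against the sets for `version` (None means XML 1.0)."""
    if version is None:
        start, chars = XML10_NAME_START, XML10_NAME_CHAR
    else:
        start, chars = NAME_START[version], NAME_CHAR[version]
    return bool(name) and name[0] in start and all(c in chars for c in name[1:])

print(is_name("abd"))         # "d" is not in the toy XML 1.0 set
print(is_name("abd", "3.1"))  # "d" was added in the toy 3.1 data
print(is_name("abe", "3.1"))  # "e" only arrives in the toy 3.2 data
```

Because each version's set is a union with its predecessor, a name accepted for one version is accepted for every later version.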
Using this, there are three options for imposing on parsers. In all three options, if there is no Unicode declaration, the identifiers are as in XML 1.0. They differ only if there is a Unicode declaration for a version V, e.g. <?xml version="1.0" unicode_version="3.2" ?>.
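As a rough illustration of the first step a parser would take, the sketch below reads the declared version from the XML declaration and returns None when there is none, in which case the XML 1.0 name rules would apply. The `unicode_version` pseudo-attribute is the one used in the example above; the function name and the simplified regular expressions are assumptions, not a full XML declaration parser.

```python
import re

# Simplified patterns (assumptions): enough for the example declaration,
# not a complete implementation of the XML declaration grammar.
_DECL = re.compile(r'^<\?xml\s([^?]*)\?>')
_PSEUDO_ATTR = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\')')

def declared_unicode_version(document):
    """Return the unicode_version value from the XML declaration, or None."""
    m = _DECL.match(document)
    if not m:
        return None  # no declaration at all
    for key, double_quoted, single_quoted in _PSEUDO_ATTR.findall(m.group(1)):
        if key == "unicode_version":
            return double_quoted or single_quoted
    return None  # declaration present, but no Unicode version given

print(declared_unicode_version('<?xml version="1.0" unicode_version="3.2" ?><a/>'))
print(declared_unicode_version('<?xml version="1.0"?><a/>'))
```

How the returned version is then compared with the versions a parser supports is what distinguishes the options.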
In the following, let PV be the latest version of Unicode that the parser supports.
Option 1: Strict
This option has the following characteristics:
Option 2: Lenient
This option has the following characteristics:
Option 3: Very Lenient
This option has the following characteristics:
Note: For options 1 and 2, implementations must keep track of
multiple versions of Unicode: 3.1, 3.2, 4.0,...
4. What the rules for choosing name characters should be?
The current XML definitions (Appendix B Character
Classes + [4], [5]) are very close to what is provided in the Unicode
guidelines, being based on the same Unicode properties. The main difference is
that they remove compatibility decomposables. They are reasonable as they
stand.
5. What can be said, if anything, about the likelihood of suitable name
characters in future versions of TUS?
It is absolutely 100% certain that there will be future characters that meet
these criteria, and should be allowed in names, although their relative
importance will certainly diminish.
6. Anything else the UTC can think of that would help resolve this matter.
Please note that this matter is urgent and we would very much like an early
response from the UTC.
Many thanks,
Misha Wolf
W3C I18N WG Chair