L2/01-229
From: "John Cowan"
Date: May 22, 2001

More input on Unicode in XML names

And here's some more. This design considers backward compatibility essential.

This is the technical content of a preliminary proposal for XML 1.0.1 names. It adds new "name start" and "name" characters (to re-adopt SGML terminology) to the XML 1.0 set to handle the extensions and corrections to Unicode between versions 2.0 and 3.1. This version can be used either as a "step" proposal or as the initial state of a "tracking Unicode" proposal.

The W3C Core XML WG and the W3C I18N IG have seen this proposal, but have not (yet) acted on it; it is at present a personal product of John Cowan (who belongs to both groups).

The EXCLUSIONS override the rules given in the BASIC paragraphs, and the INCLUSIONS override the EXCLUSIONS. Rationales are given in the footnotes.

BASIC NAME START CHARACTERS

Unicode categories Lu (upper-case letters), Ll (lower-case letters), Lt (title-case letters), Lm (modifier letters), Lo (other letters), and Nl (numeric letters) [1]

BASIC NAME CHARACTERS

Unicode categories Mn (non-spacing combining marks), Mc (Indic vowel marks), Nd (numeric digits), and Pc (connector punctuation) [2]

EXCLUSIONS

Compatibility characters [3]
Compatibility area characters (F900 to FFFD, 2F800 to 2FFFD) [3]
Musical symbol combining characters (1D165 to 1D1AD) [4]

NAME START INCLUSIONS

U+003A COLON [5]
U+005F LOW LINE [5]

NAME INCLUSIONS

U+002D HYPHEN-MINUS [5]
U+002E FULL STOP [5]
U+00B7 MIDDLE DOT [6]
U+FA0E CJK COMPATIBILITY IDEOGRAPH-FA0E [7]
U+FA0F CJK COMPATIBILITY IDEOGRAPH-FA0F [7]
U+FA11 CJK COMPATIBILITY IDEOGRAPH-FA11 [7]
U+FA13 CJK COMPATIBILITY IDEOGRAPH-FA13 [7]
U+FA14 CJK COMPATIBILITY IDEOGRAPH-FA14 [7]
U+FA1F CJK COMPATIBILITY IDEOGRAPH-FA1F [7]
U+FA21 CJK COMPATIBILITY IDEOGRAPH-FA21 [7]
U+FA23 CJK COMPATIBILITY IDEOGRAPH-FA23 [7]
U+FA24 CJK COMPATIBILITY IDEOGRAPH-FA24 [7]
U+FA27 CJK COMPATIBILITY IDEOGRAPH-FA27 [7]
U+FA28 CJK COMPATIBILITY IDEOGRAPH-FA28 [7]
U+FA29 CJK COMPATIBILITY IDEOGRAPH-FA29 [7]

BACKWARD COMPATIBILITY NAME START CHARACTERS

U+03D0 GREEK BETA SYMBOL [8]
U+03D1 GREEK THETA SYMBOL [8]
U+03D2 GREEK UPSILON WITH HOOK SYMBOL [8]
U+03D5 GREEK PHI SYMBOL [8]
U+03D6 GREEK PI SYMBOL [8]
U+03F0 GREEK KAPPA SYMBOL [8]
U+03F1 GREEK RHO SYMBOL [8]
U+03F2 GREEK LUNATE SIGMA SYMBOL [8]
U+0675 ARABIC LETTER HIGH HAMZA ALEF [8]
U+0676 ARABIC LETTER HIGH HAMZA WAW [8]
U+0677 ARABIC LETTER U WITH HAMZA ABOVE [8]
U+0678 ARABIC LETTER HIGH HAMZA YEH [8]
U+0E33 THAI CHARACTER SARA AM [9]
U+0EB3 LAO VOWEL SIGN AM [9]
U+0F77 TIBETAN VOWEL SIGN VOCALIC RR [9]
U+0F79 TIBETAN VOWEL SIGN VOCALIC LL [9]
U+1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING [9]
U+212E ESTIMATED SYMBOL [8]

BACKWARD COMPATIBILITY NAME CHARACTERS

U+0387 GREEK ANO TELEIA [10]
U+06DD ARABIC END OF AYAH [11]
U+06DE ARABIC START OF RUB EL HIZB [11]

Footnotes:

[1] This is the set of Unicode 3.1 identifier start characters. It includes the Lm (modifier letter) category, whose members were permitted only as name characters, not as name start characters, in XML 1.0. However, the Unicode Consortium has since clarified that modifier letters appear at the beginnings of words in some natural languages.

[2] This covers most of the Unicode 3.1 identifier part characters. The Pc (connector punctuation) category, which was not used in XML 1.0, brings in U+203F UNDERTIE, U+2040 CHARACTER TIE, and U+30FB KATAKANA MIDDLE DOT. (It also includes "_", but that is promoted to a name start character by the INCLUSIONS.) The Cf (formatting) characters, which are allowed in Unicode 3.1 identifiers but are supposed to be filtered out before comparisons are made, are inappropriate for XML because of its exact-comparison rules.

[3] Compatibility characters (that is, characters with compatibility decompositions), as well as characters in the compatibility area, are excluded by XML 1.0, and this proposal continues to exclude them, now using the Unicode 3.1 compatibility decompositions. The new Plane 2 compatibility ideographs are likewise excluded. Where a character has been moved from non-compatibility in Unicode 2.0 to compatibility in Unicode 3.0, it is explicitly specified in the INCLUSIONS.

[4] These newly introduced combining characters are meant to be used only with musical symbols, not with letters, so they are excluded.

[5] These characters are expressly included by XML 1.0. Note that LOW LINE is the ISO/Unicode name for "_".

[6] U+00B7 MIDDLE DOT is primarily a punctuation character, but in the Catalan language it functions as a modifier letter. It is therefore included, as was done in XML 1.0 as well.

[7] These ideographs are not and never were compatibility ideographs. As noted in Unicode 3.0, they constitute a small set which was originally excluded from the main block of ideographs because they appeared only in industry character sets rather than in official national character sets.

[8] These characters are no longer considered letters in Unicode 3.1, but symbols. However, they are still allowed for backward compatibility with XML 1.0.

[9] These letters are given compatibility decompositions in Unicode 3.1, whereas they had canonical decompositions in Unicode 2.0. They would therefore now be excluded by the EXCLUSIONS, but are retained for backward compatibility with XML 1.0.

[10] U+0387 GREEK ANO TELEIA is a punctuation character in Greek contexts, but it is canonically equivalent to MIDDLE DOT, and so was included in XML 1.0. It is retained solely for backward compatibility with XML 1.0.

[11] The Unicode Me category (enclosing combining marks) was permitted in XML 1.0 name characters, although most of its members were in fact excluded in one way or another. Enclosing combining marks are primarily for use with symbols, but these few are retained solely for backward compatibility with XML 1.0.

--
John Cowan
cowan@ccil.org
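
The precedence stated in the proposal (BASIC categories, overridden by EXCLUSIONS, overridden in turn by INCLUSIONS and the backward-compatibility lists) can be illustrated with a short sketch. This is not part of the proposal: the function names are hypothetical, the lists are read literally as given above (so the FA0E to FA29 ideographs count only as name characters), a compatibility decomposition is detected by the "<tag>" prefix that Python's unicodedata.decomposition() returns, and Python's Unicode tables reflect a newer Unicode version than 3.1, so individual category results may differ from what the proposal intends.

```python
import unicodedata

# BASIC rules: Unicode general categories.
NAME_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
NAME_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}

# INCLUSIONS and backward-compatibility lists, copied from the proposal.
NAME_START_INCLUSIONS = {0x003A, 0x005F}
NAME_INCLUSIONS = {
    0x002D, 0x002E, 0x00B7,
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
}
BACKCOMPAT_NAME_START = {
    0x03D0, 0x03D1, 0x03D2, 0x03D5, 0x03D6, 0x03F0, 0x03F1, 0x03F2,
    0x0675, 0x0676, 0x0677, 0x0678, 0x0E33, 0x0EB3, 0x0F77, 0x0F79,
    0x1E9A, 0x212E,
}
BACKCOMPAT_NAME = {0x0387, 0x06DD, 0x06DE}


def is_excluded(cp: int) -> bool:
    """EXCLUSIONS: compatibility areas, musical combining marks,
    and characters with a compatibility decomposition."""
    if 0xF900 <= cp <= 0xFFFD or 0x2F800 <= cp <= 0x2FFFD:
        return True
    if 0x1D165 <= cp <= 0x1D1AD:
        return True
    # Assumption: compatibility mappings are the ones tagged "<...>".
    return unicodedata.decomposition(chr(cp)).startswith("<")


def is_name_start(ch: str) -> bool:
    cp = ord(ch)
    # Explicit lists win over the exclusions.
    if cp in NAME_START_INCLUSIONS or cp in BACKCOMPAT_NAME_START:
        return True
    if is_excluded(cp):
        return False
    return unicodedata.category(ch) in NAME_START_CATEGORIES


def is_name_char(ch: str) -> bool:
    cp = ord(ch)
    # Every name start character is also a name character.
    if is_name_start(ch):
        return True
    if cp in NAME_INCLUSIONS or cp in BACKCOMPAT_NAME:
        return True
    if is_excluded(cp):
        return False
    return unicodedata.category(ch) in NAME_EXTRA_CATEGORIES


def is_name(s: str) -> bool:
    """True if s is non-empty, begins with a name start character,
    and continues with name characters only."""
    return bool(s) and is_name_start(s[0]) and all(is_name_char(c) for c in s[1:])
```

For example, is_name("foo-bar.baz") holds because "-" and "." are NAME INCLUSIONS, while is_name("-foo") fails because HYPHEN-MINUS is not a name start character. Checking the explicit lists before the exclusions is what lets U+FA0E survive the compatibility-area range test.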