Suggestion regarding identifiers in XML (Name and Nmtoken)
for the next version of XML


Kent Karlsson


Syntax for XML 1.0 identifiers

Name ::= (Letter | '_' | ':') (NameChar)*

Nmtoken ::= (NameChar)+

NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender

/* the short dash in NameChar is probably meant to be HYPHEN-MINUS (U+002D)... */

In annex B:

Letter ::= BaseChar | Ideographic                           [intent: {Lu}, {Ll}, {Lt}, {Lo}, {Nl}?]

BaseChar ::= /* long list in annex B */

Ideographic ::= /* list of ranges in annex B */

Digit ::= /* list of ranges in annex B */                      [intent: {Nd}, missing {No}]

CombiningChar ::= /* list in annex B */                    [includes {Mc}, {Mn}, {Me}(?)]

Extender ::= /* list in annex B */                                [includes {Lm}?]

[I have not yet analysed the lists in annex B; that is left for a revised version of this document. The annex does, however, contain commentary about intent.]




Unicode identifier syntax recommendations:

Identifier ::= IdentifierStart (IdentifierStart | IdentifierExtend)*

IdentifierStart ::= {Lu} | {Ll} | {Lt} | {Lm} | {Lo} | {Nl}

IdentifierExtend ::= {Mn} | {Mc} | {Nd} | {Pc} | {Cf}
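
The recommendation above can be checked mechanically from general category values. The following is a minimal Python sketch, not a normative implementation; the function name is_unicode_identifier and the db parameter are assumptions made for illustration. Python's unicodedata module uses whatever Unicode version ships with the interpreter, and also exposes unicodedata.ucd_3_2_0 for the Unicode 3.2 database.

```python
import unicodedata

# General categories taken from the grammar above.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc", "Cf"}


def is_unicode_identifier(text, db=unicodedata):
    """Return True if text matches
    IdentifierStart (IdentifierStart | IdentifierExtend)*.

    db is any object with a category() function (hypothetical parameter),
    e.g. the unicodedata module itself or unicodedata.ucd_3_2_0.
    """
    if not text:
        return False
    if db.category(text[0]) not in START_CATEGORIES:
        return False
    allowed = START_CATEGORIES | EXTEND_CATEGORIES
    return all(db.category(ch) in allowed for ch in text[1:])
```

With this sketch, "a_1" is accepted (LOW LINE is {Pc}, the digit is {Nd}), while "_a", "1a" and "a-b" are rejected, since {Pc} and {Nd} cannot start an identifier and HYPHEN-MINUS is {Pd}.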



Suggestion for new syntax for XML (v. 2.0?) identifiers

Assume that the Unicode version is declared something like

            <?xml version="2.0" unicode-version="3.2"?>

for XML documents (v. 2.0(?) or later), and that the identifier syntax for XML v. 2.0 is defined in terms of Unicode character property values, rather than by listing code points directly in the XML syntax.

Identity of identifiers (Name and Nmtoken) should be based on compatibility equivalence (see NFKC or NFKD), with these additional equivalences: all {Pd} characters are equivalent to each other, all {Pc} characters are equivalent to each other, and all full stops are equivalent to each other. (If a subset of the {Cf} characters is allowed, those characters should be ignored when comparing identifiers for equality.)
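
One way to picture this identity rule is as a comparison key: two identifiers are the same if their keys are equal. The Python sketch below is only an illustration under stated assumptions. The name identifier_key, the choice of NFKC rather than NFKD, the representative characters '-', '_' and '.', and the set of characters treated as full stops are all assumptions, not part of the proposal.

```python
import unicodedata

# Assumption: after NFKC, the "full stops" are FULL STOP and
# IDEOGRAPHIC FULL STOP (fullwidth and halfwidth forms fold to these).
FULL_STOPS = {".", "\u3002"}


def identifier_key(name):
    """Fold a name to a key; names with equal keys are treated as identical."""
    out = []
    for ch in unicodedata.normalize("NFKC", name):
        cat = unicodedata.category(ch)
        if cat == "Cf":
            continue             # format characters are ignored
        if cat == "Pd":
            out.append("-")      # all dashes compare equal
        elif cat == "Pc":
            out.append("_")      # all connector punctuation compares equal
        elif ch in FULL_STOPS:
            out.append(".")      # all full stops compare equal
        else:
            out.append(ch)
    return "".join(out)
```

For example, identifier_key("a\u2010b") and identifier_key("a-b") both give "a-b", because HYPHEN (U+2010) is {Pd}. The comparison remains case sensitive.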

This proposal makes no great effort to be backward compatible with the strange edge cases that XML 1.0 allows. Hopefully they have never been used anyway...

Name ::= NameStart (Connect? NmtokenStart)*

Nmtoken ::= NmtokenStart (Connect? NmtokenStart)*

NameStart ::= Letter | ':' | {Pc}

            /* [{Pc} generalises LOW LINE; move ':' and {Pc} to Connect!] */

NmtokenStart ::= NameStart | {Nd} | {No}

Letter ::= ({Ll} | {Lu} | {Lt} | {Lo} | {Lm} | {Nl}[??]) ({Mc} | {Mn})*

            /* [{Nl}s and other letters with compatibility decompositions will be
            NFK*ed away...; IDSes?; Hangul Jamo?; language independent
            grapheme syntax?] */

Connect ::= {Pd} | '.' | {Ideographic full stop}

            /* [{Pd} generalises HYPHEN-MINUS; true apostrophe?; middle dot?] */

            /* [move ':' and {Pc} here?] */
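
The suggested productions can be tried out with a small matcher. The Python sketch below maps each character to a one-letter class and then applies a regular expression that mirrors Name and Nmtoken as written above. All function names and class letters are hypothetical, and {Ideographic full stop} is assumed to mean U+3002 only.

```python
import re
import unicodedata

LETTER = {"Ll", "Lu", "Lt", "Lo", "Lm", "Nl"}
MARK = {"Mc", "Mn"}
NUMBER = {"Nd", "No"}


def _classify(ch):
    """Map one character to a class letter used by the regexes below."""
    if ch == ":":
        return "P"               # ':' is allowed wherever {Pc} is
    if ch in ".\u3002":
        return "C"               # full stops belong to Connect
    cat = unicodedata.category(ch)
    if cat in LETTER:
        return "L"
    if cat in MARK:
        return "M"
    if cat == "Pc":
        return "P"
    if cat in NUMBER:
        return "D"
    if cat == "Pd":
        return "C"               # dashes belong to Connect
    return "X"                   # anything else is not allowed


# NameStart ::= Letter | ':' | {Pc}, where Letter is a base plus marks.
_NAME_START = "(?:LM*|P)"
# NmtokenStart ::= NameStart | {Nd} | {No}
_NMTOKEN_START = "(?:LM*|P|D)"
_NAME_RE = re.compile(f"{_NAME_START}(?:C?{_NMTOKEN_START})*")
_NMTOKEN_RE = re.compile(f"{_NMTOKEN_START}(?:C?{_NMTOKEN_START})*")


def _shape(text):
    return "".join(_classify(ch) for ch in text)


def is_name(text):
    return _NAME_RE.fullmatch(_shape(text)) is not None


def is_nmtoken(text):
    return _NMTOKEN_RE.fullmatch(_shape(text)) is not None
```

Under this sketch "foo-bar", "xml:lang" and "a.b" are Names, while "foo--bar" and "foo-" are not, because a Connect character must be followed by an NmtokenStart. Strings such as "a-" or "a..b" are valid XML 1.0 Names, so these are examples of the edge cases that the suggested syntax no longer accepts.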


This is just a first suggestion, and comments are welcome.