Re: Questions on ZWNBS - for line initial holam plus alef

From: Philippe Verdy
Date: Wed Aug 13 2003 - 14:09:04 EDT

    From: "Peter Kirk" <>

    > There is some potential for real trouble here, if one process outputs
    > NMTOKEN starting with a combining character preceded by a separating
    > space, or something else which is changed into a space, and another
    > process takes the new space plus combining character as a unit and so
    > doesn't recognise the separation. Any hackers and virus programmers
    > reading this will soon start flooding the Internet with tokens
    > with combining characters in the hope of crashing implementations or
    > finding back doors. Of course this wouldn't have been a problem if
    > Unicode had never defined space plus combining character as legal and
    > meaningful. But this is not my problem!

    I do agree: an XML document could require the use, at some place, of
    a given attribute or element. If this attribute name follows the
    element name after a line break, which gets changed into a space
    during parsing, then allowing XML parsers to treat SPACE+combining as
    an unbreakable grapheme cluster acting like a letter would have the
    effect of creating a new name, which may violate the element name
    identity. Now suppose that the attribute name contains a colon: you
    have created a custom namespace name, under which you can add any
    element you like, even if this was forbidden by the content model of
    the reference schema.
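
    As a rough illustration of that disagreement between two processes,
    here is a small Python sketch. The helper naive_clusters is
    hypothetical and only a simplified stand-in for grapheme clustering
    (it is not the full UAX #29 algorithm, and no real XML parser is
    claimed to behave this way); the sample tag text is made up as well.

```python
import unicodedata

def naive_clusters(text):
    """Group each character with the combining marks that follow it.

    Hypothetical helper: a simplified stand-in for grapheme clustering,
    not the full UAX #29 algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the previous character
        else:
            clusters.append(ch)
    return clusters

# Made-up tag content: the attribute name starts with
# U+0301 COMBINING ACUTE ACCENT right after the separating space.
tag = 'elem \u0301attr="v"'

# Process A splits on the space character and sees two tokens.
code_point_tokens = tag.split(" ")

# Process B works on clusters: the space has absorbed the mark,
# so it no longer sees a plain separating space at all.
cluster_view = naive_clusters(tag)
space_is_separate = " " in cluster_view
```

    Here process A still sees a separator, while in process B's view
    the space and the accent form one unit, so the element name and the
    attribute name are no longer visibly separated.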

    So this would invalidate existing documents, or create holes allowing
    insertion of arbitrary XML content, if the XML application does not
    validate the element names (the namespace+name pair) extremely
    strictly and completely exclude any unrecognized element (including
    all its content and attributes) from processing. This would be a
    breach in the content model, which may have been validated and tested
    for security in another layer of the document encoding process
    (notably when XML documents are created from templates, such as by
    XSL processors, or by custom C source using simple template
    substitution).

    So for me the sequence SPACE+combining should not be acceptable
    as a valid grapheme cluster within element names or attribute names,
    and thus would need to be excluded from NMTOKEN. The correct
    way to do it is to consider it NOT A LETTER, but a symbol (Sk),
    exactly like other spacing diacritics, which are already invalid in

    There still remains the unresolved question of grapheme clusters
    that could span the starting "<" or ending ">" or "/>" of tags, or
    the leading "&" of an entity reference. For this reason, defective
    combining sequences (combining characters without a leading base
    character) should be forbidden (invalid for XML).
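
    A check along these lines could be sketched as follows in Python.
    This is only an assumption about how such a rule might be applied:
    the function name find_unsafe_marks and the delimiter set are
    hypothetical, chosen from the cases listed above, and are not taken
    from any XML specification.

```python
import unicodedata

# Assumed set of markup characters a combining mark must not attach to.
MARKUP_DELIMITERS = set("<>&/")

def is_mark(ch):
    """True for characters in the Unicode general categories Mn, Mc, Me."""
    return unicodedata.category(ch).startswith("M")

def find_unsafe_marks(text):
    """Return the indices of combining marks that either have no base
    character at all, or whose base is a markup delimiter."""
    unsafe = []
    base = None  # most recent non-mark character seen so far
    for i, ch in enumerate(text):
        if is_mark(ch):
            if base is None or base in MARKUP_DELIMITERS:
                unsafe.append(i)
        else:
            base = ch
    return unsafe

# A mark right after ">" would form a cluster spanning the end of the tag.
spanning = find_unsafe_marks("<a>\u0301b</a>")
# A mark after an ordinary letter is fine.
ordinary = find_unsafe_marks("<a>e\u0301</a>")
```

    A document for which this returns a non-empty list would be
    rejected under the rule proposed above.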

    So there remains an unsolved conflict here: defective combining
    sequences cause security or validity problems in XML documents,
    and a non-defective SPACE+combining sequence also causes
    security problems. There's no secure choice to represent
    spacing diacritics which are not already encoded in a precomposed

