RE: Questions on ZWNBS - for line initial holam plus alef

From: Jon Hanna (
Date: Thu Aug 14 2003 - 07:49:24 EDT

    > I do agree: an XML document could require the use at some place of a
    > given attribute or element. If this attribute name follows the element
    > name after a line break, which gets changed into a space during parsing,
    > forcing XML parsers to treat SPACE+combining as an unbreakable grapheme
    > cluster acting like a letter would have the effect of creating a new
    > element name which may violate the element name identity. Now suppose
    > that the attribute name contains a colon, you have created a custom
    > namespace name, under which you can add any element you like, even if
    > this was forbidden by the content-model of the reference schema.

    1. SPACE is treated "blindly" as a SPACE by XML. String + space + combining
    + string would not be treated as a single token, no matter how that space
    was introduced. That's what you were complaining about in the first place
    (as far as I can make out).
    2. While nmtokens can begin with a combining character, names cannot, nor can
    they contain spaces.
    3. This would in no way change the content-model. So even if the above two
    points didn't hold they would only sneak the document past something which
    performed validation before parsing(!), and where the content-model was
    already pretty loose (so it didn't complain about the unrecognised

    You've just discovered a way to disguise one document that isn't well-formed
    as a different document that isn't well-formed. l33t!

    > So this would invalidate existing documents, or create holes allowing
    > insertion of arbitrary XML content, if the XML application is not
    > validating extremely strictly the element names (the pair namespace+
    > name) and exclude completely from processing any unrecognized
    > element (including all its content and attributes).

    This argument is not on friendly terms with the concept of causality.

    > This would be a
    > breach in the content model which may have been validated and tested
    > for security in another layer of the document encoding process (notably
    > when XML documents are created from templates, such as XSL
    > processors, or custom C source using simple template substitution).

    Testing validity without testing well-formedness is not possible.

    > So for me the sequence SPACE+combining should not be acceptable
    > as a valid grapheme cluster within element names or attribute names,

    As it already isn't.
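
    A rough way to check this, sketched with Python's expat-based parser
    (the helper is_well_formed and the sample documents are illustrative
    assumptions, not anything from the XML spec itself): a combining mark
    is accepted inside a name, but a line break followed by a combining
    mark leaves the mark at the start of what would have to be an
    attribute name, and the parser is expected to reject that.

```python
import xml.etree.ElementTree as ET

COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT


def is_well_formed(doc: str) -> bool:
    """Return True if the expat-based parser accepts doc as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Combining mark in the middle of an attribute name: a legal name character.
inside_name = f'<root at{COMBINING_ACUTE}tr="1"/>'

# Line break, then a combining mark: the mark would have to *start* the
# attribute name, which the Name production does not allow.
after_space = f'<root\n{COMBINING_ACUTE}attr="1"/>'

print("combining mark inside name:", is_well_formed(inside_name))
print("combining mark after line break:", is_well_formed(after_space))
```

    The second document fails at the well-formedness stage, so no
    content-model check ever gets to see it.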

    > and thus would need to be excluded from NMTOKEN. The correct
    > way to do it is to consider it NOT A LETTER, but a symbol (Sk),
    > exactly like other spacing diacritics, which are already invalid in
    > NMTOKEN.

    Wait a second. That was my justification for why the fact that
    space+combining is ALREADY prohibited from NMTOKEN shouldn't be considered a
    failure on the part of XML to allow for freedom of choice with the strings
    used for NMTOKENs. Now you actually want to introduce this (already
    existent) feature.

    > There still remains the unresolved question of grapheme clusters
    > that could span the starting "<" or ending ">" or "/>" of tags, or
    > the leading "&" of an entity reference.

    No there isn't. What goes before <, >, / or & isn't a problem, since those
    are all non-combining characters and so start a new unit for any sort of
    processing that treats more than one codepoint as a unit. What goes after <
    or & has to be a name (not an nmtoken) and as such is already prohibited
    from beginning with a combiner. What goes after > is already dealt with by
    Charmod, and even if you ignore Charmod, apart from the possibility of
    normalisation turning the sequence U+003E, U+0338 into U+226F (a
    possibility that is well noted), it still isn't going to hurt.
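
    The normalisation case can be reproduced with Python's unicodedata
    module. This is only a sketch: starts_with_combiner is a hypothetical
    helper, and testing for a non-zero canonical combining class is an
    approximation of what Charmod means by a composing character. The
    snippet prints the code point that NFC actually produces rather than
    assuming it.

```python
import unicodedata

GREATER_THAN = "\u003e"                    # the ">" that closes a tag
COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"  # first character of the content

sequence = GREATER_THAN + COMBINING_LONG_SOLIDUS_OVERLAY
composed = unicodedata.normalize("NFC", sequence)

# Two code points go in; NFC composes them into one precomposed character.
print(len(sequence), "->", len(composed))
for ch in composed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")


def starts_with_combiner(text: str) -> bool:
    """Approximate check: does text begin with a combining character?"""
    return bool(text) and unicodedata.combining(text[0]) != 0


# Content that begins with a combiner is what a pre-normalisation check
# would want to flag.
print(starts_with_combiner(COMBINING_LONG_SOLIDUS_OVERLAY + "text"))
print(starts_with_combiner("text"))
```

    After composition the ">" is gone as a separate code point, which is
    why normalising after markup has been assembled is the case worth
    noting.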
