Re: Questions on ZWNBS - for line initial holam plus alef

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 14 2003 - 10:01:28 EDT

    ----- Original Message -----
    From: "Jon Hanna" <jon@spin.ie>
    To: <unicode@unicode.org>
    Sent: Thursday, August 14, 2003 1:49 PM
    Subject: RE: Questions on ZWNBS - for line initial holam plus alef

    > > I do agree: an XML document could require the use at some place of a
    > > given attribute or element. If this attribute name follows the element
    > > name after a line break, which gets changed into a space during parsing,
    > > forcing XML parsers to treat SPACE+combining as an unbreakable grapheme
    > > cluster acting like a letter would have the effect of creating a new
    > > element name which may violate the element name identity. Now suppose
    > > that the attribute name contains a colon, you have created a custom
    > > namespace name, under which you can add any element you like, even if
    > > this was forbidden by the content-model of the reference schema.
    >
    > 1. SPACE is treated "blindly" as a SPACE by XML. String + space +
    > combining + string would not be treated as a single token, no matter how
    > that space was introduced. That's what you were complaining about in the
    > first place (as far as I can make out).
    > 2. While nmtokens can begin with a combining character names cannot, nor
    > can they contain spaces.
    > 3. This would in no way change the content-model. So even if the above
    > two points didn't hold they would only sneak the document past something
    > which performed validation before parsing(!), and where the
    > content-model was already pretty loose (so it didn't complain about the
    > unrecognised attribute).
    >
    > You've just discovered a way to disguise one document that isn't
    > well-formed as a different document that isn't well-formed. l33t!
    >
    > > So this would invalidate existing documents, or create holes allowing
    > > insertion of arbitrary XML content, if the XML application is not
    > > validating extremely strictly the element names (the pair namespace+
    > > name) and exclude completely from processing any unrecognized
    > > element (including all its content and attributes).
    >
    > This argument is not on friendly terms with the concept of causality.
    >
    > > This would be a
    > > breach in the content model which may have been validated and tested
    > > for security in another layer of the document encoding process (notably
    > > when XML documents are created from templates, such as XSL
    > > processors, or custom C source using simple template substitution).
    >
    > Testing validity without testing well-formedness is not possible.
    >
    > > So for me the sequence SPACE+combining should not be acceptable
    > > as a valid grapheme cluster within element names or attribute names,
    >
    > As it already isn't.
    >
    > > and thus would need to be excluded from NMTOKEN. The correct
    > > way to do it is to consider it NOT A LETTER, but a symbol (Sk),
    > > exactly like other spacing diacritics, which are already invalid in
    > > NMTOKEN.
    >
    > Wait a second. That was my justification for why the fact that
    > space+combining is ALREADY prohibited from NMTOKEN shouldn't be
    > considered a failure on the part of XML to allow for freedom of choice
    > with the strings used for NMTOKENs. Now you actually want to introduce
    > this (already existent) feature.
    >
    > > There still remains the unresolved question of grapheme clusters
    > > that could span the starting "<" or ending ">" or "/>" of tags, or
    > > the leading "&" of an entity reference.
    >
    > No there isn't. What goes before <, >, / or & isn't a problem since
    > those are all non-combining characters and a new unit for any sort of
    > processing treating more than one codepoint as a unit. What goes after
    > < or & has to be a name (not an nmtoken) and as such is already
    > prohibited from beginning with a combiner. What goes after > is already
    > dealt with by the Charmod, and even if you ignore charmod apart from the
    > possibility of normalisation turning the sequence U+003E, U+0338 into
    > U+226E (a possibility that is well noted) it still isn't going to hurt.

    One note: in Unicode, grapheme clusters (considered unbreakable) are more
    than just combining sequences! Look at CGJ, WJ, ZWJ, ...
    So what comes after or *before* a base character may affect the parsing
    of grapheme clusters!
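
    To illustrate why a test for "combining characters" alone does not cover
    the characters named above, here is a small sketch using Python's
    unicodedata module. It only reports character properties; it does not
    perform grapheme cluster segmentation, and whether each character
    actually extends a cluster depends on the segmentation rules in use.
    The describe() helper is a made-up name.

```python
import unicodedata

# The characters mentioned above, plus an ordinary combining accent for contrast.
CANDIDATES = ["\u0301", "\u034f", "\u200d", "\u2060"]


def describe(ch: str) -> tuple:
    """Hypothetical helper: (name, general category, canonical combining class)."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))


properties = {f"U+{ord(ch):04X}": describe(ch) for ch in CANDIDATES}

# A naive filter that only looks at the canonical combining class keeps the
# accent and drops all three joiners.
naive_combiners = [cp for cp, (_, _, ccc) in properties.items() if ccc != 0]

for code_point, props in properties.items():
    print(code_point, props)
print("naive filter finds:", naive_combiners)
```

    CGJ is a nonspacing mark with combining class 0, while ZWJ and WJ are
    format characters, so a filter on the combining class (or on mark
    categories only) would not see all of them.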

    As the well-formedness of an XML document is checked before its validity
    (validation is optional, but required by some applications that work on
    the DOM tree or InfoSet rather than on the raw text), this affects the way
    Unicode can be used (read it as "embedded") within XML. Depending on where
    the encoded text is used (NMTOKENs, text elements, attribute values, ...)
    the embedding constraints will differ, but in my opinion anonymous text
    elements and attribute values should both offer the same encoding
    capabilities, as both can (or should be able to) represent any kind of
    valid Unicode plain text.

    As SPACE is handled differently in attribute values, this causes a
    problem for SPACE+NSM (considered valid, but with imprecise properties
    for now).
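
    The difference can be observed with a short sketch, again assuming
    Python's xml.etree.ElementTree; the element name "r" and attribute name
    "a" are arbitrary. The same characters (a line break followed by a
    nonspacing mark) are placed in an attribute value and in element content.

```python
import xml.etree.ElementTree as ET

NSM = "\u0301"  # a nonspacing mark (COMBINING ACUTE ACCENT)

# Literal line break + NSM, once in an attribute value and once in text.
doc = '<r a="x\n' + NSM + 'y">x\n' + NSM + "y</r>"
root = ET.fromstring(doc)

attr_value = root.get("a")  # attribute-value normalization applies here
text_value = root.text      # element content keeps the line break

print(repr(attr_value))
print(repr(text_value))
```

    In the attribute value the line break is replaced by SPACE, so the
    application receives SPACE+NSM; in the element content it still
    receives the line break followed by the NSM.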

    The constraints are less severe in anonymous text elements, as several
    techniques (including CDATA sections) exist to represent them.
    In fact, XML will consider each text element or attribute value as an
    independent text that should be independently legal for Unicode, but
    embedding them in a well-formed and validated XML document may create
    a document that is now ill-formed, or that is parsed incorrectly by
    the Unicode analysis of grapheme clusters.

    My opinion is that there's an apparent conflict here which can prohibit
    using a fully conforming Unicode text editor for XML documents, only
    because of grapheme cluster boundaries, input methods, etc. When editing
    an XML document, the XML syntax MUST be well-formed at the first level,
    and only then does Unicode well-formedness come in second place.

    If there are places where both constraints cannot be made compatible
    with each other, either one can use a replacement encoding technique for
    Unicode, or XML must allow using an escaping mechanism to allow
    safe embedding of Unicode plain-text strings into the XML document,
    without needing any change in the generated InfoSet or DOM tree.
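
    As one candidate for such an escaping mechanism, numeric character
    references can be tried. The sketch below assumes Python's
    xml.etree.ElementTree and arbitrary element and attribute names; it only
    shows what the parser reports for these inputs, not that references
    cover every case raised in this thread.

```python
import xml.etree.ElementTree as ET

NSM = "\u0301"  # a nonspacing mark

# In element content, a character reference and the literal character
# produce the same parsed text.
literal_text = ET.fromstring("<r>x" + NSM + "</r>").text
escaped_text = ET.fromstring("<r>x&#x301;</r>").text

# In an attribute value, a literal line break is normalized to SPACE,
# while a line break written as a character reference is kept.
literal_attr = ET.fromstring('<r a="x\n' + NSM + 'y"/>').get("a")
escaped_attr = ET.fromstring('<r a="x&#10;&#x301;y"/>').get("a")

print(repr(literal_text), repr(escaped_text))
print(repr(literal_attr), repr(escaped_attr))
```

    The escaped form keeps the combining mark out of the literal source
    text, so a plain-text editor never sees it adjacent to markup, while
    the parsed tree still carries the intended characters.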

    If such escaping is impossible, then there's a hole in the XML syntax
    that will have one consequence: authors will start violating either the
    well-formedness of XML documents (using implementation-dependent
    tolerance "features"), or will use "defective" or even "illegal" Unicode
    sequences (which will appear as legal to an XML parser that considers
    XML well-formedness an absolute requirement for security). In both
    cases, the XML documents will no longer be editable with plain-text
    editors, but only with specific XML editors that don't parse and manage
    the document as a plain-text file in the first place.


