RE: Questions on ZWNBS - for line initial holam plus alef

From: Jon Hanna (jon@spin.ie)
Date: Thu Aug 14 2003 - 07:49:24 EDT

Next message: Marco Cimarosti: "RE: Handwritten EURO sign (off topic?)"

Previous message: John Cowan: "Re: Handwritten EURO sign (off topic?)"
In reply to: Philippe Verdy: "Re: Questions on ZWNBS - for line initial holam plus alef"
Next in thread: Philippe Verdy: "Re: Questions on ZWNBS - for line initial holam plus alef"
Reply: Philippe Verdy: "Re: Questions on ZWNBS - for line initial holam plus alef"

> I do agree: an XML document could require the use at some place of a
> given attribute or element. If this attribute name follows the element
> name after a line break, which gets changed into a space during parsing,
> forcing XML parsers to treat SPACE+combining as an unbreakable grapheme
> cluster acting like a letter would have the effect of creating a new
> element name which may violate the element name identity. Now suppose
> that the attribute name contains a colon, you have created a custom
> namespace name, under which you can add any element you like, even if
> this was forbidden by the content-model of the reference schema.

1. SPACE is treated "blindly" as a SPACE by XML. String + space + combining
+ string would not be treated as a single token, no matter how that space
was introduced. That's what you were complaining about in the first place
(as far as I can make out).
2. While nmtokens can begin with a combining character, names cannot, nor can
they contain spaces.
3. This would in no way change the content-model. So even if the above two
points didn't hold, the trick would only sneak the document past something
that performed validation before parsing(!), and where the content-model was
already pretty loose (so it didn't complain about the unrecognised
attribute).
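
Points 1 and 2 can be checked mechanically. The following is a minimal sketch, assuming the Name and Nmtoken productions as written in XML 1.0 (Fifth Edition), which postdates this message; for the sample character used here (U+0301 COMBINING ACUTE ACCENT) the editions current in 2003 give the same answers. The helper names `is_name` and `is_nmtoken` are made up for illustration.

```python
import re

# Character classes transcribed from the NameStartChar and NameChar
# productions of XML 1.0 (Fifth Edition). Assumption: that edition is used
# as a stand-in; it is not the edition that was current in 2003.
NAME_START = (
    ":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# Characters allowed in a name, but not as its first character.
NAME_EXTRA = "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = re.compile(f"[{NAME_START}][{NAME_START}{NAME_EXTRA}]*")
NMTOKEN_RE = re.compile(f"[{NAME_START}{NAME_EXTRA}]+")


def is_name(s: str) -> bool:
    """True if s matches the Name production (hypothetical helper)."""
    return NAME_RE.fullmatch(s) is not None


def is_nmtoken(s: str) -> bool:
    """True if s matches the Nmtoken production (hypothetical helper)."""
    return NMTOKEN_RE.fullmatch(s) is not None


COMBINING = "\u0301"  # sample combining character

print(is_nmtoken(COMBINING + "abc"))        # nmtoken may start with a combiner
print(is_name(COMBINING + "abc"))           # a name may not
print(is_name("a " + COMBINING + "b"))      # a space never fits inside a name
print(is_nmtoken("a " + COMBINING + "b"))   # nor inside an nmtoken
```

A space is in neither character class, so string + space + combining + string always splits into separate tokens, however the space got there.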

You've just discovered a way to disguise one document that isn't well-formed
as a different document that isn't well-formed. l33t!

> So this would invalidate existing documents, or create holes allowing
> insertion of arbitrary XML content, if the XML application is not
> validating extremely strictly the element names (the pair namespace+
> name) and exclude completely from processing any unrecognized
> element (including all its content and attributes).

This argument is not on friendly terms with the concept of causality.

> This would be a
> breach in the content model which may have been validated and tested
> for security in another layer of the document encoding process (notably
> when XML documents are created from templates, such as XSL
> processors, or custom C source using simple template substitution).

Testing validity without testing well-formedness is not possible.

> So for me the sequence SPACE+combining should not be acceptable
> as a valid grapheme cluster within element names or attribute names,

As it already isn't.

> and thus would need to be excluded from NMTOKEN. The correct
> way to do it is to consider it NOT A LETTER, but a symbol (Sk),
> exactly like other spacing diacritics, which are already invalid in
> NMTOKEN.

Wait a second. That was my justification for why the fact that
space+combining is ALREADY prohibited from NMTOKEN shouldn't be considered a
failure on the part of XML to allow for freedom of choice with the strings
used for NMTOKENs. Now you actually want to introduce this (already
existent) feature.

> There still remains the unresolved question of grapheme clusters
> that could span the starting "<" or ending ">" or "/>" of tags, or
> the leading "&" of an entity reference.

No, there isn't. What goes before <, >, / or & isn't a problem, since those
are all non-combining characters and so start a new unit for any sort of
processing that treats more than one codepoint as a unit. What goes after < or
& has to be a name (not an nmtoken) and as such is already prohibited from
beginning with a combiner. What goes after > is already dealt with by Charmod,
and even if you ignore Charmod, apart from the possibility of normalisation
turning the sequence U+003E, U+0338 into U+226F (a possibility that is well
noted), it still isn't going to hurt.
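
The normalisation case is easy to reproduce. This is a small sketch using Python's standard `unicodedata` module (the choice of Python and the variable names are mine, not anything from the thread); it prints the code points before and after NFC for a greater-than or less-than sign followed by U+0338 COMBINING LONG SOLIDUS OVERLAY.

```python
import unicodedata


def codepoints(s: str) -> list[str]:
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]


gt_overlay = ">\u0338"  # GREATER-THAN SIGN + COMBINING LONG SOLIDUS OVERLAY
lt_overlay = "<\u0338"  # LESS-THAN SIGN + COMBINING LONG SOLIDUS OVERLAY

for seq in (gt_overlay, lt_overlay):
    composed = unicodedata.normalize("NFC", seq)
    # Each two-codepoint sequence composes to a single precomposed character:
    # NOT GREATER-THAN (U+226F) and NOT LESS-THAN (U+226E) respectively.
    print(codepoints(seq), "->", codepoints(composed))
```

So a normaliser run over raw markup can swallow a tag delimiter into a precomposed character, which is the possibility referred to above.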


This archive was generated by hypermail 2.1.5 : Thu Aug 14 2003 - 09:29:15 EDT