Re: Questions on ZWNBS - for line initial holam plus alef

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Aug 14 2003 - 10:01:28 EDT

In reply to: Jon Hanna: "RE: Questions on ZWNBS - for line initial holam plus alef"

----- Original Message -----
From: "Jon Hanna" <jon@spin.ie>
To: <unicode@unicode.org>
Sent: Thursday, August 14, 2003 1:49 PM
Subject: RE: Questions on ZWNBS - for line initial holam plus alef

> > I do agree: an XML document could require the use at some place of a
> > given attribute or element. If this attribute name follows the element
> > name after a line break, which gets changed into a space during parsing,
> > forcing XML parsers to treat SPACE+combining as an unbreakable grapheme
> > cluster acting like a letter would have the effect of creating a new
> > element name which may violate the element name identity. Now suppose
> > that the attribute name contains a colon, you have created a custom
> > namespace name, under which you can add any element you like, even if
> > this was forbidden by the content-model of the reference schema.
>
> 1. SPACE is treated "blindly" as a SPACE by XML. String + space +
> combining + string would not be treated as a single token, no matter how
> that space was introduced. That's what you were complaining about in the
> first place (as far as I can make out).
> 2. While nmtokens can begin with a combining character names cannot, nor
> can they contain spaces.
> 3. This would in no way change the content-model. So even if the above two
> points didn't hold they would only sneak the document past something which
> performed validation before parsing(!), and where the content-model was
> already pretty loose (so it didn't complain about the unrecognised
> attribute).
>
> You've just discovered a way to disguise one document that isn't
> well-formed as a different document that isn't well-formed. l33t!
>
> > So this would invalidate existing documents, or create holes allowing
> > insertion of arbitrary XML content, if the XML application is not
> > validating extremely strictly the element names (the pair namespace+
> > name) and exclude completely from processing any unrecognized
> > element (including all its content and attributes).
>
> This argument is not on friendly terms with the concept of causality.
>
> > This would be a
> > breach in the content model which may have been validated and tested
> > for security in another layer of the document encoding process (notably
> > when XML documents are created from templates, such as XSL
> > processors, or custom C source using simple template substitution).
>
> Testing validity without testing well-formedness is not possible.
>
> > So for me the sequence SPACE+combining should not be acceptable
> > as a valid grapheme cluster within element names or attribute names,
>
> As it already isn't.
>
> > and thus would need to be excluded from NMTOKEN. The correct
> > way to do it is to consider it NOT A LETTER, but a symbol (Sk),
> > exactly like other spacing diacritics, which are already invalid in
> > NMTOKEN.
>
> Wait a second. That was my justification for why the fact that
> space+combining is ALREADY prohibited from NMTOKEN shouldn't be considered
> a failure on the part of XML to allow for freedom of choice with the
> strings used for NMTOKENs. Now you actually want to introduce this
> (already existent) feature.
>
> > There still remains the unresolved question of grapheme clusters
> > that could span the starting "<" or ending ">" or "/>" of tags, or
> > the leading "&" of an entity reference.
>
> No there isn't. What goes before <, >, / or & isn't a problem since those
> are all non-combining characters and a new unit for any sort of processing
> treating more than one codepoint as a unit. What goes after < or & has to
> be a name (not an nmtoken) and as such is already prohibited from
> beginning with a combiner. What goes after > is already dealt with by the
> Charmod, and even if you ignore charmod apart from the possibility of
> normalisation turning the sequence U+003E, U+0338 into U+226E (a
> possibility that is well noted) it still isn't going to hurt.

One note: in Unicode, grapheme clusters (considered unbreakable) are more
than just combining sequences! Look at CGJ, WJ, ZWJ, ...
So what comes after or *before* a base character may affect how grapheme
clusters are parsed!
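
As a small illustration of why a test for "combining characters" alone does
not cover these characters, the sketch below prints the general category and
canonical combining class that Python's unicodedata module reports for them.
The helper name profile is invented for the example, and the values depend on
the Unicode version shipped with the interpreter.

```python
import unicodedata


def profile(ch: str) -> tuple[str, int]:
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


samples = [
    ("COMBINING ACUTE ACCENT", "\u0301"),
    ("CGJ (COMBINING GRAPHEME JOINER)", "\u034f"),
    ("WJ (WORD JOINER)", "\u2060"),
    ("ZWJ (ZERO WIDTH JOINER)", "\u200d"),
]

for label, ch in samples:
    category, ccc = profile(ch)
    print(f"U+{ord(ch):04X} {label}: category={category}, ccc={ccc}")
```

Only the acute accent has a non-zero combining class; WJ and ZWJ are format
characters (Cf), so code that looks only at combining classes or at the
mark categories would treat them as ordinary, separable characters.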

As the well-formedness of an XML document comes even before its validity
(which is optional, but required in some applications that need to work on
the DOM tree or InfoSet), this affects the way Unicode can be used (read
"embedded") within XML. Depending on where the encoded text is used
(NMTOKENs, text elements, attribute values, ...) the embedding constraints
will differ, but in my opinion anonymous text elements and attribute values
should both have the same encoding capabilities, as both can (or should be
able to) represent any kind of valid Unicode plain text.

As SPACE is handled differently in attribute values, this causes a
problem for SPACE+NSM (considered valid but with imprecise properties
for now).
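
To make the difference concrete, here is a minimal sketch that puts the same
string (a line break followed by a non-spacing mark) into an attribute value
and into element content, then compares what the parser returns. The element
name "e" and attribute name "a" are hypothetical, and Python's standard
parser is assumed to behave like any conforming XML 1.0 parser here.

```python
import xml.etree.ElementTree as ET

# "x", a literal line break, U+0301 (COMBINING ACUTE ACCENT), then "y",
# once inside an attribute value and once as element content.
source = '<e a="x\n\u0301y">x\n\u0301y</e>'

root = ET.fromstring(source)
attr_value = root.get("a")
text_value = root.text

# Attribute-value normalization replaces the line break with SPACE,
# which yields SPACE + NSM; element content keeps the line break.
print(repr(attr_value))
print(repr(text_value))
```

So the SPACE+NSM sequence can appear in the parsed attribute value even
though the author never typed a space in front of the mark.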

The constraints are less severe in anonymous text elements, as there
exist several techniques (including CDATA sections) to represent them.
In fact, XML will consider each text element or attribute value as an
independent text, which should be independently legal for Unicode,
but embedding them in a well-formed and validated XML document may
create a document that is now ill-formed, or that is parsed incorrectly
by the Unicode analysis of grapheme clusters.

My opinion is that there's an apparent conflict here which can
prohibit using a fully conforming Unicode text editor for XML
documents, only because of grapheme cluster boundaries, input
methods, etc. When editing an XML document, the XML syntax
MUST be well-formed first, and Unicode well-formedness only
comes second.

If there are places where both constraints cannot be made compatible
with each other, either one can use a replacement encoding technique for
Unicode, or XML must provide an escaping mechanism to allow
safe embedding of Unicode plain-text strings into the XML document,
without needing any change in the generated InfoSet or DOM tree.
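
One existing candidate for such an escaping mechanism in text content is the
numeric character reference. The sketch below is only an illustration under
that assumption: the helper escape_leading_mark is invented here, it handles
just a single leading mark, and it does not escape "&" or "<" as a real
serializer would have to.

```python
import unicodedata
import xml.etree.ElementTree as ET


def escape_leading_mark(text: str) -> str:
    """Write a leading combining mark (category M*) as a character reference."""
    if text and unicodedata.category(text[0]).startswith("M"):
        return f"&#x{ord(text[0]):X};{text[1:]}"
    return text


content = "\u0338text"  # plain text that begins with a combining mark

raw_doc = f"<a>{content}</a>"
escaped_doc = f"<a>{escape_leading_mark(content)}</a>"

# Both serializations give the parser the same text content ...
same_text = ET.fromstring(raw_doc).text == ET.fromstring(escaped_doc).text

# ... but only the escaped one is left untouched by NFC normalization,
# because the mark no longer sits next to the '>' in the source text.
escaped_is_stable = unicodedata.normalize("NFC", escaped_doc) == escaped_doc
raw_is_stable = unicodedata.normalize("NFC", raw_doc) == raw_doc

print(escaped_doc, same_text, escaped_is_stable, raw_is_stable)
```

This covers text content and attribute values only; character references
are not available inside element or attribute names.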

If such escaping is impossible, then there's a hole in the XML syntax
that will have one consequence: authors will start either violating the
well-formedness of XML documents (using implementation-dependent
tolerance "features"), or using "defective" or even "illegal" Unicode
sequences (which will appear legal to an XML parser that considers
XML well-formedness an absolute requirement for security). In both
cases, the XML documents will no longer be editable with plain-text
editors, but only with specific XML editors that don't parse and manage
the document as a plain-text file in the first place.
