From: Philippe Verdy (firstname.lastname@example.org)
Date: Thu Dec 09 2004 - 09:01:24 CST
From: "Marcin 'Qrczak' Kowalczyk" <email@example.com>
> Ok, so it's the conversion from raw text to escaped character
> references which should treat combining characters specially.
> What about < with combining acute, which doesn't have a precomposed
> form? A broken opening tag or a valid text character?
That is also a broken opening tag in HTML/XML documents. These are NOT
plain-text documents: they must first be parsed as HTML/XML, and only then
are the many text sections they contain (text elements, element names,
attribute names, attribute values, etc.) parsed as plain text, under the
restrictions specified in the HTML or XML specifications (which restrict,
for example, which characters are allowed in names).
The XML/HTML core syntax is defined with fixed behavior of some individual
characters like '&', '<', quotation marks, and with special behavior for
spaces. This core structure is not plain text, and cannot be overridden, even
by Unicode grapheme clusters.
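
To make this concrete, here is a minimal Python sketch. It assumes the
standard library's expat-based xml.etree.ElementTree parser as a stand-in
for any conforming XML parser; the helper name is_well_formed is mine, for
illustration only.

```python
import xml.etree.ElementTree as ET

COMBINING_ACUTE = "\u0301"


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A raw '<' followed by a combining acute: the parser sees the start of a
# tag, and the combining mark cannot begin a name, so parsing fails.
raw = "<p><" + COMBINING_ACUTE + "</p>"

# The same text with '<' escaped: no markup character is left, and the
# combining mark is ordinary character data.
escaped = "<p>&lt;" + COMBINING_ACUTE + "</p>"

print(is_well_formed(raw))
print(is_well_formed(escaped))
print(ET.fromstring(escaped).text == "<" + COMBINING_ACUTE)
```

The parser never asks whether '<' plus the combining mark forms one grapheme
cluster; it only sees the syntax character.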
Note that HTML/XML do NOT mandate the use, or even the support, of Unicode.
They require only the support of a character repertoire that contains some
required characters, and the acceptance of at least the ISO/IEC 10646
repertoire under some conditions. The encoding to code points itself is
required only for numeric character references, which are more
symbolic, in a way similar to other named character entities in SGML, than
absolute as implying the required support of the repertoire with a single
So you can as well create fully conforming HTML or XML documents using a
character set which includes characters not even defined in Unicode/ISO/IEC
10646, or characters defined only symbolically with just a name. Whether this
name maps or not to one or more Unicode characters does not change the
validity of the document itself.
And all the XML/HTML behavior ignores almost all Unicode properties
(including normalization properties, because XML and HTML treat different
strings, which are still canonically equivalent, as completely distinct; an
important feature for cases like XML Signatures, where normalization of
documents should not be applied blindly as it would break the data
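
A small Python sketch of that distinction follows. It is an illustration
under assumptions, not a real XML Signature: a plain SHA-256 over the UTF-8
bytes stands in for the digest, which a real signature would compute over a
canonicalized form of the document. The sample strings and the helper name
digest are mine.

```python
import hashlib
import unicodedata
import xml.etree.ElementTree as ET

# Two canonically equivalent spellings of the same visible text.
precomposed = "<name>caf\u00e9</name>"   # U+00E9
decomposed = "<name>cafe\u0301</name>"   # U+0065 U+0301

text_a = ET.fromstring(precomposed).text
text_b = ET.fromstring(decomposed).text

# The parser hands back the characters exactly as written, so the two
# strings compare as different.
same_after_parsing = text_a == text_b

# Only an explicit normalization step makes them equal.
same_after_nfc = (
    unicodedata.normalize("NFC", text_a) == unicodedata.normalize("NFC", text_b)
)


def digest(document: str) -> str:
    """Stand-in for a signature digest: SHA-256 over the UTF-8 bytes."""
    return hashlib.sha256(document.encode("utf-8")).hexdigest()


same_digest = digest(precomposed) == digest(decomposed)

print(same_after_parsing, same_after_nfc, same_digest)
```

Normalizing a signed document would therefore change the bytes the digest
was computed over.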
If you want to normalize XML documents, you should not do it with a
normalizer working on the whole document as if it was plain-text. Instead
you must normalize the individual strings that are in the XML InfoSet, as
accessible when browsing the nodes of its DOM tree, and then you can
serialize the normalized tree to create a new document (using CDATA sections
and/or character references, if needed to escape some syntactic characters
reserved by XML that would be present in the string data of DOM tree nodes).
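
The procedure above can be sketched in Python as follows. This is a minimal
illustration, assuming xml.etree.ElementTree as the tree API (it is not a
full InfoSet or DOM implementation) and NFC as the target form; the function
name normalize_tree and the sample document are mine.

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalize_tree(root: ET.Element, form: str = "NFC") -> None:
    """Normalize text, tail and attribute values of every element in place."""
    for element in root.iter():
        if element.text is not None:
            element.text = unicodedata.normalize(form, element.text)
        if element.tail is not None:
            element.tail = unicodedata.normalize(form, element.tail)
        for key, value in list(element.attrib.items()):
            element.attrib[key] = unicodedata.normalize(form, value)


# Character data that starts with U+0338 COMBINING LONG SOLIDUS OVERLAY,
# immediately after the '>' that closes the start tag.
source = "<p>\u0338 cafe\u0301</p>"

# Whole-document normalization composes '>' + U+0338 into U+226F
# (NOT GREATER-THAN), which destroys the start tag.
blind = unicodedata.normalize("NFC", source)
try:
    ET.fromstring(blind)
    blind_is_well_formed = True
except ET.ParseError:
    blind_is_well_formed = False

# Normalizing the parsed strings leaves the markup alone.
root = ET.fromstring(source)
normalize_tree(root)

# The default serialization is ASCII, so non-ASCII characters are written
# as numeric character references.
serialized = ET.tostring(root)

print(blind_is_well_formed)
print(serialized)
```

Element and attribute names are left untouched here; whether to normalize
them too is a separate decision, since doing so changes the vocabulary of
the document.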
Note also that an XML document containing references to Unicode
non-characters would still be well-formed, because these characters may be
part of a non-Unicode charset.
XML document validation is a separate and optional problem from XML parsing
which checks well-formedness and builds a DOM tree: validation is only
performed when matching the DOM tree according to a schema definition, DTD
or XSD, in which additional restrictions on allowed characters may be
checked, or in which additional symbolic-only "characters" may be defined
and used in the XML document with parsable named entities similar to:
(An example: the schema may contain a definition for a "character"
representing a private company logo, mapped to a symbolic name. The XML
document can contain such references, but the DTD may also define an
encoding for it in a private charset, so that the XML document will directly
use that code. The Apple logo in Macintosh charsets is an example: an
internal mapping to Unicode PUAs is not sufficient to allow correct
processing of multiple XML documents, because the PUAs used in each XML
document have no equivalence. The conversion of such documents to Unicode
with these PUAs is a lossy conversion, not suitable for XML data processing.)
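
As a rough illustration of a symbolic-only "character", here is a Python
sketch using an internal DTD subset. The entity name companylogo, its
replacement text and the document itself are hypothetical, chosen by me;
xml.etree.ElementTree is assumed as the parser, and it does not validate.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares a named entity standing for the logo.
WITH_DECLARATION = (
    '<!DOCTYPE doc [<!ENTITY companylogo "[company logo]">]>'
    "<doc>Made by &companylogo;</doc>"
)

# The same reference with no declaration anywhere in the document.
WITHOUT_DECLARATION = "<doc>Made by &companylogo;</doc>"

# With the declaration, the parser expands the reference while building
# the tree; no Unicode code point for the logo is involved.
expanded_text = ET.fromstring(WITH_DECLARATION).text

# Without it, the reference cannot be resolved and parsing fails.
try:
    ET.fromstring(WITHOUT_DECLARATION)
    undeclared_is_accepted = True
except ET.ParseError:
    undeclared_is_accepted = False

print(expanded_text)
print(undeclared_is_accepted)
```

The replacement text here is just a placeholder string; in the scenario
described above it could equally be a code in a private charset.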