From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Aug 12 2003 - 12:00:43 EDT
From: "Jon Hanna" <jon@spin.ie>
> > If this is different, then it is not XML but a derived language
> > (for example HTML or SGML which are using more "relaxed" syntaxes).
>
> XML is derived from SGML, not the other way around. Still doesn't
> matter.
I did not say that, although my sentence may have given that impression.
Of course XML was built on the foundations of SGML and its HTML
application, but it now contains enough differences that it can no
longer be considered an application of SGML: it is both a subset and a
superset of SGML (XML allows things that are forbidden in SGML, and
forbids things that are completely valid in SGML).
Additionally, the DTD syntax profile used in XML is very limited
compared with SGML, and even this DTD syntax is not enough to represent
in SGML such XML features as namespaces (in XML, namespace prefixes can
be freely substituted without requiring a new DTD, and are resolved as
URIs instead of being part of the element or attribute names). Naming
conventions in XML are based on two orthogonal dimensions (the namespace
URI and the local name), unlike HTML and SGML, which use just a single
namespace.
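The point about freely substitutable prefixes can be illustrated with a
minimal sketch, assuming Python's standard xml.etree.ElementTree module
(a non-validating parser). The namespace URI and the element and
attribute names are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for this illustration.
NS = "http://example.org/ns/demo"

# Two documents that differ only in the prefix bound to the same URI.
doc_a = f'<a:item xmlns:a="{NS}" a:kind="x"/>'
doc_b = f'<b:item xmlns:b="{NS}" b:kind="x"/>'

root_a = ET.fromstring(doc_a)
root_b = ET.fromstring(doc_b)

# ElementTree exposes expanded names in "{uri}local" form, so the
# prefix chosen by the author does not appear in the parsed names.
print(root_a.tag)                      # {http://example.org/ns/demo}item
print(root_a.tag == root_b.tag)        # True
print(root_a.attrib == root_b.attrib)  # True
```

A DTD, by contrast, would declare "a:item" and "b:item" as two unrelated
element names.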
Finally, DTDs are being deprecated in XML, because they cannot correctly
represent the semantics of allowed attributes, or even the allowed
content models, of a schema. So an XML document may validate against a
DTD yet fail against a more precise XSD schema: nearly all the DTDs I
have seen for XML, HTML and SGML contain important comments that cannot
be represented in a parsable way.
OK, I used the term DOM instead of InfoSet, but what I said was a
"DOM-like" data representation (meaning the InfoSet, if that is what is
used to represent the document). I won't discuss the case of element
names or attribute names, which are by nature constrained by XML
datatypes and do not represent arbitrary Unicode text. But CDATA
sections, attribute values (in non-validating parsers), and anonymous
text elements are where the handling of leading and trailing whitespace,
as well as of whitespace sequences, causes problems. This is clearly NOT
markup but plain text data, which may or may not be constrained by
datatype facets, without even the need to specify a special xml:space
attribute in the markup of the document itself.
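As a small illustration of how differently attribute values and text
content are treated even without validation, here is a sketch assuming
Python's xml.etree.ElementTree (backed by the non-validating expat
parser). The element and attribute names are made up for the example.
It relies on the attribute-value normalization rule of XML 1.0, which
replaces each literal TAB, LF or CR in an attribute value with a SPACE.

```python
import xml.etree.ElementTree as ET

# The same characters, including a literal TAB and LINE FEED, appear
# once in an attribute value and once in the element content.
doc = '<p title=" a\tb\nc "> a\tb\nc </p>'
elem = ET.fromstring(doc)

attr_value = elem.get("title")
text_value = elem.text

# The attribute value has had TAB and LF replaced by SPACE by the
# parser; the text content is delivered unchanged.
print(repr(attr_value))  # ' a b c '
print(repr(text_value))  # ' a\tb\nc '
```

With no declaration for the attribute it is treated as CDATA, so leading
and trailing spaces survive here; a tokenized type declared in a DTD, or
a collapsing facet in a schema, would go further.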
As validation against a schema is optional when processing an XML
document, normalization of whitespace sequences occurs only if the
schema is known. In the case of standardized schemas, like XHTML, it
becomes mandatory, and there is no way to bypass this rule, as any
client may assume and load the corresponding schema, then preprocess the
DOM-like data of the parsed document to create data that no longer
exposes unnormalized whitespace. So the behavior of spaces must be
anticipated by authors, who cannot predict whether the XML parser will
validate the parsed document. It is clearly not a rendering issue in
fonts, XSLT processors or stylesheets. I see absolutely no way for an
XML author to create a valid instance of an XML schema that will work
with all parsers if the author wants to use SPACE+diacritic sequences in
the document. The only safe way to bypass this behavior is to use
unparsed entities to represent the leading SPACE, or the whole combining
sequence.
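To make the SPACE+diacritic problem concrete, here is a rough sketch in
Python. The collapse() helper is hypothetical: it only approximates the
whiteSpace="collapse" facet of XSD (replace TAB, LF and CR with SPACE,
merge runs of SPACE, strip both ends) and is not taken from any schema
processor.

```python
import re
import unicodedata

# The four XML whitespace characters: SPACE, TAB, LF, CR.
XML_WS = re.compile(r"[ \t\n\r]+")

def collapse(value: str) -> str:
    """Approximate the XSD whiteSpace="collapse" facet (illustrative)."""
    return XML_WS.sub(" ", value).strip(" \t\n\r")

# SPACE followed by U+0301 COMBINING ACUTE ACCENT: the usual way to
# show the diacritic in isolation, with the space as its base.
spacing_acute = " \u0301"
collapsed = collapse(spacing_acute)

print(len(spacing_acute))  # 2
print(len(collapsed))      # 1: the base SPACE has been stripped
# A non-zero combining class confirms that what remains is a bare
# combining mark with no base character in front of it.
print(unicodedata.combining(collapsed[0]))  # 230
```

After such a step the mark would attach to whatever precedes the value
in the consuming application, which is exactly what an author cannot
control.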
It is really a shame that there is no "XML-safe" base character in
Unicode to represent leading spacing diacritics in actual documents
(in HTML, XML or SGML, or in other rich-text formats, including TeX,
RTF, proprietary text formats like MS-Doc, or PDF, which already can
and do use Unicode as their preferred encoding). Ignoring the huge
number of applications that assign this role to spaces is therefore a
critical pitfall, as such rules cannot be changed easily.