Re: Questions on ZWNBS - for line initial holam plus alef

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Aug 12 2003 - 12:00:43 EDT


    From: "Jon Hanna" <jon@spin.ie>

    > > If this is different, then it is not XML but a derived language
    > > (for example HTML or SGML which are using more "relaxed" syntaxes).
    >
    > XML is derived from SGML, not the other way around. Still doesn't
    > matter.

    I did not say that, although the sentence may have let you think so. Of
    course XML grew out of SGML and its HTML application, but it now
    contains enough differences that it can no longer be considered an
    application of SGML: it is both a subset and a superset of SGML (XML
    allows things forbidden in SGML, and forbids things that are completely
    valid in SGML).

    Additionally, the DTD syntax profile used in XML is very limited
    compared to SGML, and even this DTD syntax is not enough to represent
    in SGML such XML features as namespaces (in XML, namespace prefixes can
    be freely substituted without requiring a new DTD, and are resolved as
    URIs instead of being part of the element or attribute names). Naming
    conventions in XML are based on two orthogonal dimensions, unlike in
    HTML and SGML, which just use a single namespace.
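    To illustrate the point about prefixes, here is a small sketch of mine
    in Python (standard library only). The namespace URI and the element
    and attribute names are made up for the example. It only shows that
    two documents differing in prefix alone expose the same URI-qualified
    names to the application, at least as this particular parser reports
    them.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for this illustration.
NS = "http://example.org/ns/demo"

# Same structure, same namespace URI, different prefixes ("a" vs "zz").
doc_a = f'<a:root xmlns:a="{NS}"><a:item a:kind="x"/></a:root>'
doc_b = f'<zz:root xmlns:zz="{NS}"><zz:item zz:kind="x"/></zz:root>'


def expanded_names(xml_text):
    """Return (tag, sorted attribute names) for each element in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, sorted(el.attrib)) for el in root.iter()]


names_a = expanded_names(doc_a)
names_b = expanded_names(doc_b)

# ElementTree reports names as "{uri}local", so the prefix is gone.
print(names_a)
print(names_a == names_b)
```

    The prefix is only a local abbreviation; the comparison is made on the
    URI plus the local name, which is the second naming dimension a plain
    DTD cannot express.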

    Finally, DTDs are being deprecated in XML, because they cannot
    correctly represent the semantics of allowed attributes, or even the
    allowed content models of schemas. So an XML document may validate
    against a DTD when it would not validate if the schema were defined more
    precisely with an XSD schema: nearly all DTDs I have seen for XML, HTML
    and SGML contain important comments that cannot be represented in a
    parsable way.

    OK, I used the term DOM instead of InfoSet, but what I said was a
    "DOM-like" data representation (meaning InfoSet, if this is what is
    used to represent the document). I won't discuss the case of element
    names or attribute names, which are by essence constrained by XML
    datatypes and do not represent arbitrary Unicode text. But CDATA
    sections, attribute values (in non-validating parsers), and anonymous
    text elements are where the handling of initial/final whitespace, as
    well as of sequences of whitespace, causes problems. This is clearly
    NOT markup but plain text data, which may or may not be constrained by
    datatype facets, without even the need to specify a special xml:space
    attribute in the markup of the document itself.
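    As a concrete case, here is a minimal sketch in Python (standard
    library, non-validating parse, no DTD or schema involved). The element
    and attribute names are invented for the example. It places
    SPACE + HOLAM + ALEF in an attribute value, in a text element and in a
    CDATA section, and reads the three values back.

```python
import xml.etree.ElementTree as ET

HOLAM = "\u05b9"  # HEBREW POINT HOLAM, a combining mark
ALEF = "\u05d0"   # HEBREW LETTER ALEF

# Hypothetical document: the same SPACE + HOLAM + ALEF sequence appears
# in an attribute value, in element text, and in a CDATA section.
xml_text = (
    f'<doc attr=" {HOLAM}{ALEF}">'
    f"<t> {HOLAM}{ALEF}</t>"
    f"<c><![CDATA[ {HOLAM}{ALEF}]]></c>"
    "</doc>"
)

root = ET.fromstring(xml_text)
attr_value = root.get("attr")
text_value = root.find("t").text
cdata_value = root.find("c").text

# ascii() avoids depending on the console encoding.
for value in (attr_value, text_value, cdata_value):
    print(ascii(value))
```

    With no declarations in play, the leading SPACE is still there in all
    three places. The trouble starts only when something downstream knows
    a datatype for these values and normalizes them.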

    As validating a document against its definition is optional in XML
    processing, normalization of whitespace sequences occurs only if the
    schema is known. In the case of standardized schemas, like XHTML, it
    becomes mandatory, and there's no way to bypass this rule, as any
    client could assume and load the corresponding schema and preprocess
    the DOM-like data contained in the parsed document to create data which
    will not expose unnormalized whitespace. So the behavior of spaces must
    be assumed by authors, who cannot predict whether or not the XML parser
    will validate the parsed document. It is clearly not a rendering issue
    in fonts or XSLT processors or stylesheets. I see absolutely no place
    where an XML author can create a valid XML schema instance that will
    work with parsers if the author wants to use SPACE+diacritics sequences
    in the document. The only way to bypass this behavior safely is to use
    unparsed entities to represent the leading SPACE, or the whole
    combining sequence.
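    The effect can be simulated in a few lines of Python. This is only an
    approximation of what a schema-aware processor might do when a value
    has a datatype whose whitespace facet is "collapse"; the function
    names xsd_collapse and starts_with_combining_mark are mine, not part
    of any library.

```python
import re
import unicodedata


def xsd_collapse(value):
    """Approximate whiteSpace="collapse": map TAB/LF/CR to SPACE,
    squeeze runs of SPACE, then trim leading and trailing SPACE."""
    replaced = re.sub(r"[\t\n\r]", " ", value)
    return re.sub(r" +", " ", replaced).strip(" ")


def starts_with_combining_mark(value):
    """True if the first code point has a general category M* (a mark)."""
    return bool(value) and unicodedata.category(value[0]).startswith("M")


original = " \u05b9\u05d0"  # SPACE + HOLAM + ALEF
collapsed = xsd_collapse(original)

print(ascii(original), starts_with_combining_mark(original))
print(ascii(collapsed), starts_with_combining_mark(collapsed))
```

    After collapsing, the value begins with the combining mark itself: the
    SPACE that served as its base character has been removed, which is
    exactly the loss an author cannot guard against in the document text.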

    It is really a shame that there is no "XML-safe" base character in
    Unicode to represent leading spacing diacritics in actual documents
    (either in HTML, XML, SGML, or even in other rich-text formats,
    including TeX, RTF, or proprietary text formats like MS-Doc, or PDF,
    which already can and do use Unicode as their now preferred encoding).
    Ignoring the very large number of applications that assign this role
    to spaces is therefore a critical mistake, as such rules cannot be
    changed easily.


