Re: XML and Unicode interoperability comes before HTML or even SGML

From: Philippe Verdy
Date: Sun Aug 15 2004 - 10:22:20 CDT


    From: "Doug Ewell"
    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    > > Shamely,
    > I wish I knew which real English word you mean by this. "Shamefully"?
    > "Sadly"? "Unfortunately"? "Embarrassingly"?

    I know that I use this word instead of "unfortunately". I don't know where I
    learnt it, but I use it frequently...

    > > the idea of "block-level" and "inline" elements is specific to HTML,
    > > but HTML today is an application of XML, and the problem must be
    > > solved at the XML level.
    > HTML is not an application of XML. HTML and XML are both applications
    > of SGML. XHTML, which I use and recommend, is an application of HTML
    > *to* XML.

    You did not need to specify this. I said "TODAY", meaning the *current*
    standard version of HTML, which is now XHTML, i.e. really an application of
    XML (the legacy syntax with unclosed elements and unquoted attribute values,
    allowed in HTML and SGML, is being deprecated, as it is forbidden in XML)...

    What I mean here is that a solution for disambiguating the grapheme cluster
    boundaries that collide, during normalization, with the ?ML lexical
    analysis, if it works with the restricted XML syntax, will then also work
    with XHTML, HTML4 or lower, or even with SGML, which is the ancestor of the
    others. It's a place where the W3C (for XML, XHTML and HTML4 or lower) and
    the SGML consortium can make recommendations.
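
    As an illustration of the kind of collision meant here, the following
    Python sketch normalizes a small piece of markup whose text content starts
    with a combining character. The sample strings and the helper name `nfc`
    are assumptions made for this example only; the point is that NFC can
    compose the closing ">" of a tag with the mark that follows it, while a
    numeric character reference (NCR) keeps the delimiter intact.

```python
import unicodedata

def nfc(text: str) -> str:
    """Apply Unicode Normalization Form C to a whole document string."""
    return unicodedata.normalize("NFC", text)

# Hypothetical sample: the text content begins with U+0338
# (COMBINING LONG SOLIDUS OVERLAY), directly after the '>' of the start tag.
raw = "<b>\u0338x</b>"

# Same content, but with the combining character written as an NCR.
escaped = "<b>&#x338;x</b>"

broken = nfc(raw)    # '>' + U+0338 composes to U+226F (NOT GREATER-THAN)
safe = nfc(escaped)  # pure ASCII, so normalization leaves it unchanged

print(raw.count(">"), broken.count(">"))  # tag delimiters before and after NFC
print(safe == escaped)
```

    After normalization the start tag in `broken` no longer has its closing
    delimiter, so a lexical analyzer run afterwards sees different markup. The
    NCR form passes through normalization untouched.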

    Of course there's Unicode Technical Report #20, which discusses the case
    of XML. For Unicode, it is only informative; the most important thing is
    that this document was co-signed by the W3C on 13 June 2003, and so is now
    an appropriate (but incomplete) response from the W3C to this problem.

    UTR#20 does not completely cover the subject, as it still says nothing about
    the change in Unicode 4.0.1 related to the use of ZW(N)J in rule D17 and

    Maybe Martin Dürst of the W3C should look closely at the effect of D17
    and at whether UTR#20 should be updated...

    I don't know if there's a similar recommendation from the SGML

    Similar problems may also exist in other languages or protocols that use
    Unicode and that are now possibly exposed to this change, which may break
    their existing syntax. In some of these cases, a solution with NCRs will
    not be so easy to find, and these other protocols or languages using Unicode
    may need to apply further restrictions on what they consider "valid
    Unicode strings", or may simply choose NOT to apply the D17 change (so that
    a string containing only a ZW(N)J character will still be valid and won't
    collide with the language syntax).
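
    The two options above (escaping with NCRs, or restricting what counts as a
    valid string) can be sketched in Python as follows. This is only a minimal
    sketch under stated assumptions: the function names are hypothetical, and
    treating ZWJ and ZWNJ like combining marks reflects the reading of the D17
    change discussed in this message, not the text of any specification.

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"

def attaches_to_previous(ch: str) -> bool:
    """True for combining marks and, by assumption here, for ZWJ/ZWNJ."""
    return unicodedata.category(ch).startswith("M") or ch in (ZWNJ, ZWJ)

def escape_leading(text: str) -> str:
    """Rewrite every leading attaching character as an NCR (&#x...;)."""
    i = 0
    while i < len(text) and attaches_to_previous(text[i]):
        i += 1
    return "".join(f"&#x{ord(c):X};" for c in text[:i]) + text[i:]

def is_valid_content(text: str) -> bool:
    """Stricter alternative: reject content that starts with such a character."""
    return not (text and attaches_to_previous(text[0]))

print(escape_leading("\u200dabc"))
print(is_valid_content("\u200c"), is_valid_content("a\u200c"))
```

    A format with an escape mechanism can use something like `escape_leading`
    before writing text after a syntax character. A format without one is left
    with a check like `is_valid_content`, or with not applying the change.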

    This archive was generated by hypermail 2.1.5 : Sun Aug 15 2004 - 10:25:52 CDT