Re: XML and Unicode interoperability comes before HTML or even SGML (was: Combining across markup?)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Aug 14 2004 - 05:35:28 CDT


    From: "Anto'nio Martins-Tuva'lkin" <antonio@tuvalkin.web.pt>
    > On 2004.08.11, 18:58, Mike Ayers <mike.ayers@tumbleweed.com> wrote:
    >
    > > Better yet, have a generic mechanism which allows you to build
    >
    > Even better yet: Have the W3C rephrase their demand that no element
    > should start with a defective sequence (when considered separately)
    > as that no *block-level* element should etc., and leave things like
    > <span>, <i> and other in-line elements free to start with a combining
    > character (provided that the said in-line container is not the first
    > within a block-level element, of course).

    Unfortunately, the idea of "block-level" and "inline" elements is specific
    to HTML, but HTML today (as XHTML) is an application of XML, and the
    problem must be solved at the XML level.
    The only safe way to solve it at the XML level is to make the use of NCRs
    (numeric character references) or named character references highly
    recommended (if not mandatory) for the first character of a defective
    combining sequence. A solution based on a closed list of exceptions will
    not work as the Unicode/ISO/IEC 10646 repertoire of combining characters
    evolves.

    This is simply because, in both Unicode and ISO/IEC 10646, the character
    model specifies that ANY base character forms a combining
    character sequence with ANY following combining character or ZW(N)J
    character.
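
    As a rough illustration of the NCR approach suggested above, the sketch
    below rewrites the first character of a text run as a hexadecimal NCR
    when it is a combining character. The function names are hypothetical,
    the text is assumed to be already XML-escaped, and treating ZWNJ/ZWJ
    like combining marks is an assumption that follows my reading of D17:

```python
import unicodedata

# Assumption: ZWNJ (U+200C) and ZWJ (U+200D) are handled like combining marks.
ZW_JOINERS = {"\u200c", "\u200d"}


def is_combining(ch: str) -> bool:
    """True for general categories Mn, Mc, Me, plus ZWNJ/ZWJ."""
    return unicodedata.category(ch).startswith("M") or ch in ZW_JOINERS


def escape_leading_combining(text: str) -> str:
    """Replace a leading combining character with a hexadecimal NCR."""
    if text and is_combining(text[0]):
        return f"&#x{ord(text[0]):X};{text[1:]}"
    return text
```

    The answer depends on the Unicode version of the character database in
    use, which is exactly why a closed list of exceptions cannot work.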

    The change to D17 in Unicode has also extended the risk of conflicts,
    because ZW(N)J can now *also* combine with and follow:

      - a quote mark (used to delimit the start and end of an attribute value),
      - the second minus/hyphen of an XML comment's leading mark,
      - the second opening square bracket of a CDATA section's leading mark.

    The D17 change in Unicode is already creating another interoperability
    problem with XML, because in the past ZW(N)J were treated only as
    combining sequence *starters* (now they can also occur in the middle
    of a non-defective combining sequence, or at the end of a defective
    combining sequence).
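
    To make the three cases listed above concrete, here is a minimal sketch
    that scans raw XML source text for a quote mark, a comment opener or a
    CDATA opener that is immediately followed by a combining mark or ZW(N)J.
    The name risky_positions is hypothetical, and the scan is deliberately
    naive (it does not parse the XML, so it can report false positives):

```python
import re
import unicodedata

# Delimiters named in the list above: attribute quotes, "<!--", "<![CDATA[".
DELIMITERS = re.compile(r"""["']|<!--|<!\[CDATA\[""")


def risky_positions(source: str) -> list:
    """Return (offset, code point) pairs for combining characters or
    ZWNJ/ZWJ that directly follow one of the delimiters."""
    hits = []
    for match in DELIMITERS.finditer(source):
        nxt = source[match.end():match.end() + 1]
        if nxt and (unicodedata.category(nxt).startswith("M")
                    or nxt in "\u200c\u200d"):
            hits.append((match.end(), f"U+{ord(nxt):04X}"))
    return hits
```

    A generator or editor could run such a check before saving and replace
    each reported character with an NCR where the syntax allows it.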

    One suggestion for this problem is that the D17 change should be
    ignored at the XML parsing level (yes, this breaks the new Unicode
    character model, but honoring that model is not critical to preserving
    the data of the parsed document tree).

    The other suggestion, for Unicode, is to include in the stability policy
    wording stating that NO compatibility characters will be added
    with a canonical equivalence to a combining sequence ending with a
    ZW(N)J character. (I think or hope this will never happen, as this
    should never be needed for roundtrip compatibility...)

    All these cases must first be solved at the XML document syntax level,
    because these restrictions DO NOT apply at the DOM-tree level, where
    all combining sequences, defective or not, are VALID in *both*
    XML and Unicode!
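
    The difference between the two levels can be seen in a small sketch
    using Python's standard xml.etree.ElementTree as a stand-in for a DOM
    (the sample document is made up for illustration). The source text uses
    an NCR for the combining acute accent, but the parsed tree simply holds
    the character itself:

```python
import xml.etree.ElementTree as ET

# In the source text, the combining acute accent (U+0301) that opens <i> is
# written as an NCR, so no markup delimiter is directly followed by it.
source = "<p>e<i>&#x301;</i></p>"

root = ET.fromstring(source)
inline_text = root.find("i").text       # the parser has resolved the NCR
full_text = "".join(root.itertext())    # "e" followed by U+0301
```

    Here inline_text is a defective combining sequence on its own, yet the
    tree is perfectly usable, and full_text is the ordinary sequence
    <e, U+0301>.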

    Using NCRs can solve some of these problems when they occur within
    element attribute values or in the text content of elements. This won't
    work within CDATA sections (the CDATA section needs to be closed first,
    and the NCR coded outside of this section...)
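
    For the CDATA case, one possible workaround is sketched below: when the
    content would start with a combining character, that character is
    emitted as an NCR in front of the CDATA section, outside of it. The
    helper names are hypothetical, and the content is assumed not to
    contain the "]]>" terminator:

```python
import unicodedata


def is_combining(ch: str) -> bool:
    """True for general categories Mn, Mc, Me, plus ZWNJ/ZWJ (assumption)."""
    return unicodedata.category(ch).startswith("M") or ch in "\u200c\u200d"


def cdata_with_leading_ncr(content: str) -> str:
    """Serialize content as a CDATA section, moving a leading combining
    character out of the section as a hexadecimal NCR."""
    if content and is_combining(content[0]):
        ncr = f"&#x{ord(content[0]):X};"
        rest = content[1:]
        # The NCR must sit outside CDATA, where references are recognized.
        return ncr + (f"<![CDATA[{rest}]]>" if rest else "")
    return f"<![CDATA[{content}]]>"
```

    Inside a CDATA section the string "&#x301;" would be literal text, which
    is why the reference has to be placed before the section opens.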

    Once all this is solved, the recommendation will apply to
    XHTML, and can further be applied retroactively to HTML 4.01 and
    earlier.

    The only safeguard in XML against these ambiguities is that
    an XML document generator MUST NOT create a document that contains
    unassigned Unicode characters. This means that an XML document
    generator (or editor) MUST know which version of Unicode it
    works with, so that it accepts only characters that are known to be,
    or not to be, combining characters. This way, XML document
    generators and editors will know exactly how to delimit combining
    sequences, and which ones are defective or not, so that they can
    use NCRs appropriately.

    This also means that an XML editor that loads a valid XML document
    in which NCRs are present must not transform NCRs back to plain-text
    characters without knowing the character in question. An XML generator
    or editor that does not know the "combining" property of a character
    found in an input document, according to its own known version of
    Unicode, will then need to save the edited or regenerated document
    using NCRs for every character that it does not know!
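
    A minimal sketch of that saving rule, assuming the editor's "known
    version of Unicode" is the one shipped with Python's unicodedata module
    and that the text is already XML-escaped (serialize_text is a
    hypothetical name): every code point that is unassigned in that version
    (general category Cn) is written out as an NCR.

```python
import unicodedata

# The Unicode version this process knows about, e.g. for logging.
KNOWN_UNICODE_VERSION = unicodedata.unidata_version


def serialize_text(text: str) -> str:
    """Write characters unknown to this Unicode version as hexadecimal NCRs."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cn":   # unassigned in this version
            out.append(f"&#x{ord(ch):X};")
        else:
            out.append(ch)
    return "".join(out)
```

    Characters the editor does know are written as plain text, so the same
    document may serialize differently under a newer Unicode version.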

    > That would, IIUC, address W3C's "fears" and would OTOH satisfy the
    > need to mark up differently any part of a *text stream*.
    >
    > (Yes, of course CSS allows block elements to behave as in-line and
    > vice-versa, but that would be the user's responsibility...)

    You also point out here the problem that block/inline distinctions will
    NOT work, notably because these distinctions can also be changed
    dynamically using DOM scripting!

    This is especially true now with XHTML, which has become completely
    modular: a valid XHTML document can completely redefine the
    role of nearly all block or inline elements, using a custom DTD or XML
    schema definition, rendered with an XSLT transformation engine and
    a custom XSLT behavior file that binds XHTML elements to the
    browser's renderer!


