Re: Definitions

From: Peter Kirk
Date: Wed Nov 26 2003 - 10:49:31 EST

    On 26/11/2003 06:17, Philippe Verdy wrote:

    >Peter Kirk writes:
    >>Why is this a problem? Quotes and ">" with combining marks are
    >>presumably not legal HTML or XML;
    >You're wrong: it is legal in both HTML and XML. What is not specified
    >correctly is how HTML and XML parsers should behave when faced with an
    >XML or HTML document that claims to be coded with a Unicode encoding
    >scheme or any other Unicode-compatible CES (like GB18030, but not
    >completely with MacRoman, as it contains supplementary characters that
    >are not part of the Unicode/ISO/IEC 10646 repertoire).
    OK, I used the wrong words here. A sequence of a quote or ">" followed
    by combining characters is legal HTML/XML with the interpretation of a
    quote or ">" introducing a quoted string or terminating a tag, followed
    by a defective combining sequence which is part of the quoted string or
    of the text following the tag. The question is, does such a sequence
    have any other legal interpretation, within the context of an HTML/XML
    tag? If not, there is no ambiguity.
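
    To make the case concrete, here is a minimal sketch of how one might
    locate the sequences in question: a quote or ">" immediately followed
    by a combining character. The names SYNTAX_CHARS, is_combining and
    delimiters_with_marks are illustrative only, not part of any parser,
    and treating every character in general category M (Mn, Mc, Me) as
    "combining" is an assumption made for this sketch.

```python
import unicodedata

# Characters that delimit tags or quoted attribute values (assumed set).
SYNTAX_CHARS = {'"', "'", ">"}


def is_combining(ch):
    """True for nonspacing, spacing and enclosing marks (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def delimiters_with_marks(text):
    """Return the indexes of delimiters directly followed by a mark."""
    hits = []
    for i in range(len(text) - 1):
        if text[i] in SYNTAX_CHARS and is_combining(text[i + 1]):
            hits.append(i)
    return hits


# A tag closed by ">" and then U+0338 COMBINING LONG SOLIDUS OVERLAY.
sample = '<a href="x">\u0338b'
print(delimiters_with_marks(sample))
```

    Under the interpretation above, each reported position is still an
    ordinary delimiter, and the mark after it belongs to the quoted
    string or to the text following the tag.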

    > ...
    >>There could of course be
    >>problems if there were any precomposed combinations of quotes or ">"
    >>with combining characters, but I don't think there are any, are there?
    >There are such precomposed sequences in Unicode. Look in
    >NormalizationTest.txt for the places where ">" and single and double
    >quotes are used as part of a combining sequence... Notably look at
    >sequences made with the combining solidus overlay; add also the case of
    >enclosing combining characters, and of mathematical operators that can
    >be created with a combining sequence starting with ">" or "=" or single
    >or double quotes, and modified by diacritics.
    According to John Cowan there is just one such precomposed character,
    U+226F. As an HTML/XML document (the whole file, not just the parts
    between tags) is a Unicode string, the Unicode conformance rules would
    seem to mandate that an HTML/XML parser should parse U+226F exactly as
    if it were the sequence <">", U+0338>, i.e. as end of tag followed by a
    defective combining sequence. Normalisation stability implies that this
    precomposed character will always be the only such problem case, at
    least apart from composition exclusions, and so it is possible to write
    it into parsers as a special case. A bit messy, but less messy than
    using numeric entities or named entities.
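
    The equivalence discussed here can be checked with Python's standard
    unicodedata module. This is only a sketch: the helper name
    precomposed_with_base is made up for illustration, and the result of
    the scan depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

NOT_GREATER_THAN = "\u226F"   # U+226F NOT GREATER-THAN
SOLIDUS_OVERLAY = "\u0338"    # U+0338 COMBINING LONG SOLIDUS OVERLAY

# U+226F is canonically equivalent to <">", U+0338>.
decomposed = unicodedata.normalize("NFD", NOT_GREATER_THAN)
recomposed = unicodedata.normalize("NFC", ">" + SOLIDUS_OVERLAY)
print([f"U+{ord(c):04X}" for c in decomposed])
print(f"U+{ord(recomposed):04X}")


def precomposed_with_base(base):
    """List characters whose canonical decomposition starts with `base`."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not characters
        nfd = unicodedata.normalize("NFD", chr(cp))
        if len(nfd) > 1 and nfd[0] == base:
            hits.append(f"U+{cp:04X}")
    return hits


# Scan for precomposed forms built on each delimiter.
for delimiter in ('>', '"', "'"):
    print(repr(delimiter), precomposed_with_base(delimiter))
```

    A parser that normalises its input to NFC before tokenising would
    turn the tag-closing ">" plus U+0338 into the single character
    U+226F, which is why the special case matters.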

    Peter Kirk

