RE: Definitions

From: Philippe Verdy
Date: Wed Nov 26 2003 - 09:17:55 EST

    Peter Kirk writes:
    > Why is this a problem? Quotes and ">" with combining marks are
    > presumably not legal HTML or XML;

    You're wrong: it is legal in both HTML and XML. What is not specified
    correctly is how HTML and XML parsers should behave when faced with an XML
    or HTML document that claims to be encoded with a Unicode encoding scheme or
    any other Unicode-compatible CES (like GB18030, but not completely MacRoman,
    as it contains supplementary characters that are not part of the
    Unicode/ISO/IEC 10646 repertoire).

    > and so the interpretation of a quotes
    > or ">" followed by combining marks as a quote or ">" and a defective
    > combining sequence is unambiguous, surely?

    No, it is not: there's a problem of precedence between the XML/HTML/SGML
    parsing rules and the Unicode parsing rules. Using character entities can
    solve this problem, but I would really prefer that the W3C accept a
    modification of its parsing rules, so that any text element or attribute
    value starting with a defective combining sequence MUST NOT be interpreted
    as such when encoded directly with the simple encoding scheme. If an XML
    document is serialized into a text file with an encoding scheme, the
    generated file should (I would prefer "must") not encode these defective
    sequences with the encoding scheme, but with character references only.
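
A minimal sketch of the serialization rule proposed above, in Python. The function name escape_defective_start is hypothetical, and treating every character in general category Mn, Mc or Me as "combining" is an assumption made for illustration; it is not taken from any W3C or Unicode specification.

```python
import unicodedata


def is_combining(ch):
    # Assumption: nonspacing, spacing and enclosing marks all count as combining.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def escape_defective_start(value):
    """Write any leading combining marks of a text or attribute value
    as numeric character references, leaving the rest untouched."""
    i = 0
    while i < len(value) and is_combining(value[i]):
        i += 1
    escaped = "".join("&#x{:X};".format(ord(ch)) for ch in value[:i])
    return escaped + value[i:]


if __name__ == "__main__":
    # Element content that starts with U+0338 COMBINING LONG SOLIDUS OVERLAY.
    print("<a>" + escape_defective_start("\u0338bc") + "</a>")
```

All leading marks are escaped, not just the first, so that no raw combining mark ends up directly after the ";" of a reference.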

    This would allow exactly the SAME text parser used for Unicode to serve as
    the input for the lexical and grammatical analysis of the XML/HTML/SGML
    parser. Within that model, the sequence ">" + combining character would be
    seen as a single combining sequence coding an abstract character, which
    breaks the expected end-of-tag syntax. The same goes for the quotes
    delimiting the start of attribute values, or for the square bracket
    delimiting the start of a CDATA section.
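
To illustrate the model described above, here is a rough Python sketch that splits text into combining sequences before any markup analysis. The function combining_sequences is a hypothetical helper, and grouping by general category "M*" is a simplification of the Unicode text segmentation rules, not a full implementation of them.

```python
import unicodedata


def combining_sequences(text):
    """Split text into units of one base character plus the marks that follow.
    A mark with no preceding base becomes its own (defective) unit."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch  # attach the mark to the preceding unit
        else:
            units.append(ch)
    return units


if __name__ == "__main__":
    units = combining_sequences("<a>\u0338b</a>")
    # The first ">" is absorbed into a unit with U+0338, so a lexer working
    # on these units no longer sees a bare ">" closing the start tag.
    print([u.encode("unicode_escape").decode("ascii") for u in units])
```

A lexer fed these units rather than raw code points would not find the tag-closing delimiter where a code-point-level XML parser finds it, which is the conflict between the two sets of rules.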

    > There could of course be
    > problems if there were any precomposed combinations of quotes or ">"
    > with combining characters, but I don't think there are any, are there?

    There are such precomposed sequences in Unicode. Look in
    NormalizationTest.txt for the places where ">" and single and double quotes
    are used as part of a combining sequence... Notably, look at sequences made
    with the combining solidus overlay; add also the case of enclosing combining
    characters, and of mathematical operators that can be created with a
    combining sequence starting with ">" or "=" or single or double quotes, and
    modified by diacritics.
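
The solidus overlay case can be checked with Python's standard unicodedata module. This is only a sketch: the helper name nfc_with_solidus is hypothetical, U+0338 COMBINING LONG SOLIDUS OVERLAY is assumed to be the overlay meant above, and the quote cases are not tested here.

```python
import unicodedata

SOLIDUS_OVERLAY = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY


def nfc_with_solidus(base):
    """Return the NFC form of base followed by the solidus overlay."""
    return unicodedata.normalize("NFC", base + SOLIDUS_OVERLAY)


if __name__ == "__main__":
    for base in "<=>":
        composed = nfc_with_solidus(base)
        # A single resulting code point means a precomposed character exists.
        print(base, len(composed), " ".join("U+%04X" % ord(c) for c in composed))
```

Under NFC, each of these three sequences collapses into a single precomposed mathematical operator, so a process that normalizes before parsing would make the ASCII delimiter disappear entirely.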

    > Your proposed solution to the problem is messy in requiring the use of
    > numeric entities, and unnecessary.

    This is not that messy. Also, I did not say that numeric entities must be
    used. Any parsed named entity could be used as well. This is not a problem
    of the Unicode standard, but a problem of the SGML, HTML 4.01, and XML
    standards. For SGML and HTML up to 4.01, you also have problems with the
    equal sign (because the quotes around an element's attribute values are not
    mandatory, unlike in XML).

    We don't have this problem for element names or attribute names, because
    they must obey a stricter syntax and can't be any arbitrary Unicode string:
    these names cannot contain defective combining sequences simply because
    combining characters cannot be identifier starts.

    This archive was generated by hypermail 2.1.5 : Wed Nov 26 2003 - 10:27:07 EST