RE: Definitions

From: jon@hackcraft.net
Date: Wed Nov 26 2003 - 10:43:44 EST

    Quoting Philippe Verdy <verdy_p@wanadoo.fr>:

    > Peter Kirk [mailto:peterkirk@qaya.org] writes:
    > > Why is this a problem? Quotes and ">" with combining marks are
    > > presumably not legal HTML or XML;
    >
    > You're wrong: it is legal in both HTML and XML. What is not specified
    > correctly is the behavior of HTML and XML parsers face to a XML or HTML
    > document claiming it is coded with a Unicode encoding scheme or any other
    > Unicode-compatible CES (like GB18030, but not completely with MacRoman as it
    > contains supplementary characters that are not part of the Unicode/ISO/IEC
    > 10646 repertoire).
    >
    > > and so the interpretation of a quotes
    > > or ">" followed by combining marks as a quote or ">" and a defective
    > > combining sequence is unambiguous, surely?
    >
    > No it is not: there's a problem of prevalence between XML/HTML/SGML parsing
    > rules, and Unicode parsing rules. Using character entities can solve this
    > problem, but I would really prefer that the W3 accepts a modification of its
    > parsing rules so that any text element or attribute value starting by a
    > defective combining sequence MUST NOT be interpreted as such using the
    > simple encoding scheme.

    The Character Model defines degrees of normalisation of text which go beyond
    NFC to prohibit the sequences you describe. Standards can use these definitions
    to prevent the issues associated with them.

    > > Your proposed solution to the problem is messy in requiring the use of
    > > numeric entities, and unnecessary.
    >
    > This is not that messy. Also I did not say that numeric entities must be
    > used. Any parsed named entity could be used as well. This is not a problem
    > of the Unicode standard, but a problem of the SGML, HTML 4.01, and XML
    > standards. For SGML and HTML up to 4.01, you also have problems with the
    > equal sign (because the quotes around element's attribute values are not
    > mandatory, unlike in XML).

    It is messy, because the escaping would have to occur on serialisation from
    a model of an XML document which hid the use of entities. Hence if we parsed
    <example>&sol; is a combining long solidus overlay</example>, where &sol;
    expanded to the single character U+0338, then we might have that stored as a
    text node beginning with that character, receive it as a text event
    beginning with that character, etc., all in expanded form.

    On serialisation we would have to serialise as <example>&#x338; is a
    combining long solidus overlay</example>, which would be relatively
    difficult to produce, though considerably easier to produce than the
    original &sol;.
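
    As a rough sketch of what that serialisation step might look like, assuming
    a hypothetical helper named serialise_text and treating "combining" as
    "general category M*" (a simplification of the Character Model's own
    definition), one could write in Python:

```python
import unicodedata
from xml.sax.saxutils import escape


def serialise_text(text: str) -> str:
    """Serialise character data, escaping a leading combining mark.

    Hypothetical helper: a leading character whose general category
    starts with "M" is written as a numeric character reference, so it
    cannot attach to the ">" of the preceding start tag.
    """
    if text and unicodedata.category(text[0]).startswith("M"):
        # Write the mark as &#xHHH; and escape the remainder as usual.
        return f"&#x{ord(text[0]):X};" + escape(text[1:])
    return escape(text)


content = "\u0338 is a combining character"
print("<example>" + serialise_text(content) + "</example>")
```

    The serialiser has to do this for every text node and attribute value,
    which is the messiness I mean; it has no way of knowing that the character
    originally came from &sol;.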

    Of course in this case it's more crucial than in others, since inserting the
    character directly into the stream and then normalising it with NFC would
    produce a non-well-formed document: ">" followed by U+0338 composes to the
    single character U+226F NOT GREATER-THAN, so the start tag loses its closing
    ">". That isn't true with other combining characters.
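
    This can be checked with a short Python sketch. The is_well_formed helper
    is my own name for illustration; unicodedata.normalize and
    xml.etree.ElementTree are from the standard library:

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# U+0338 placed directly after the ">" of the start tag.
raw = "<example>\u0338 is a combining character</example>"

# NFC composes ">" + U+0338 into U+226F NOT GREATER-THAN.
normalised = unicodedata.normalize("NFC", raw)

print(is_well_formed(raw))         # the unnormalised document
print("\u226f" in normalised)      # whether the ">" was absorbed
print(is_well_formed(normalised))  # the document after NFC
```

    The same composition happens with "<" (U+226E) and "=" (U+2260), so the
    damage is not limited to the end of a start tag.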

    All in all I would rather ban all defective sequences, by enforcing the W3C
    character model. I don't see much point in them. The only possible reason I
    can think of right now is to allow description of the character itself,
    though that would possibly best be done through an element that represents
    the concept of a Unicode character, along the lines of
    <character codepoint="824" />.
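
    To illustrate, here is a minimal Python sketch of how such an element might
    be produced and consumed. The element name and the decimal codepoint
    attribute are taken from the example above; the describe and to_char
    functions are hypothetical names, not part of any existing vocabulary:

```python
import xml.etree.ElementTree as ET


def describe(codepoint: int) -> ET.Element:
    """Build a <character> element that names a code point by number."""
    return ET.Element("character", {"codepoint": str(codepoint)})


def to_char(elem: ET.Element) -> str:
    """Recover the character that a <character> element describes."""
    return chr(int(elem.get("codepoint")))


elem = describe(0x338)  # 0x338 is 824 in decimal
markup = ET.tostring(elem, encoding="unicode")
print(markup)
```

    The serialised markup never contains the combining character itself, so
    there is nothing for normalisation to combine with the surrounding tags.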


