Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

From: Philippe Verdy
Date: Thu Aug 12 2004 - 15:03:14 CDT


    > This means that the rules of XML conflict with the rules of Unicode. If
    > the string is a Unicode string, U+226F is canonically equivalent to
    > <U+003E, U+0338> and therefore any higher level protocol should treat
    > the two sequences as identical, rather than reject one of them as
    > causing the document to be ill-formed.

    There's no conflict here:
    will not be *canonically equivalent* (for Unicode) to:
        (here an exclamation point is used instead of the combining solidus)
    which is canonically equivalent (for Unicode) to:
        (here I use a # instead of the <not greater than> character)
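
    To make the comparison concrete, here is a minimal sketch using Python's
    standard unicodedata module. The element name "e" and the helper
    canonically_equivalent() are only illustrative assumptions; it compares NFD
    forms, which is one usual way to test canonical equivalence.

```python
import unicodedata

GT = "\u003E"                  # ">" closing the tag
COMBINING_SOLIDUS = "\u0338"   # COMBINING LONG SOLIDUS OVERLAY
NOT_GREATER_THAN = "\u226F"    # precomposed NOT GREATER-THAN


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: equal NFD forms mean canonical equivalence."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


with_solidus = "<e" + GT + COMBINING_SOLIDUS   # tag followed by U+0338
with_precomposed = "<e" + NOT_GREATER_THAN     # U+226F has absorbed the ">"
with_exclamation = "<e" + GT + "!"             # ordinary character instead

print(canonically_equivalent(with_solidus, with_precomposed))   # True
print(canonically_equivalent(with_solidus, with_exclamation))   # False
```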

    Internally, in the parsed XML tree, the two syntaxes "&#x338;" and "/"
    (combining) will produce the same internal U+0338 character in the DOM tree.
    So the problem is purely a choice of syntax, because the two first elements
    above would be treated identically by any compliant XML parser.
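
    As a quick check, assuming Python's xml.etree.ElementTree as the parser
    (the element name "e" is just a placeholder), both syntaxes give the same
    text content:

```python
import xml.etree.ElementTree as ET

ncr_form = "<e>&#x338;</e>"       # U+0338 written as a numeric character reference
literal_form = "<e>\u0338</e>"    # U+0338 written as the literal combining character

ncr_text = ET.fromstring(ncr_form).text
literal_text = ET.fromstring(literal_form).text

# Both parse to the single character U+0338.
print(ncr_text == literal_text)
print([f"U+{ord(c):04X}" for c in ncr_text])
```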

    When there's a conflict, use an NCR (numeric character reference): it is
    completely equivalent for all compliant XML parsers. An XML document
    generator can be aware of this exception and emit an NCR each time U+0338
    must be encoded in the first position of a text element (a text element
    necessarily follows the ">" that closes a tag in any well-formed XML
    document). The NCR will survive any Unicode normalization applied to the
    whole XML document.
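
    A rough sketch of such a generator rule, in Python. The names
    serialize_text() and parses() are hypothetical, and testing
    unicodedata.combining() is a simplifying assumption: it catches U+0338 and
    other marks with a non-zero combining class, but not every character that
    can compose with a preceding one.

```python
import unicodedata
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def serialize_text(text: str) -> str:
    """Escape text for XML, writing a leading combining mark as an NCR."""
    if text and unicodedata.combining(text[0]) != 0:
        return f"&#x{ord(text[0]):X};" + escape(text[1:])
    return escape(text)


def parses(document: str) -> bool:
    """Return True if the document is accepted by the XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


content = "\u0338x"                               # text starting with U+0338
unsafe = "<e>" + content + "</e>"                 # literal combining character
safe = "<e>" + serialize_text(content) + "</e>"   # "<e>&#x338;x</e>"

# Normalizing the whole document fuses ">" + U+0338 into U+226F in the
# unsafe form, destroying the tag; the NCR form is left untouched.
unsafe_nfc = unicodedata.normalize("NFC", unsafe)
safe_nfc = unicodedata.normalize("NFC", safe)

print(parses(unsafe_nfc))                         # False
print(ET.fromstring(safe_nfc).text == content)    # True
```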

    Note however that a Unicode normalization *modifies* the XML document: XML
    ignores Unicode canonical equivalences, so it will treat the precomposed
    character <e-acute> differently from the two-character sequence <e, acute>.
    If a document is transcoded from Unicode to another charset with an
    algorithm that does not apply a one-to-one mapping of encoded characters,
    the new document will *not* be equivalent for XML. (For most legacy
    charsets, transcoding from that charset to Unicode is normally one-to-one,
    so most document parsers will parse a legacy XML document into a DOM tree
    containing Unicode strings.)
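
    For illustration, again assuming Python's xml.etree.ElementTree and a
    placeholder element "e": the parser hands back the characters exactly as
    written, so the two spellings of e-acute stay distinct unless the
    application normalizes them itself.

```python
import unicodedata
import xml.etree.ElementTree as ET

precomposed = ET.fromstring("<e>\u00E9</e>").text    # U+00E9
decomposed = ET.fromstring("<e>e\u0301</e>").text    # U+0065 U+0301

# Different strings for XML, although canonically equivalent for Unicode.
print(precomposed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```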

    For XML generators that use an internal DOM representation before generating
    the XML document syntax, any character that cannot be mapped one-to-one in
    the target charset of the document MUST use an NCR; not doing so will create
    a document that will later be parsed as different from the original DOM
    tree.
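
    One way to sketch this rule in Python is the "xmlcharrefreplace" error
    handler of str.encode(), which writes an NCR for every character the target
    charset cannot represent. The serialize() function, the element name "e"
    and the sample string are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def serialize(text: str, charset: str) -> bytes:
    """Build a one-element document, using NCRs for unmappable characters."""
    declaration = f'<?xml version="1.0" encoding="{charset}"?>'
    body = "<e>" + escape(text) + "</e>"
    return (declaration + body).encode(charset, errors="xmlcharrefreplace")


original = "caf\u00E9 \u226F \u03BD"   # e-acute, not-greater-than, Greek nu

# ISO-8859-1 can encode e-acute directly; the other two become NCRs.
latin1_document = serialize(original, "iso-8859-1")
round_trip = ET.fromstring(latin1_document).text
print(round_trip == original)          # True

# A lossy conversion without NCRs cannot be parsed back to the original.
lossy = original.encode("iso-8859-1", errors="replace")
print(lossy)                           # b'caf\xe9 ? ?'
```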

    This is also true for all XML-related APIs (DOM, SAX, ...) when they are
    used to get information from the parsed document tree, or when
    authenticating XML document contents (the XML semantics of XML-ignorable
    whitespace are taken into account, and space normalization is applied
    before the signature is computed). These APIs return either an exact
    Unicode string or, if the information is requested in a legacy charset,
    only an approximation of the actual DOM content, because that would imply
    a lossy conversion, unless the request specifies that NCRs are allowed in
    the data returned from the DOM tree.

    As a consequence, a compliant XML parser MUST NOT apply any Unicode
    normalization to the parsed entities (text elements, element names,
    attribute names, attribute values, processing instructions...) without being
    instructed to do so.

    So there's NO conflict between XML document equivalence and Unicode
    canonical equivalence: they are not the same, and they don't need to be the
    same.
