Re: Combining across markup? (Was: RE: sign for anti-neutrino - greek nu with diacritical line above workaround ?)

From: Peter Kirk
Date: Wed Aug 11 2004 - 05:23:04 CDT

    On 10/08/2004 23:13, D. Starner wrote:

    >Peter Kirk writes:
    >>That one is easy: this is the closing tag followed by a combining
    >>solidus. The difficult case is if the parser encounters a not greater
    >>than symbol. The parser will need to know to decompose such characters
    >>first, but then a good parser would always need to do that.
    >So all existing XML emitters should be changed, to make sure that not
    >less than symbols and not greater than symbols are escaped? If I were
    >writing an XML document with math content, and added a not less than
    >symbol, I would be sorely surprised to find it starting a tag. Being
    >a Unicode geek, I could figure it out, but I bet many mathematicians
    >wouldn't. Letting not less than symbols open tags would be a big

    If existing XML emitters are not to be changed, they MUST drop any claim
    that what they are emitting is a string of Unicode characters. This is a
    conformance issue. An XML emitter which emits a not less than symbol and
    assumes that it will not be interpreted as a tag start character
    followed by a combining mark is in breach of Unicode conformance
    requirement C9: "A process shall not assume that the interpretations of
    two canonical-equivalent character sequences are distinct."
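
    To make the equivalence concrete, here is a minimal sketch using
    Python's standard unicodedata module. The helper name
    canonically_equivalent is only illustrative; it compares two strings
    after canonical decomposition (NFD).

```python
import unicodedata

NOT_LESS_THAN = "\u226E"            # U+226E NOT LESS-THAN (precomposed)
LESS_THAN_PLUS_SOLIDUS = "<\u0338"  # U+003C followed by U+0338 COMBINING LONG SOLIDUS OVERLAY


def canonically_equivalent(a: str, b: str) -> bool:
    """Illustrative helper: two strings are canonically equivalent
    if their canonical decompositions (NFD) are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The precomposed character decomposes to "<" plus a combining mark.
decomposed = unicodedata.normalize("NFD", NOT_LESS_THAN)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+003C', 'U+0338']

# So a process may not treat the two spellings differently under C9.
print(canonically_equivalent(NOT_LESS_THAN, LESS_THAN_PLUS_SOLIDUS))  # True
```

    The same holds for U+226F NOT GREATER-THAN, which decomposes to ">"
    followed by U+0338.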

    W3C or whoever could get around this problem in at least three ways:

    1) Specify that the string is put into a particular normalisation form
    before parsing;

    2) Specify that "<" followed by a combining mark, or certain combining
    marks, is not interpreted as a tag start character but as a literal,
    i.e. this is some kind of escape mechanism;

    3) Specify that "not less than" etc must be escaped in one of the same
    ways as "less than" - which an intelligent editor could hide from
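
    Options 1 and 3 above can be sketched roughly as follows. This is
    only an illustration under stated assumptions: the function names
    are hypothetical, NFC is picked arbitrarily as the "particular
    normalisation form", and a numeric character reference is assumed
    to be an acceptable way of escaping. No existing XML specification
    or library is claimed to behave this way.

```python
import unicodedata

MARKUP_CHARS = {"<", ">", "&"}


def starts_with_markup_char(ch: str) -> bool:
    """True if ch is a markup character, or canonically decomposes
    to a sequence that begins with one (e.g. U+226E -> '<' + U+0338)."""
    return unicodedata.normalize("NFD", ch)[0] in MARKUP_CHARS


def escape_text(text: str) -> str:
    """Option 3 (sketch): escape 'not less than' etc. the same way as
    'less than', here with a hexadecimal character reference."""
    return "".join(
        f"&#x{ord(ch):X};" if starts_with_markup_char(ch) else ch
        for ch in text
    )


def normalise_before_parsing(document: str) -> str:
    """Option 1 (sketch): put the whole string into one agreed
    normalisation form before the parser looks for tags.
    NFC is an assumption made for this example only."""
    return unicodedata.normalize("NFC", document)


print(escape_text("a \u226E b"))  # a &#x226E; b
```

    With NFC as the agreed form, "<" followed by U+0338 is composed
    into U+226E before parsing, so it is never seen as a tag start;
    with NFD the opposite would happen, which is why the form has to
    be specified.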

    Or Unicode could exceptionally change its decompositions here - which
    could be justified in that the main reason for refusing to change them
    is for compatibility with W3C, so W3C can't complain if they require
    this change.

    But without some such change there is a failure to conform to Unicode.

    Peter Kirk
