RE: Arabic letters separated by markup

From: Peter Constable
Date: Thu Jun 09 2005 - 12:35:45 CDT

    > From: [] On Behalf Of Philippe Verdy

    > Unicode sees markup in an HTML file as if it was splitting the rich
    > text into many distinct plain-text documents. What this extra markup
    > will do is also not specified.
    > So if you insert markup in the middle of a combining sequence, it is
    > no longer a single combining sequence for Unicode. Instead it will be
    > seen by Unicode as a document ending with a correct combining sequence,
    > and a document starting with a defective combining sequence.

    AFAIK, this personal opinion of Philippe's is not reflected anywhere in
    the Unicode Standard. The most likely place for it to be addressed would
    be UTR20, and it is silent on this matter.

    *My* opinion, supported by the silence of the Unicode Standard on the
    topic, is that it is up to the higher-level protocol -- the HTML spec --
    to specify what impact various markup elements have on the text
    processes that operate over the character content of a document. For
    instance, I would expect the sequences in <TD>abc</TD><TD>def</TD> to be
    treated as distinct document elements, implying no cursive connection
    between them (among other things), but I would expect the sequences
    <span>abc</span><span>def</span> to be considered a single text element
    for rendering purposes (barring further stylesheet effects -- a
    stylesheet might, of course, transform spans into distinct non-inline
    structural elements).
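
To make the distinction concrete, here is a minimal sketch in Python of a
text-run collector that behaves the way I describe above. It is an
illustration only, not anything the HTML spec or the Unicode Standard
defines: the names TextRunCollector, text_runs and RUN_BREAKING_TAGS are
hypothetical, and the set of run-breaking elements is a hand-picked
assumption that ignores stylesheets entirely.

```python
from html.parser import HTMLParser

# Assumption for illustration only: a small, hand-picked set of elements
# that end a text run. A real list would have to come from the HTML spec
# and from the stylesheet in effect.
RUN_BREAKING_TAGS = {"td", "th", "tr", "table", "p", "div", "li", "br"}


class TextRunCollector(HTMLParser):
    """Collects character data into runs, looking through inline markup."""

    def __init__(self):
        super().__init__()
        self.runs = []      # finished text runs
        self._current = []  # pieces of the run being built

    def _flush(self):
        # Close the current run if it holds any text.
        if self._current:
            self.runs.append("".join(self._current))
            self._current = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in RUN_BREAKING_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in RUN_BREAKING_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline markup such as <span> never flushes, so text on either
        # side of it lands in the same run.
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def text_runs(html):
    """Return the list of text runs found in an HTML fragment."""
    collector = TextRunCollector()
    collector.feed(html)
    collector.close()
    return collector.runs
```

Under these assumptions, text_runs("<TD>abc</TD><TD>def</TD>") yields two
runs, while text_runs("<span>abc</span><span>def</span>") yields one, so a
shaping engine fed those runs would see a base letter and a combining mark
split across spans as a single combining sequence.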

    Peter Constable
