RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 08 2003 - 18:10:59 EST


    Peter Kirk writes:
    > Agreed. But now we are told that the latter is illegal XML because a
    > combining mark is not permitted (by XML, not by Unicode) after <span>.

    It is not forbidden by XML. The problem is that an XML file is not plain
    text: normalizing the whole file as if it were Unicode plain text may
    compose a combining mark with a character that is part of the XML syntax.

    This creates problems where a defective combining sequence in XML
    immediately follows one of these:
    - the quotation mark that opens an XML attribute value,
    - the opening bracket that starts a CDATA section,
    - the greater-than sign that closes an XML tag or processing instruction,
    - a delimiter inside the content of <script>, <style> or <object>-like
      elements, which is not plain text and may use various characters to
      enclose Unicode string values; the exact problems there depend on the
      scripting language used in these elements.
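
    To illustrate the tag-closing case, here is a minimal Python sketch. The
    sample document and the helper name parses() are made up for the
    illustration; it assumes the combining mark is U+0338 COMBINING LONG
    SOLIDUS OVERLAY, which composes with ">" under NFC.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical sample: a <span> whose text starts with a defective combining
# sequence (U+0338 with no base character before it inside the element).
raw = "<span>\u0338x</span>"

# Parsed as XML, the document is fine and the mark belongs to the text.
text_before = ET.fromstring(raw).text

# Normalizing the whole file as plain text composes ">" + U+0338 into the
# single character U+226F NOT GREATER-THAN, so the tag is never closed.
normalized = unicodedata.normalize("NFC", raw)


def parses(document: str) -> bool:
    """Return True if the document is well-formed XML (illustrative helper)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False
```

    Here parses(raw) succeeds while parses(normalized) fails, because the
    normalizer has consumed the ">" that closed the start tag.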

    For these reasons, normalization should be used with care on XML files.
    XML encoders should take the XML syntax into account first: rather than
    converting the whole file as if it were plain text, they should encode
    each plain-text string that occurs within the parsed XML tree, possibly
    using numeric or named character entities for the initial diacritics of
    strings that start with a defective combining sequence.
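
    A rough sketch of such an encoder step in Python follows. The function
    encode_text() is a hypothetical helper, not part of any library, and it
    assumes that "starts with a defective combining sequence" can be
    approximated by "first character has a general category starting with M".

```python
import unicodedata
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def encode_text(text: str) -> str:
    """Escape text for XML element content, writing a leading combining
    mark as a numeric character reference (hypothetical helper)."""
    if text and unicodedata.category(text[0]).startswith("M"):
        # The mark no longer sits next to the ">" of the preceding tag.
        return f"&#x{ord(text[0]):X};" + escape(text[1:])
    return escape(text)


# Serialize one string from the tree, then normalize the file as a whole.
document = "<span>" + encode_text("\u0338x") + "</span>"
after_nfc = unicodedata.normalize("NFC", document)

# The parser still recovers the original defective sequence.
recovered = ET.fromstring(after_nfc).text
```

    With the mark written as a character reference, a whole-file NFC pass
    leaves the markup untouched, and the parsed text is still the original
    string. Attribute values would need the same treatment after the opening
    quotation mark.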

    In that case (with all due care taken in the XML encoder), an XML parser
    will never be fooled by an NFC normalizer applied to its input, yet it
    will still be able to represent texts containing defective combining
    sequences without collision with the XML syntax.

    The W3C just _recommends_ the NFC form, but does not mandate it. In XML,
    text elements and attribute values are just data and are not limited or
    intended to represent only plain text. That's a good reason why defective
    combining sequences are not even forbidden in XML, and why an XML parser is
    not supposed to force any normalization of its input:

    The _Unicode canonical equivalence_ of strings is not considered as
    _equality_ in XML: XML treats canonically equivalent strings coded with
    distinct sequences of code points as _distinct_ for processing purposes.
    It's up to the application using the parsed XML DOM tree or InfoSet to
    decide whether the text elements and attribute values are plain text and
    should be normalized before actual processing (for example by an XSLT
    stylesheet).
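
    As a small Python illustration (the element name <w> and the variable
    names are invented for the example), a parser hands back the two
    canonically equivalent spellings of "é" as different strings, and they
    only compare equal once the application chooses to normalize them:

```python
import unicodedata
import xml.etree.ElementTree as ET

# The same visible text, coded with two distinct code point sequences.
composed = ET.fromstring("<w>\u00e9</w>").text     # U+00E9
decomposed = ET.fromstring("<w>e\u0301</w>").text  # U+0065 U+0301

# The parser does not normalize, so the strings differ after parsing.
same_for_xml = composed == decomposed

# The application may decide the data is plain text and normalize it.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```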

    When in doubt, don't perform any normalization of XML _files_, as they are
    NOT plain text: you need an XML parser to do it safely, and only in the
    relevant sections of the file. All you can safely do to the file as a
    whole is re-encode it (for example from the UTF-8 to the UTF-16 encoding
    scheme).
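
    A sketch of parser-based normalization in Python is given here. The
    function normalize_tree() is a hypothetical helper, and it assumes that
    every text node and attribute value in the sample really is plain text,
    which in general only the application can know.

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalize_tree(root: ET.Element, form: str = "NFC") -> None:
    """Normalize text, tails and attribute values in place, leaving tag
    names and all markup untouched (hypothetical helper)."""
    for element in root.iter():
        if element.text:
            element.text = unicodedata.normalize(form, element.text)
        if element.tail:
            element.tail = unicodedata.normalize(form, element.tail)
        # Copy the items first so the mapping is not mutated while iterating.
        for name, value in list(element.attrib.items()):
            element.attrib[name] = unicodedata.normalize(form, value)


# Made-up sample with decomposed text in an attribute, a text and a tail.
root = ET.fromstring('<p title="e\u0301">e\u0301<b/>e\u0301</p>')
normalize_tree(root)
```

    Because the normalizer only ever sees one string at a time, a combining
    mark at the start of a string cannot attach to a neighbouring delimiter.
    When the tree is written out again, the encoder still has to protect any
    leading combining mark.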






    This archive was generated by hypermail 2.1.5 : Mon Dec 08 2003 - 18:58:10 EST