Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

From: jcowan@reutershealth.com
Date: Tue Dec 09 2003 - 07:52:14 EST

    Philippe Verdy scripsit:

    > When in doubt, don't perform any normalization of XML _files_ as they are
    > NOT plain text: you need a XML parser to do it safely only in relevant
    > sections of this file. All you could do safely is to possibly reencode XML
    > files (for example from UTF-8 to UTF-16 encoding schemes).

    This is wildly overstated. XML files most certainly are plain text,
    though they may be interpreted as fancy text in contexts that understand
    XML. With the insignificant exception of a markup ">" immediately
    followed by a U+0338 character, it is entirely safe to normalize XML
    files according to any normalization form. (It is true that the NFK* normalization
    forms may lose information, but XML document authors are discouraged
    from using compatibility decomposables in any case.)
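
A minimal sketch of the exception mentioned above, using Python's standard `unicodedata` module. Under NFC, ">" followed by U+0338 composes to U+226F (NOT GREATER-THAN), so a tag-closing ">" can vanish. The helper name `normalize_xml_text` is hypothetical, and as an assumption it simply refuses any such sequence without checking whether the ">" is markup or character data.

```python
import unicodedata

OVERLAY = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY


def normalize_xml_text(text, form="NFC"):
    """Normalize a whole XML file as plain text, refusing the one risky input.

    Hypothetical helper: it rejects any '>' directly followed by U+0338
    without checking whether that '>' is markup or character data.
    """
    if ">" + OVERLAY in text:
        raise ValueError("'>' followed by U+0338 would compose to U+226F")
    return unicodedata.normalize(form, text)


risky = "<a>" + OVERLAY + "x</a>"
damaged = unicodedata.normalize("NFC", risky)  # the tag-closing '>' is absorbed
safe = normalize_xml_text("<a>e\u0301</a>")    # e + U+0301 composes to U+00E9
```

Any other input passes through `unicodedata.normalize` unchanged in structure, which is the sense in which whole-file normalization is safe.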

    What is not allowed, and this makes XML technically non-conformant to the
    Unicode Standard, is to make arbitrary and unsystematic replacements of
    one canonically equivalent form with another. For example, if an element
    name is "hétérogénéité" (a favorite word of mine), decomposing the
    start-tag while leaving the end-tag composed would make the document no
    longer well-formed XML. In my opinion, this is a corner case that may
    be safely ignored.
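
The corner case can be illustrated with a short sketch, assuming Python's standard `xml.etree.ElementTree` parser stands in for a generic XML processor. The helper `is_well_formed` and the sample documents are made up for illustration. Normalizing the whole document keeps the tags matched; decomposing only the start-tag does not.

```python
import unicodedata
import xml.etree.ElementTree as ET

NAME = "h\u00e9t\u00e9rog\u00e9n\u00e9it\u00e9"  # "hétérogénéité", composed (NFC)


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


composed = f"<{NAME}>x</{NAME}>"

# Systematic: every character of the file is decomposed, so the tags still match.
decomposed_everywhere = unicodedata.normalize("NFD", composed)

# Unsystematic: only the start-tag is decomposed, so the names differ code point
# by code point even though they are canonically equivalent.
mixed = f"<{unicodedata.normalize('NFD', NAME)}>x</{NAME}>"

for label, doc in [("composed", composed),
                   ("decomposed everywhere", decomposed_everywhere),
                   ("mixed", mixed)]:
    print(label, is_well_formed(doc))
```

The parser compares element names code point by code point, which is why the mixed document is rejected.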

    -- 
    John Cowan  www.reutershealth.com  www.ccil.org/~cowan  jcowan@reutershealth.com
    'Tis the Linux rebellion / Let coders take their place,
    The Linux-nationale / Shall Microsoft outpace,
    We can write better programs / Our CPUs won't stall,
    So raise the penguin banner of / The Linux-nationale.
    

