RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 09 2003 - 08:17:05 EST

    > -----Original Message-----
    > From: Peter Kirk [mailto:peterkirk@qaya.org]
    > Sent: Tuesday, 9 December 2003 13:17
    > To: verdy_p@wanadoo.fr
    > Cc: Unicode@Unicode.Org
    > Subject: Re: Coloured diacritics (Was: Transcoding Tamil in the presence
    > of markup)
    >
    >
    > On 09/12/2003 03:41, Philippe Verdy wrote:
    >
    > >Peter Kirk writes:
    > >
    > >
    > >>Philippe, you have now stated this (several times). But just a day
    > >>earlier you yourself stated that the rule forbidding combining marks at
    > >>the start of a string would never be relaxed because it is fundamental
    > >>to the XML containment model. You don't usually contradict yourself
    > >>quite so obviously.
    > >>
    > >>
    > >
    > >I don't know how you interpreted what I may have said a few days ago.
    > >I have certainly not said that XML forbids combining marks at the start
    > >of XML, just that the W3C does not _recommend_ them, or any other
    > >defective combining sequences, as they are known to cause problems
    > >(for example when it is difficult to track the effective text file type).
    > >
    > >
    > So, let's get this clear. Within an XML or HTML document, if I want an e
    > with a red acute accent on it, it is quite permissible to write:
    >
    > e<span class="red-text">{U+0301}</span>
    >
    > where {U+0301} is replaced by the actual Unicode character, and
    > "red-text" is defined in the stylesheet. So it is not a problem that
    > there is a defective combining sequence, nor that the accent is not
    > combined with the e as it would be in NFC. Is that correct?

    That's right: the text element within <span> contains just the string with
    the isolated diacritic, which is already in NFC even though it is a
    defective combining sequence. And it must not be parsed as a combining
    sequence that includes the ">" terminating the <span> tag: combining
    sequences may only be interpreted within plain text, which excludes the
    syntactic characters used in XML.
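
    To make the NFC claim concrete, here is a small Python sketch using the
    standard unicodedata module. The helper name is_nfc is invented for this
    example; it only illustrates the point and is not taken from any
    specification.

```python
import unicodedata

ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT


def is_nfc(s: str) -> bool:
    """Return True if s is unchanged by NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", s) == s


# Text content of the <span>: an isolated (defective) combining mark.
span_text = ACUTE

# The same two characters as plain text, with no markup in between.
plain_text = "e" + ACUTE

# The marked-up form from the question, taken as a raw character string.
marked_up = 'e<span class="red-text">' + ACUTE + "</span>"

print(is_nfc(span_text))    # True: a lone combining mark is already in NFC
print(is_nfc(plain_text))   # False: plain "e" + U+0301 recomposes
print(unicodedata.normalize("NFC", plain_text) == "\u00e9")  # True
print(is_nfc(marked_up))    # True: ">" + U+0301 has no precomposed form
```

    The isolated accent is already in NFC, while the same two characters as
    plain text would be recomposed into U+00E9; the markup between them is
    what keeps them apart.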

    Note that this is not specific to XML. Any "text/*" format that is not
    plain text (notably programming source files, shell scripts, HTML files,
    stylesheets, and JavaScript files) should be handled this way: the
    syntax of the language governs how the file is parsed, and Unicode
    definitions apply only afterwards, to the tokens parsed in that
    language.
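
    The "syntax first, Unicode second" order can be sketched in Python as
    follows. This is only an illustration: the regular expression is a
    deliberately crude stand-in for a real XML/HTML parser (it ignores
    comments, CDATA sections, and ">" inside attribute values), and the name
    normalize_text_nodes is made up for the example.

```python
import re
import unicodedata

# Crude tag matcher, an assumption for this sketch; a real tool would use
# a proper parser for the language in question.
TAG = re.compile(r"(<[^>]*>)")


def normalize_text_nodes(markup: str) -> str:
    """Apply NFC to each text run separately, leaving tags untouched."""
    parts = TAG.split(markup)
    # Because the pattern has one capturing group, odd indices hold the
    # tags and even indices hold the text runs between them.
    return "".join(
        part if i % 2 else unicodedata.normalize("NFC", part)
        for i, part in enumerate(parts)
    )


coloured = 'e<span class="red-text">\u0301</span>'
ordinary = "<p>e\u0301</p>"

# The accent stays inside its own <span>, separate from the "e".
print(normalize_text_nodes(coloured) == coloured)        # True
# Within a single text run, normal composition still applies.
print(normalize_text_nodes(ordinary) == "<p>\u00e9</p>")  # True
```

    Each text run is normalized on its own, so a combining mark at the start
    of a run is never attached to a character on the other side of a tag.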

    So normalization should never be performed on whole files that are not
    explicitly of type "text/plain" (as indicated either by explicit metadata
    such as MIME headers during transmission, or locally by OS-specific
    conventions such as a ".txt" file extension).
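
    A case where whole-file normalization actually damages markup: U+0338
    COMBINING LONG SOLIDUS OVERLAY composes with a preceding ">" into U+226F
    under NFC. The Python sketch below shows this, together with a guard
    that normalizes only when the declared type is text/plain. The function
    name normalize_if_plain and its parameters are hypothetical.

```python
import unicodedata


def normalize_if_plain(text: str, mime_type: str) -> str:
    """NFC-normalize only content explicitly labelled text/plain."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    if base_type == "text/plain":
        return unicodedata.normalize("NFC", text)
    return text  # unknown or structured type: preserve as-is


html = "<b>\u0338</b>"  # a lone overlay mark as the content of <b>

# Blind whole-file NFC fuses the tag's ">" with the mark: the tag is gone.
broken = unicodedata.normalize("NFC", html)
print(broken == "<b\u226f</b>")                                      # True

# The guard leaves HTML alone and still normalizes real plain text.
print(normalize_if_plain(html, "text/html") == html)                 # True
print(normalize_if_plain("e\u0301", "text/plain; charset=utf-8") == "\u00e9")  # True
```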

    When in doubt, for example in CVS repositories or in diff/merge tools,
    normalization must not be performed, and the current encoding form of
    text files must be preserved, whenever a tool does not implement
    an accurate parser for the syntactic and lexical rules of the effective
    file type or language. That language may or may not accept defective
    combining sequences as valid plain-text strings (this includes identifiers,
    although Unicode recommends a list of characters that can be used to
    start an identifier, and this list excludes all non-starter combining
    characters).
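
    As one example of a language that follows the identifier recommendation,
    Python bases its own identifier rules on it, so str.isidentifier() can
    serve as a rough check. This sketch shows only Python's behaviour, not
    that of other languages; the helper is_non_starter is invented for the
    example.

```python
import unicodedata

ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT


def is_non_starter(ch: str) -> bool:
    """True if ch has a non-zero canonical combining class (hypothetical helper)."""
    return unicodedata.combining(ch) != 0


well_formed = "cafe" + ACUTE   # mark follows a base letter
defective = ACUTE + "cafe"     # mark at the very start

print(is_non_starter(ACUTE))       # True
print(well_formed.isidentifier())  # True: marks may continue an identifier
print(defective.isidentifier())    # False: a mark may not start one
```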






    This archive was generated by hypermail 2.1.5 : Tue Dec 09 2003 - 09:27:05 EST