RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 09 2003 - 14:36:31 EST


    jon@hackcraft.net writes:
    > > > You might as well say that C code is not plain text because it too is
    > > > subject to special canons of interpretation.
    > >
    > > C, C++ and Java source files are not plain text as well (they
    > > have their own
    >
    > C, C++ and Java source files are plain text.
    >
    > > "text/*" MIME type, which is NOT "text/plain" notably because
    > > of the rules
    >
    > I've seen text/cpp and text/java, but really there are no such
    > types. I've also
    > seen text/x-source-code which is at least legal, if of little value to
    > interoperability.
    >
    > The correct MIME type for C and C++ source files is text/plain.

    This is where I disagree: plain text makes no distinction between
    characters that carry a meta-linguistic meaning for the programming
    language that uses and needs them, and the same characters used to create
    string constants or identifier names.

    Unicode cannot, and must not, specify how the meta-characters used in a
    programming language combine with other strings that the language syntax
    itself treats as _separate tokens_. This means that the concept of
    combining sequences MUST NOT be applied across any language token
    boundary. These boundaries are outside the scope of Unicode but are part
    of the spec for the language, and they must be respected first, before
    any attempt to form combining sequences within the _same_ token.
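
    To illustrate the point about token boundaries, here is a minimal sketch
    in Python. It uses markup tags as the "tokens", because ">" followed by
    U+0338 COMBINING LONG SOLIDUS OVERLAY has a canonical composite (U+226F).
    The TAG pattern and both function names are hypothetical, and the
    pattern is far too crude to be a real lexer; only unicodedata.normalize
    comes from the standard library.

```python
import re
import unicodedata

# Hypothetical, simplified notion of a "token": anything between < and >.
TAG = re.compile(r"(<[^>]*>)")

def normalize_whole(text):
    """Normalize the text as if it were plain text, ignoring token boundaries."""
    return unicodedata.normalize("NFC", text)

def normalize_per_token(text):
    """Split at token boundaries first, then normalize each piece separately."""
    pieces = TAG.split(text)
    return "".join(unicodedata.normalize("NFC", piece) for piece in pieces)

# Element content that starts with a combining mark.
sample = "<b>\u0338x</b>"

# False: the closing ">" of the tag was fused with the mark into U+226F.
print(normalize_whole(sample) == sample)
# True: the boundary was respected, so nothing changed.
print(normalize_per_token(sample) == sample)
```

    The design choice is simply to apply the language's own tokenization
    before any Unicode-level processing, so that composition can only happen
    inside one token.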

    So even if "text/c", "text/cpp", "text/pascal" or "text/basic" are not
    officially registered (but "text/java" and "text/javascript" are
    registered...) it is important to handle source texts that aren't plain
    text as another "text/*" type, for example "text/x-other" or
    "text/x-source" or "text/x-c" or "text/x-cpp".

    > I'd be prepared
    > to give good odds that that is the case with Java source files as well.

    As I said "text/java" is the appropriate MIME type for Java source files...

    > > associated with end-of-lines, notably in presence of comments).
    >
    > As source files (that is, at the stage in processing at which a
    > human user can see the source and edit it) the only handling required
    > for end-of-lines is conversion of new line function characters, the same
    > as for any other use of plain text.
    >
    > The treatment of end-of-lines as significant when processed (for example
    > following one-line // comments) is a matter of what an
    > application chooses to do with a particular character. This is no
    > different than an indexer deciding that a plain text file contains a
    > particular word, or for that matter in my putting coffee filters into my
    > basket if I see "coffee filters" written on my shopping list.

    Just imagine what your assumption would produce with this source:
            const wchar_t c = L'?';
    where ? is a combining character. Using the text/plain content type for
    this C source would imply that it combines with the preceding single
    quote. This would create an opportunity for canonical composition, and
    thus would create an "equivalent" source file which would be:
            const wchar_t c = L§';
    where this § character is a composed character. Now the source file
    contains a syntax error and does not compile, even though the previous
    source compiled and gave the c constant the value of the code point
    encoding the ? diacritic...
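
    As a rough way to see where this can actually happen, the following
    Python sketch scans a source string for ASCII punctuation immediately
    followed by a combining mark, and reports only the pairs that NFC would
    fuse into a single code point. Whether a given pair is affected depends
    on whether Unicode defines a precomposed character for it. The function
    name is hypothetical; str.isascii needs Python 3.7 or later.

```python
import unicodedata

def find_merging_pairs(text):
    """Return (index, base, mark) for each ASCII punctuation character that
    NFC would fuse with the combining mark immediately following it."""
    hits = []
    for i in range(len(text) - 1):
        base, mark = text[i], text[i + 1]
        # Only look at ASCII characters that could be language syntax.
        if not base.isascii() or base.isalnum() or base.isspace():
            continue
        # A canonical combining class of 0 means "not a combining mark".
        if unicodedata.combining(mark) == 0:
            continue
        # The pair is at risk if NFC collapses it to one code point.
        if len(unicodedata.normalize("NFC", base + mark)) == 1:
            hits.append((i, base, mark))
    return hits

# "=" followed by U+0338 has the canonical composite U+2260.
print(find_merging_pairs("a =\u0338 b"))
```

    An empty result means only that no pair of this simple kind was found;
    it is not a guarantee that normalization leaves the file untouched.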

    Of course the programmer could avoid this nightmare by using a universal
    character name, as in:
            const wchar_t c = L'\u0309';
    or maybe (but less portable, as it assumes that the runtime encoding form
    used by wchar_t is UCS4, UTF-16 or UTF-32, when the source file may be
    coded in a non-Unicode charset):
            const wchar_t c = (wchar_t)0x000309ul;

    > > > But both XML/HTML/SGML and the various programming languages are plain
    > > text.
    > >
    > > See "text/xml", "text/html" and "text/sgml" MIME types. They also aren't
    > > "text/plain" so they have their own interpretation of Unicode characters
    > > which is not the one found in the Unicode standard.
    >
    > They have their own interpretation of the Unicode characters which is *in
    > addition to*

    This is not *in addition to* but *instead of*, and thus it breaks the rule
    of Unicode conformance at that level: the code point no longer has the
    meaning REQUIRED of it by conforming applications, namely that of a code
    point encoding an abstract character with a well-defined representative
    glyph and REQUIRED composability with surrounding characters.

    > the one found in the Unicode standard. As to all but the simplest
    > applications that use Unicode (as interesting as many of them are,
    > characters are of little use on their own).

    Note that a simple text editor such as NotePad can safely be used to edit
    source files, simply because it does not attempt to perform any
    normalization of the files it loads or saves, even while editing them
    (there's not even an Edit menu option to normalise an area of the text in
    the edit buffer).

    Most editors for programming languages treat individual characters as
    truly individual and completely unrelated to each other. This means that
    they won't attempt any normalization, so characters will not be reordered
    or recomposed. This is an essential requirement for programming source
    files, but it is not required for plain text files.
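
    For a tool that does want to normalize, a cautious approach is to check
    first whether normalization would change anything. This is a minimal
    sketch in Python, not a description of what any particular editor does;
    the function name is hypothetical and only unicodedata.normalize is
    taken from the standard library.

```python
import unicodedata

def first_change_under_nfc(text):
    """Return the index of the first code point NFC would alter, or None
    if the text is already in NFC."""
    normalized = unicodedata.normalize("NFC", text)
    if normalized == text:
        return None
    # Walk both strings in parallel and report the first difference.
    for i, (before, after) in enumerate(zip(text, normalized)):
        if before != after:
            return i
    # One string is a prefix of the other: the change is at the shorter end.
    return min(len(text), len(normalized))

# Pure ASCII source is always already normalized.
print(first_change_under_nfc("int x = 1;"))
# "e" followed by U+0301 inside a string literal would be recomposed.
print(first_change_under_nfc('s = "e\u0301"'))
```

    An editor built this way could refuse to normalize, or ask the user,
    whenever the function returns an index rather than None.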






    This archive was generated by hypermail 2.1.5 : Tue Dec 09 2003 - 15:24:09 EST