RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

From: jon@hackcraft.net
Date: Wed Dec 10 2003 - 06:38:41 EST


    > > I've seen text/cpp and text/java, but really there are no such
    > > types. I've also
    > > seen text/x-source-code which is at least legal, if of little value to
    > > interoperability.
    > >
    > > The correct MIME type for C and C++ source files is text/plain.
    >
    > This is where I disagree:

    Bring forth the proofs.

    > A plain text file makes no distinction between the
    > meta-linguistic meaning of characters for the programming
    > language that uses and needs them, and the same characters used
    > to create string constants or identifier names.

    Yep. No distinction whatsoever.

    > Unicode cannot, and must not, specify how the meta-characters used in a
    > programming language must combine with other actual strings that are treated
    > by the language syntax itself as _separate tokens_. This means that the
    > concept of combining sequences MUST NOT be used across all language token
    > boundaries. These boundaries are out of the spec of Unicode, but part of the
    > spec for the language, and they must be respected at the first level even
    > before trying to create other combining sequences within the _same_ token.

    C and C++ both describe how a compiler deals with the text it receives;
    beyond saying that source files must be converted to the "native"
    character set, they have nothing further to say about the way characters
    and control characters may or may not affect each other.

    Unicode describes how characters may or may not affect each other, as well as
    specifying some encoding forms.

    The interface between these scopes is problematic, but the problems
    aren't solved by saying that if a compiler chooses to use Unicode text it
    somehow doesn't have to play by Unicode's rules.

    > So even if "text/c", "text/cpp", "text/pascal" or "text/basic" are not
    > officially registered (but "text/java" and "text/javascript" are
    > registered...)

    You mean "So even if "text/c", "text/cpp", "text/pascal", "text/basic",
    "text/java" or "text/javascript" are not officially registered", surely?

    > ...it is important to handle text sources that aren't plain
    > texts as another "text/*" type, for example "text/x-other" or
    > "text/x-source" or "text/x-c" or "text/x-cpp".

    They'd still have to be treated as text/* types, including the fallback
    behaviour of treating them as text/plain if the charset is known and as
    application/octet-stream otherwise.

    > > I'd be prepared
    > > to give good odds that that is the case with Java source files as well.
    >
    > As I said "text/java" is the appropriate MIME type for Java source files...

    I see no text/java. This could be my eyesight, but find-and-replace
    can't find it either.

    java.sun.com uses text/plain to transmit at least some Java source files
    (I did a small survey; I have no intention of HEADing every one of the
    4630 URIs I found there ending in .java).

    > Just imagine what would be created under your assumption from this
    > source:
    > const wchar_t c = L'?';
    > where ? is a combining character. Using the text/plain content type
    > for this C source would imply that it combines with the preceding
    > single quote. This would create an opportunity for canonical
    > composition, and thus would create an "equivalent" source file:
    > const wchar_t c = L§';
    > where this § character is a composed character. Now the source file
    > contains a syntax error and does not compile, even though the previous
    > source compiled and gave the constant c the value of the code point
    > coding the ? diacritic...

    Identifying a problem does not mean you have automatically found the
    solution, still less that the solution you hit upon is already
    prescribed by the relevant standards.

    I don't see your example being much use as source though; human readers
    are hardly likely to find an apostrophe with a cedilla below it and a
    circumflex above it particularly readable code (whereas compilers would
    currently not have much difficulty, since there are no decompositions
    beginning with either U+0022 or U+0027).
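
    That claim is easy to spot-check mechanically. A minimal sketch,
    assuming ICU4C and its unorm2 normalization API are available (the
    choice of library is mine, purely for illustration):

        #include <stdio.h>
        #include <unicode/unorm2.h>

        int main(void) {
            UErrorCode status = U_ZERO_ERROR;
            const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
            /* U+0027 APOSTROPHE followed by U+0309 COMBINING HOOK ABOVE */
            UChar src[] = { 0x0027, 0x0309 };
            UChar dst[8];
            int32_t len = unorm2_normalize(nfc, src, 2, dst, 8, &status);
            if (U_FAILURE(status)) return 1;
            /* NFC defines no composition beginning with U+0027, so the
               length stays 2 and the sequence comes back unchanged. */
            printf("NFC length: %d\n", (int)len);
            return 0;
        }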

    > Of course the programmer could avoid this nightmare by using
    > numeric character references, as in:
    > const wchar_t c = L'\U000309';

    \u must be followed by four hexadecimal digits, \U by eight.
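
    So, hedging a little on the intent (the six-digit form above suggests
    U+0309 was meant), the well-formed spellings would be:

        /* \u takes exactly four hex digits, \U exactly eight. */
        const wchar_t c1 = L'\u0309';     /* U+0309 COMBINING HOOK ABOVE */
        const wchar_t c2 = L'\U00000309'; /* the same character, long form */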

    The biggest advantage of L'\u0309' over direct use of the combining
    character is that you can read the thing (source is intended for human
    readers as well as compilers; the infamous and aptly, if crudely, named
    brainf**k is an example of what programming languages would look like if
    this were not the case).

    Similarly, this enables us to state explicitly the order of combining
    diacritics, which a programmer may conceivably want to do but which
    neither Unicode conformance clause C9 nor simple matters of legibility
    enable one to do with direct use of the characters.
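
    For instance (a sketch; the particular marks are my own choice, purely
    for illustration):

        /* Two spellings that differ only in the order of the combining
           marks; the escapes pin that order down in the source, where
           direct entry would leave it to the editor and to normalization. */
        const wchar_t s1[] = L"a\u0323\u0309"; /* dot below, then hook above */
        const wchar_t s2[] = L"a\u0309\u0323"; /* hook above, then dot below */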

    --
    Jon Hanna                   | Toys and books
    <http://www.hackcraft.net/> | for hospitals:
                                | <http://santa.boards.ie>
    

