RE: New Charakter Proposal

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Oct 30 2002 - 09:46:05 EST

    Dominikus Scherkl wrote:
    > I would like to have a "source failure indicator symbol" (SFIS)
    > character in Unicode, which a charset-conversion unit may
    > insert into a text (Suggested position: U+FFF8).
    >
    > [...]
    >
    > Of course a converter can still use U+FFFD if it has no
    > idea which character is intended or if Unicode doesn't contain
    > the character.

    I remember reading on this list about a proposal to allocate 256 code points
    to represent the bytes of a non-Unicode character set which could not be
    converted to Unicode.

    What happened to that proposal? Was it ever formalized? If so, was it
    rejected?

    > The whole "character identities" discussion gave me another
    > reason to introduce such an SFIS character:
    > A font renderer may show the SFIS before a character which
    > is replaced by another one [...]

    Sorry for repeating myself, but my opinion is that a renderer is *never*
    allowed to change one character to another. IMHO, all that discussion was
    about the shape of glyphs, not about changing characters.

    > I'd like to hear if my suggestion is completely weird or
    > if anybody else thinks it might be useful.

    One problem can be the nature of the code point which follows the "SFIS".

    Imagine that a stream, encoded in a certain character set, contains the byte
    0xBF and that this byte is undefined in that character set. Mapping the
    stream to Unicode, you convert 0xBF into a sequence of "SFIS" and U+00BF.
    Clearly, that U+00BF would just be a placeholder for the unknown byte, not
    an "INVERTED QUESTION MARK".
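
The mapping described above can be sketched in Python. This is only an
illustration under stated assumptions: U+FFF8 is the position suggested in
the proposal, not an assigned "SFIS" character, and the names SFIS,
convert_with_sfis and TOY_TABLE are hypothetical.

```python
import unicodedata
from typing import Dict

# Assumption: U+FFF8 is only the position suggested in the proposal;
# Unicode does not define a "source failure indicator symbol" there.
SFIS = "\uFFF8"


def convert_with_sfis(data: bytes, table: Dict[int, str]) -> str:
    """Map each byte through `table`; flag undefined bytes as SFIS + U+00XX."""
    out = []
    for byte in data:
        if byte in table:
            out.append(table[byte])
        else:
            # The byte value is reused as a code point, purely as a placeholder.
            out.append(SFIS + chr(byte))
    return "".join(out)


# Hypothetical toy charset in which only printable ASCII is defined.
TOY_TABLE = {b: chr(b) for b in range(0x20, 0x7F)}

result = convert_with_sfis(b"A\xbfB", TOY_TABLE)

# The placeholder after the SFIS is, as far as Unicode is concerned,
# a real character with its own name and semantics.
placeholder_name = unicodedata.name(result[2])
```

Here result is the four code points U+0041, U+FFF8, U+00BF, U+0042, and
placeholder_name is "INVERTED QUESTION MARK", which is exactly the
ambiguity at issue.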

    The problem is that interpreting U+00BF as anything different from an
    "INVERTED QUESTION MARK" violates Unicode Conformance Requirement C7: "A
    process shall interpret a coded character representation according to the
    character semantics established by this standard, if that process does
    interpret that coded character representation."

    Another problem, more practical, is that if the unrecognized byte is in
    the ranges 0x00..0x1F or 0x7F..0x9F, this would generate the code point
    of a Unicode control character, and this could have undesired effects.
    E.g., U+0000 is often a string terminator; U+001B could trigger
    unexpected escape sequences, etc.
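
To see which byte values would be affected, one can ask Python's
unicodedata module which of the 256 candidate placeholders fall in general
category Cc (control). This is a small sketch; the helper name
is_control_placeholder is hypothetical.

```python
import unicodedata


def is_control_placeholder(byte_value: int) -> bool:
    """True if reusing the byte value as a code point yields a control character."""
    return unicodedata.category(chr(byte_value)) == "Cc"


# All byte values whose U+00XX placeholder would be a control character.
control_bytes = [b for b in range(256) if is_control_placeholder(b)]

# The list covers 0x00..0x1F and 0x7F..0x9F, so it includes NUL (0x00) and
# ESC (0x1B) but not 0xBF.
nul_affected = 0x00 in control_bytes
esc_affected = 0x1B in control_bytes
bf_affected = 0xBF in control_bytes
```

A converter following this scheme would therefore need some other
placeholder for these 65 byte values, or would have to escape them.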

    _ Marco


