Re: New Charakter Proposal

From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Oct 30 2002 - 09:46:21 EST


    We had thought of something similar, but one that would provide more
    information in interfaces.

    Reserve a space of 256 code points, with names:

    UNCONVERTIBLE BYTE-00
    UNCONVERTIBLE BYTE-01
    ...
    UNCONVERTIBLE BYTE-FF

    During a conversion process, if some bytes (say, from corrupt UTF-8) cannot
    be correctly converted into code points, then a sequence of the above is
    generated, one code point per offending byte. This doesn't preserve the
    original text: you would never convert back from these code points to
    anything. It is really only useful ephemerally, in the process of doing a
    conversion where something goes wrong. It is really only a slightly more
    verbose U+FFFD REPLACEMENT CHARACTER, but it would be handy in certain
    conversion APIs, especially in single-code-point-at-a-time APIs like
    getChar().
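
    As a rough sketch of how such a conversion might look, here is a Python
    decoder error handler that emits one marker code point per bad byte. The
    block base U+F700 (Private Use Area), the handler name
    "unconvertible-byte", and the helpers decode_with_markers() and
    describe() are all assumptions made for illustration; the proposal
    itself assigns no code points.

```python
import codecs

# Hypothetical location: the proposal names 256 code points but does not
# place them anywhere, so a Private Use Area range stands in here.
UNCONVERTIBLE_BASE = 0xF700


def unconvertible_bytes(error):
    """Replace each undecodable byte with UNCONVERTIBLE_BASE + byte value."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    markers = "".join(chr(UNCONVERTIBLE_BASE + b) for b in bad)
    # Resume decoding right after the bytes that were reported as bad.
    return markers, error.end


codecs.register_error("unconvertible-byte", unconvertible_bytes)


def decode_with_markers(data: bytes) -> str:
    """Decode UTF-8, turning bad bytes into marker code points."""
    return data.decode("utf-8", errors="unconvertible-byte")


def describe(ch: str) -> str:
    """Name a marker code point in the style of the proposal."""
    cp = ord(ch)
    if UNCONVERTIBLE_BASE <= cp <= UNCONVERTIBLE_BASE + 0xFF:
        return "UNCONVERTIBLE BYTE-%02X" % (cp - UNCONVERTIBLE_BASE)
    return "U+%04X" % cp


if __name__ == "__main__":
    # One code point at a time, as a getChar()-style caller would see it.
    for ch in decode_with_markers(b"a\xffb"):
        print(describe(ch))
```

    A caller that reads one code point at a time can then tell which byte
    failed, instead of seeing only U+FFFD. Because this sketch borrows
    private-use code points, it cannot distinguish markers from genuine
    private-use characters in the input; a reserved block would not have
    that problem. Python's built-in "surrogateescape" handler takes a
    related approach, but it is designed for round-tripping, which is not
    the goal here.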

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Dominikus Scherkl" <Dominikus.Scherkl@glueckkanja.com>
    To: <unicode@unicode.org>
    Sent: Wednesday, October 30, 2002 03:49
    Subject: New Charakter Proposal

    > Hello.
    >
    > I would like to have a "source failure indicator symbol" (SFIS)
    > character in Unicode, which a charset-conversion unit may
    > insert into a text (suggested position: U+FFF8).
    >
    > Reason:
    > Several charsets have undefined code points that were
    > defined in an earlier or later version (e.g. overlong
    > UTF-8 encodings or the $ symbol (0x24) in the INVARIANT
    > charset).
    >
    > A converter can replace such symbols with U+FFFD (which is
    > correct but loses the information), or simply use the
    > character that is most likely intended (which hides the error).
    > Neither option is very good.
    >
    > The SFIS would allow the reader to see that an error occurred
    > and that the following character may therefore be incorrect, while
    > maintaining readability if the right conversion was made anyway
    > (or at least giving a hint as to which character may be intended;
    > e.g. the $ sign could have been any other currency symbol
    > if a national 7-bit charset was changed to INVARIANT by
    > previous conversions).
    >
    > Of course, a converter can still use U+FFFD if it has no
    > idea which character is intended or if Unicode doesn't contain
    > the character.
    >
    >
    > The whole "character identities" discussion gave me another
    > reason to introduce such an SFIS character:
    > A font renderer may show the SFIS before a character that
    > stands in for another one because the correct one is not
    > contained in the font (e.g. it may render an "a with
    > superscript e above" as SFIS + "a umlaut" to indicate the
    > error and show a probably fitting replacement, which is
    > much better than showing an empty square).
    > In short:
    > The SFIS may indicate a kind of compatibility decomposition
    > of the following character.
    > (This is not necessarily the standard compatibility decomposition.)
    >
    > I'd like to hear whether my suggestion is completely weird or
    > whether anybody else thinks it might be useful.
    >
    > Best Regards.
    > --
    > Dominikus Scherkl
    > dominikus.scherkl@glueckkanja.com
    >
    >


