RE: New Charakter Proposal

From: Dominikus Scherkl
Date: Wed Oct 30 2002 - 10:46:36 EST

    John Cowan wrote:
    > This sounds basically like an extension of U+303E IDEOGRAPHIC
    > VARIATION INDICATOR (whose semantic is: "The following character
    > is not what I want, but it's the best approximation I can get")
    > to non-ideographs.
    > I have no problem with this idea.

    So you mean: use U+303E + 'ä' to indicate that you would prefer
    the Old German form of that character if the font contains it?

    But that's no solution, because then you could directly use
    "a with superscript e above".

    What I had in mind was a mechanism to display a character not
    found in the font by another character, together with an indicator
    that shows the reader that it's not the real character but a
    replacement.
    But that's font technology and therefore off topic.
    Please forget about that.

    My other suggestion (and the main reason to call the proposed
    character "source failure indicator symbol" (SFIS)) was intended
    especially for malformed UTF-8 input that has overlong encodings.
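
For readers unfamiliar with the term, here is a small Python sketch of what an overlong encoding looks like. The helper name `payload_of_two_byte_sequence` is made up for this illustration, and the byte pair 0xC0 0xAF is just one commonly cited example of an overlong form of '/'.

```python
def payload_of_two_byte_sequence(seq: bytes) -> int:
    """Extract the code point bits from a 110xxxxx 10xxxxxx byte pair."""
    lead, cont = seq
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


shortest = "/".encode("utf-8")  # the one well-formed encoding of U+002F
overlong = b"\xc0\xaf"          # the same payload padded into two bytes

print(shortest)                                     # b'/'
print(chr(payload_of_two_byte_sequence(overlong)))  # /

# A strict decoder refuses the padded form even though the payload is clear.
try:
    overlong.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)  # True
```

The payload bits are unambiguous, which is why a converter can tell which character was meant even though it has to treat the input as an error.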

    In this special case a converter knows exactly which character is
    intended, but needs to signal an error to avoid ambiguities.
    At present it MUST replace the overlong sequence with U+FFFD
    (or even cancel the conversion!).
    But I think SFIS + intended-char is a far better approach,
    because it
    1) warns the reader AND keeps the text readable
    2) distinguishes overlong encodings from illegal character sequences.
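
A rough Python sketch of how such a converter might behave follows. It is only an illustration under stated assumptions: no SFIS code point exists, so `SFIS` below is a stand-in taken from the Private Use Area, and `decode_with_sfis` and `sequence_length` are hypothetical names. Existing conformance rules require U+FFFD or an error here, so this shows the proposed behaviour, not a conforming decoder.

```python
SFIS = "\uE000"         # hypothetical stand-in (Private Use), not an assigned SFIS
REPLACEMENT = "\uFFFD"  # what converters emit today for any malformed input

# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def sequence_length(lead: int) -> int:
    """Length announced by a lead byte, or 0 if the byte cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # continuation byte or 0xF8..0xFF


def decode_with_sfis(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        tail = data[i + 1:i + n]
        # Illegal: stray byte, truncated sequence, or a bad continuation byte.
        if n == 0 or len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            out.append(REPLACEMENT)
            i += 1
            continue
        # Assemble the payload bits of the sequence.
        cp = data[i] if n == 1 else data[i] & (0xFF >> (n + 1))
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            out.append(REPLACEMENT)      # not a Unicode scalar value
        elif cp < MIN_FOR_LENGTH[n]:
            out.append(SFIS + chr(cp))   # overlong: flag it, keep it readable
        else:
            out.append(chr(cp))
        i += n
    return "".join(out)
```

With this sketch, `decode_with_sfis(b"a\xc0\xafb")` yields 'a', the stand-in SFIS, '/', 'b', while `decode_with_sfis(b"a\xffb")` yields 'a', U+FFFD, 'b', so the two kinds of failure stay distinguishable in the output.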

    The second point especially is of security interest, because
    overlong sequences are unlikely to occur unless introduced
    intentionally (by an old and buggy encoder or by an attack), while
    illegal sequences are almost always accidental (a cut stream, a bit
    error or input that is not UTF-8 at all).

    For other source charsets this might also be useful but may
    cause problems - I have not really thought this through in detail.
    But I think there are charsets which differ from others
    only in that they left several code points undefined while newer
    versions define them (e.g. the euro symbol).
    If there is a high probability that a specific character is
    intended, the SFIS mechanism is advantageous, I think.
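
To make the charset case concrete, here is a tentative Python sketch. It assumes, purely for illustration, a source labelled with an older revision of a single-byte charset in which byte 0x80 was unassigned, and it uses Python's "cp1252" codec as the newer revision that maps 0x80 to the euro sign. `SFIS` is again a Private Use stand-in, and `decode_old_revision` and `UNDEFINED_IN_OLD_REVISION` are hypothetical names.

```python
SFIS = "\uE000"  # hypothetical stand-in (Private Use), not an assigned SFIS

# Assumption for illustration: bytes the older revision left unassigned.
UNDEFINED_IN_OLD_REVISION = {0x80}


def decode_old_revision(data: bytes) -> str:
    out = []
    for byte in data:
        # Look the byte up in the newer revision of the charset.
        ch = bytes([byte]).decode("cp1252", errors="replace")
        if byte in UNDEFINED_IN_OLD_REVISION and ch != "\uFFFD":
            out.append(SFIS + ch)  # probable intent is known: flag it, keep it
        else:
            out.append(ch)         # ordinary character, or U+FFFD if still undefined
    return "".join(out)


print(decode_old_revision(b"10 \x80") == "10 " + SFIS + "\u20ac")  # True
```

Bytes that remain undefined in the newer revision as well (0x81 in Python's cp1252 table, for instance) still come out as U+FFFD, since no probable character is known for them.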

    Best Regards.

    Dominikus Scherkl
