RE: New Character Proposal

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 30 2002 - 19:11:55 EST

    Dominikus Scherkl replied to Markus:

    > > > My other suggestion (and the main reason to call the proposed
    > > > character "source failure indicator symbol" (SFIS)) was intended
    > > > especially for malformed UTF-8 input that has overlong encodings.
    > > This is a special, custom form of error handling - why assign
    > > a character for it?
    > Converting to and from UTF-8 is an everyday task, very important
    > for all applications that handle Unicode. So it is a special
    > case, but a very common one.
    > Therefore it would be nice to have a standardized,
    > application-independent error handling for it. It is also a
    > mechanism useful for many other charsets being converted to
    > Unicode.

    I've got to agree with Markus here. Among other things, encoding
    a character which means "conversion failure occurred here" and
    then embedding it in converted text is just a generic and
    not very informative way of *representing* a conversion failure.
    The actual error handling would still end up being up to the
    application, every bit as much as what an application does
    today with a U+FFFD in Unicode text is application-specific.
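    Ken's point can be seen with any strict decoder. A minimal Python
    sketch (the byte values are made up for illustration):

    ```python
    # Bytes that are not valid UTF-8 (0xFF and 0xFE can never occur in UTF-8).
    data = b'abc\xff\xfedef'

    # Standard practice: each undecodable byte becomes U+FFFD REPLACEMENT CHARACTER.
    text = data.decode('utf-8', errors='replace')
    print(text)   # 'abc\ufffd\ufffddef'

    # What to *do* about the U+FFFD -- ignore it, log it, reject the whole
    # input -- is still entirely up to the application; the character itself
    # carries no recovery policy.
    ```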

    Adding this kind of character would then also complicate the
    task of people trying to figure out how to write converters,
    since they would then be scratching their heads to distinguish
    between cases which warrant use of U+FFFD and those which
    warrant this new SFIS instead. Maybe the distinction seems clear
    to you, but I suspect that in practice people will become
    confused about the distinctions, and there will be troubling
    edge cases.

    In the particular case of UTF-8, I would consider such a
    mechanism nothing more than an attempted end run around the
    tightened definition of UTF-8. It provides another path
    whereby ill-formed UTF-8 could get converted and then end
    up being interpreted by some process that doesn't know
    the difference. In other words, it carries the risk of
    reintroducing the security issue that we've been trying to
    get legislated away, by finding a way to make it "o.k." to
    interpret non-shortest UTF-8.
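    To illustrate the security problem with non-shortest-form UTF-8: the
    two-byte sequence C0 AF is an overlong encoding of U+002F '/', so a
    lenient decoder lets a "/" sneak past a byte-level filter that only
    checks for 0x2F. A small Python demonstration:

    ```python
    # C0 AF: overlong (non-shortest-form) encoding of U+002F SOLIDUS '/'.
    overlong_slash = b'\xc0\xaf'

    # A conformant decoder must reject it outright:
    try:
        overlong_slash.decode('utf-8')
    except UnicodeDecodeError as e:
        print('rejected:', e.reason)

    # A hand-rolled decoder that "helpfully" accepted overlong forms would
    # compute ((0xC0 & 0x1F) << 6) | (0xAF & 0x3F) == 0x2F and yield '/' --
    # exactly the path-traversal hole the tightened definition closes.
    assert ((0xC0 & 0x1F) << 6) | (0xAF & 0x3F) == 0x2F
    ```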

    > > You could just use an existing character or non-character for
    > > this, e.g., U+303E or U+FFFF or U+FDEF or similar.
    > This is what I do meanwhile. But it's uncomfortable, because
    > most editors display all noncharacters, unassigned characters,
    > and characters not in the font the same way - which hides
    > the INDICATION. The SFIS should be displayed to remind the reader
    > that only THIS is an SFIS, unlike all the other empty squares in
    > the text.

    Your suggested encoding U+FFF8 wouldn't work this way, by the
    way. U+FFF8 is reserved for format control characters -- and
    those characters display *invisibly* by default -- not as
    an empty square (or other fallback glyph) like miscellaneous
    symbols which happen not to be in your fonts.

    I think Markus's suggestion is correct. If you want to do
    something like this internally to a process, use a noncharacter
    code point for it. If you want to have visible display of this
    kind of error handling for conversion, then simply declare a
    convention for the use of an already existing character.
    My suggestion would be: U+2620. ;-) Then get people to share
    your convention.

    I'm not intending to be facetious here, by the way. One problem
    that character encoding runs into is that there are plenty
    of people with good ideas for encoding meanings or functions,
    and those ideas can end up turning into requests to encode
    some invented character just for that meaning or function.
    For example, I might decide that it was a good idea to have
    a symbol by which I could mark a following date string as
    indicating a death date--that would be handy for bibliographies
    and other reference works. Now I could come to the Unicode
    Consortium and ask for encoding of U+XXXX DEATH DATE SYMBOL,
    or I could instead discover that U+2020 DAGGER is already used
    in that meaning for some conventions. There are *plenty* of
    symbol characters available in Unicode -- way more than in
    any other character encoding standard. And it is a much
    lighter-weight process to establish a convention for use
    of an existing symbol character than it is to encode a new character
    specifically for that meaning/function and then force everyone
    to implement it as a new character.

    > Additionally, I think we should have a standardized way to display
    > old UTF-8 text without losing information (overlong UTF-8 was
    > allowed for years)

    Not really. And in any case, there is nothing to be gained
    here by "displaying old utf-8 text without losing information".
    The way to deal with that is to *filter* it into legal
    UTF-8 text, by means of an explicit process designed to
    recover what would otherwise be rejected as illegal data.
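    As a sketch of what such an explicit recovery filter might look like
    (an assumed example, not any standard mechanism): a deliberate Python
    pass that folds two-byte overlong sequences back to their shortest
    form before strict decoding.

    ```python
    def recover_overlong(data: bytes) -> str:
        """Deliberately recover two-byte overlong UTF-8 (lead bytes C0/C1),
        then decode strictly -- any other ill-formed data still fails."""
        out = bytearray()
        i = 0
        while i < len(data):
            b = data[i]
            if b in (0xC0, 0xC1) and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
                # Fold the overlong form back to the code point it denoted ...
                cp = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
                # ... and re-emit that code point in shortest (legal) form.
                out.extend(chr(cp).encode('utf-8'))
                i += 2
            else:
                out.append(b)
                i += 1
        return out.decode('utf-8')

    print(recover_overlong(b'A\xc0\xafB'))   # 'A/B'
    ```

    The point is that the recovery happens in a named, explicit step the
    application opts into, not silently inside a general-purpose decoder.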

    > - glyphing is not a fine way and simply
    > decoding the overlong forms is not allowed. This is a self-made
    > problem, so unicode should provide an inherent way to solve it.

    There are plenty of ways to solve these things -- by API design
    or by specialized conversions designed to deal with otherwise
    unrepresentable data. But trying to bake conversion error
    representation into the character encoding itself is, in
    my opinion, an error in itself.
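    Handling this by API design is straightforward in modern codec APIs.
    A hedged Python sketch using the codec error-handler registry (the
    handler name "sfis" and the U+2620 convention are our own, echoing
    the tongue-in-cheek suggestion above):

    ```python
    import codecs

    # The handler receives the UnicodeDecodeError and returns a replacement
    # string plus the position at which decoding should resume.
    def mark_failure(exc):
        if isinstance(exc, UnicodeDecodeError):
            return ('\u2620', exc.end)   # U+2620 SKULL AND CROSSBONES
        raise exc

    codecs.register_error('sfis', mark_failure)

    # The error-marking convention lives in the application, not the
    # character encoding:
    text = b'ok\xff!'.decode('utf-8', errors='sfis')
    print(text)   # 'ok\u2620!'
    ```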

    --Ken



    This archive was generated by hypermail 2.1.5 : Wed Oct 30 2002 - 19:50:41 EST