RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 07 2004 - 19:34:38 CST


    Lars,

    I'm going to step in here, because this argument seems to
    be generating more heat than light.

    > I never said it doesn't violate any existing rules. Stating that it does,
    > doesn't help a bit. Rules can be changed.

    > I ask you to step back and try to see the big picture.

    First, I'm going to summarize what I think Lars Kristan is
    suggesting, to test whether my understanding of the proposal
    is correct or not.

    I do not think this is a proposal to amend UTF-8 to allow
    invalid sequences. So we should get that off the table.

    What I think this suggestion amounts to is adding 128 characters
    to represent byte values in conversion to Unicode, for cases where
    the byte values are uninterpretable as characters. Why 128 instead
    of 256 I find a little mysterious, but presumably the intent
    is to represent 0x80..0xFF as raw, uninterpreted byte values that
    are otherwise unconvertible to Unicode characters.

    This is suggested by Lars' use case of:

    > Storing UNIX filenames in a Windows database.

    ... since UNIX filenames are simply arrays of bytes, and cannot,
    on interconnected systems, necessarily be interpreted in terms
    of well-defined characters.

    Apparently Lars is currently using PUA U+E080..U+E0FF
    (or U+EE80..U+EEFF ?) for this purpose, enabling round-tripping
    of byte values that cannot be interpreted as characters, and
    is asking for standard Unicode code points for this purpose, instead.

    The other use case that Lars seems to be talking about is
    existing documents containing data corruptions, which can
    often happen when Latin-1 data gets dropped into UTF-8 data
    or vice versa due to mislabeled email or whatever.

    > So you would drop the data. There are only two options with current designs.
    > Dropping invalid sequences, or storing it separately (which probably means
    > the whole document is dead until manually decoded). Dropping invalid
    > sequences is actually a better choice. And would even be justifiable (but
    > still sometimes inconvenient) if we were living in world where everything is
    > in UTF-8. In a world, trying to transition from legacy encodings to Unicode,
    > there could be a lot of data lost and a lot of angry users.

    And I am assuming this refers primarily to the second case.
    The extreme scenario Lars is envisioning would be, for
    example, one where each point in a system was hyper-alert to
    invalid sequences and simply tossed or otherwise sequestered
    entire documents if they contained these kinds of data corruptions.
    In such a case, I can understand the concern about
    angry users. How many people on this list would be cursing if
    every bit of email that had a character set conversion error in
    it, resulting in some bit hash or other, simply got tossed in the
    bit bucket instead of being delivered with the glorious hash
    intact, at least giving you the chance to see if you could
    figure out what was intended?

    > A UTF-16 based program will only be able to process valid UTF-8
    > data. A UTF-8 based program will in many cases preserve invalid sequences
    > even without any effort. Let me guess, you will say it is a flaw in the
    > UTF-8 based program. If validation is desired, yes. But then I think you
    > would want all UTF-8 based programs to do that. That will not happen. What
    > will happen is that UTF-8 based programs will be better text editors
    > (because they will not lose data or constantly complain), while UTF-16 based
    > programs will produce cleaner data. You will opt for the latter.

    This is, I think, the basic point at which people are talking past
    each other.

    Notionally, Doug is correct that UTF-8 and UTF-16 are equivalent
    encoding forms, and anything represented (correctly) in one can
    be represented (correctly) in the other. In that sense, there is
    no difference between representation of text in UTF-8 or UTF-16,
    and no reason to postulate that a "UTF-8 based program" will have
    any advantages or disadvantages over a "UTF-16 based program" when
    it comes to dealing with corrupted data.

    What Lars is talking about is a broad class of UNIX-based software
    which is written to handle strings essentially as
    opaque bags of bytes, not caring what they contain for many
    purposes. Such software generally keeps working just fine if you
    pump UTF-8 at it, which is by design for UTF-8 -- precisely because
    UTF-8 leaves untouched all the 0x00..0x7F byte values that may
    have particular significance for those processes. Most of that
    software treats 0x80..0xFF just as bit hash from the get-go, and
    neither cares nor has any way of knowing if the particular
    sequence of bit hash is valid UTF-8 or Shift-JIS or Latin-1 or
    EUC-JIS or some mix or whatever.
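
    To make the "by design" point concrete, here is a small,
    throwaway Python check (mine, purely illustrative) of the
    property those byte-oriented tools depend on: no byte of a
    multi-byte UTF-8 sequence ever falls in the 0x00..0x7F range.

        # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so ASCII
        # bytes such as '/', NUL or LF can never appear inside the encoding
        # of some other character -- byte-bag software keeps working.
        for cp in range(0x80, 0x110000):
            if 0xD800 <= cp <= 0xDFFF:      # surrogate code points: not encodable
                continue
            for b in chr(cp).encode('utf-8'):
                assert b >= 0x80, hex(cp)
        print("0x00..0x7F never occur inside multi-byte UTF-8 sequences")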

    > And I for
    > the former. But will users know exactly what they've got? Will designers
    > know exactly what they're gonna get? This is where all this started. I
    > stated that there is an important difference between deciding for UTF-8 or
    > for UTF-16 (or UTF-32).

    This is where this is all getting derailed. Whatever the solutions
    for representation of corrupt data bytes or uninterpreted data
    bytes on conversion to Unicode may be, they are irrelevant to the
    question of whether an application uses UTF-8 or UTF-16
    or UTF-32.

    This has been like the Miðgarð (<== Latin-1 data corruption opportunity)
    Serpent, where things go around and around because the cosmic snake is
    holding its tail in its mouth. UTF-8 applications on UNIX are easy to
    write because they don't care about UTF-8 data corruption -- they
    keep working just fine. But then because such applications pass
    corrupted UTF-8 data around all the time, we have a legacy problem
    of ensuring the preservation of corrupted UTF-8 in documents. And
    furthermore, because such applications may corrupt data in other
    character encodings as well, we have to have means of preserving the
    data corruptions on conversion to UTF-8, so that we can roundtrip
    the data corruptions, as well as the data.

    That about it?

    > > Data stored in UTF-8 and UTF-16 and UTF-32 must remain completely
    > > interchangeable, from one encoding form to another. That is not
    > > negotiable.
    > (smiles) It should be.

    And here we apparently have the clash of conflicting worldviews.

    Unicode encoding forms represent code points. They are completely
    interconvertible, by *definition*, whether we are talking about
    encoded characters or unassigned code points.

    The UNIX world sees strings as uninterpreted byte streams, and
    sees the necessity of preserving the integrity of the byte
    stream, no matter what crazy process or user may have stuck
    some byte into the stream contrary to a charset definition.
    And it sees UTF-8 as simply one interpretation slapped on top
    of the fundamental construct of the byte stream.

    There's more to it, of course, but this is, I believe, at the
    bottom of the reason why, for 12 years now, people have been
    fundamentally misunderstanding each other about UTF-8.

    > Besides, surrogates are not completely interchangeable. Frankly, they are,
    > but do not need to be, right? Instead of using the PUA, I could have chosen
    > unpaired surrogates. But would risk that a UTF-16 validator drops them. The
    > 128 codepoints I am proposing definitely need to have a special status, like
    > the surrogates. And like I once said, UTF-16 got a big chunk of the BMP, and
    > a lot of exceptions. The same can be done for UTF-8. With only 128
    > codepoints.

    As stated, this sounds like nonsense to a Unicode standardizer.

    So let me try to restate it, and see if this is what Lars is
    actually after -- in a way that a Unicode standardizer could
    interpret.

    Say a process gets handed a "UTF-8" string that contains the
    byte sequence <61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94>.
                            ^^                               ^^

    The 93 and 94 are just corrupt data -- they cannot be interpreted
    as UTF-8, and may have been introduced by some process that
    screwed up smart quotes when mixing Code Page 1252 and UTF-8, for
    example. Interpreting the string, we have:

    <U+0061, U+0062, U+0063, ???, U+004D, U+0430, U+4E8C, U+10302, ???>
      
    Now *if* I am interpreting Lars correctly, he is using 128
    PUA code points to *validly* represent any such byte, so that
    it can be retained. If the range he is using is U+EE80..U+EEFF,
    then the string would be reinterpreted as:

    <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C, U+10302, U+EE94>

    which in UTF-8 would be the byte sequence:

    <61 62 63 EE BA 93 4D D0 B0 E4 BA 8C F0 90 8C 82 EE BA 94>
              ^^^^^^^^                               ^^^^^^^^

    This is now well-formed UTF-8, which anybody could deal with.
    And if you interpret U+EE93 as meaning "a placeholder for the
    uninterpreted or corrupt byte 0x93 in the original source",
    and so on, you could use this representation to exactly
    preserve the original information, including corruptions,
    which you could feed back out, byte-for-byte, if you reversed
    the conversion.
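
    To spell out the mechanics, here is a minimal sketch in Python
    of that round trip (not anything Lars has posted; the error
    handler name and the U+EE00 + byte mapping are just the
    interpretation above made executable):

        import codecs

        def ee_escape(exc):
            # Replace each undecodable byte 0xXX with the PUA code point U+EEXX.
            if not isinstance(exc, UnicodeDecodeError):
                raise exc
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(0xEE00 + b) for b in bad), exc.end

        codecs.register_error('ee-escape', ee_escape)

        raw = bytes.fromhex('61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94')
        text = raw.decode('utf-8', errors='ee-escape')
        # text == 'abc\uee93M\u0430\u4e8c\U00010302\uee94'

        well_formed = text.encode('utf-8')
        # == <61 62 63 EE BA 93 4D D0 B0 E4 BA 8C F0 90 8C 82 EE BA 94>

        # Reversing the conversion: fold U+EE80..U+EEFF back down to bytes.
        restored = bytearray()
        for ch in text:
            if 0xEE80 <= ord(ch) <= 0xEEFF:
                restored.append(ord(ch) - 0xEE00)
            else:
                restored.extend(ch.encode('utf-8'))
        assert bytes(restored) == raw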

    Stated this way, at least I think the Unicode standardizers
    can understand what the proposal is aiming at -- if I haven't
    grossly misinterpreted it myself.

    > If you can guarantee that all data will be valid Unicode, then there would
    > be no need for the latter. And it's not arbitrary. It is about small
    > portions of data within otherwise valid UTF-8 data. Those can be legacy
    > encoded filenames, someone mistakenly inserting Latin 1 into a UTF-8
    > document, transmission errors, whatever. I think preserving data should be
    > possible. Programs that explicitly need to have clean data can validate,
    > drop or whatever. It's about the choice. Currently there isn't one.

    This is the statement from Lars that leads me to my interpretation,
    by the way. I think it fits the intent of what he was after.

    > > Characters don't get moved from PUA to BMP unless UTC assigns them
    > > there.
    > Yes, that is what I meant.

    Understood, I think.

    > And why do you think <99 C9> would become U+E000 and U+E001?! It's U+E099
    > and U+E0C9.
    > And no, my solution does not interprete UTF-8 correctly. Why should it.
    > Codepoints used for the roundtrip area are not supposed to be valid. They
    > are again stored as invalid sequences.
    >
    > And, it's not E0, it's EE, if anyone cares.

    From which I derive the above interpretation.

    > > I assure you, nobody will reject this scheme on the basis that it had
    > > not been considered before.
    > I am not so sure. Although, I am afraid somebody would try to reject it
    > because IT HAS been considered before. But has not been explained well
    > enough.

    Actually, what was considered before was a proposal to encode
    characters for byte values 0x00..0xFF, and in a somewhat different
    context than described here.

    Now moving from interpretation to critique, I think it unlikely
    that the UTC would actually want to encode 128 such characters
    to represent byte values -- and the reasons would be similar to
    those adduced for rejecting the earlier proposal. Effectively,
    in either case, these are proposals for enabling representation
    of arbitrary, embedded binary data (byte streams) in plain text.
    And that concept is pretty fundamentally antithetical to the
    Unicode concept of plain text.

    The response is likely to be to simply find another way around
    the problem, without trying to define maintenance of roundtrip
    integrity of unconvertible, corrupt string data as a *plain
    text* requirement.

    Storing UNIX filenames in a Windows database, for example,
    can be done with BINARY fields, which correctly capture them
    for what they are: unconvertible arrays of
    byte values, not convertible strings in some particular
    code page.
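
    A minimal sketch of what that looks like, using sqlite3 merely
    as a stand-in for whatever database is actually involved (the
    "files" table and its column are hypothetical):

        import os
        import sqlite3

        # Store UNIX filenames as raw bytes in a BLOB (BINARY-style) column,
        # so no character conversion is attempted at all.
        db = sqlite3.connect(':memory:')
        db.execute('CREATE TABLE files (name BLOB PRIMARY KEY)')

        for entry in os.listdir(b'.'):     # byte form of the directory listing
            db.execute('INSERT OR IGNORE INTO files (name) VALUES (?)', (entry,))

        # Whatever encoding (if any) the names were in, the bytes round-trip.
        for (name,) in db.execute('SELECT name FROM files'):
            assert isinstance(name, bytes)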

    As for the data corruption problem, the issue is simply how
    to deal with:

    <U+0061, U+0062, U+0063, ???, U+004D, U+0430, U+4E8C, U+10302, ???>

    in such a way as to preserve the source identity of the "???"
    on conversion, rather than mapping everything to U+FFFD, which
    loses information on roundtripping.
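
    For instance (a throwaway Python illustration), decoding the
    byte sequence from the earlier example with the standard
    replacement character collapses both corrupt bytes into the
    same thing:

        raw = bytes.fromhex('61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94')
        print(raw.decode('utf-8', errors='replace'))
        # 'abc\ufffdM\u0430\u4e8c\U00010302\ufffd' -- 0x93 and 0x94 are now
        # indistinguishable, and the original bytes cannot be recovered.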

    Effectively, you just need a scheme for representing:

    "???(<0x93)" and "???(<0x94)"

    distinctly.

    In my opinion, trying to do that with a set of encoded characters
    (these 128 or something else) is *less* likely to solve the
    problem than using some visible markup convention instead.
    After all, that is what the various hex conventions already
    in use address, in part. In other words, I see little
    advantage to:

    <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C, U+10302, U+EE94>

    (PUA), or:

    <U+0061, U+0062, U+0063, U+XX93, U+004D, U+0430, U+4E8C, U+10302, U+XX94>

    (standard on BMP), over

    <U+0061, U+0062, U+0063, "=93", U+004D, U+0430, U+4E8C, U+10302, "=94">

    with whatever escape you need in place to deal with your escape
    convention itself. In either case, the essential problem is
    getting applications to universally support the convention
    for maintaining and interpreting the corrupt bytes. Simply
    encoding 128 characters in the Unicode Standard ostensibly to
    serve this purpose is no guarantee whatsoever that anyone would
    actually implement and support them in the universal way you
    envision, any more than they might a "=93", "=94" convention.
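
    If it helps to see what such a convention entails, here is a
    rough sketch (the details are mine, not any existing standard)
    of an "=XX" hex convention in Python, including the escape of
    the escape character itself:

        def escape_bytes(raw: bytes) -> str:
            """Decode the longest valid UTF-8 runs; mark stray bytes as =XX."""
            out, i = [], 0
            while i < len(raw):
                for j in range(len(raw), i, -1):
                    try:
                        chunk = raw[i:j].decode('utf-8')
                    except UnicodeDecodeError:
                        continue
                    out.append(chunk.replace('=', '=3D'))   # escape the escape
                    i = j
                    break
                else:
                    out.append('=%02X' % raw[i])            # corrupt byte, kept visible
                    i += 1
            return ''.join(out)

        def unescape_text(text: str) -> bytes:
            out, i = bytearray(), 0
            while i < len(text):
                if text[i] == '=' and i + 3 <= len(text):
                    out.append(int(text[i+1:i+3], 16))
                    i += 3
                else:
                    out.extend(text[i].encode('utf-8'))
                    i += 1
            return bytes(out)

        raw = bytes.fromhex('61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94')
        assert unescape_text(escape_bytes(raw)) == raw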

    --Ken


