RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Dec 08 2004 - 09:16:05 CST


    Kenneth Whistler wrote:
    > I'm going to step in here, because this argument seems to
    > be generating more heat than light.
    I agree, and I thank you for that.

    > First, I'm going to summarize what I think Lars Kristan is
    > suggesting, to test whether my understanding of the proposal
    > is correct or not.
    >
    > I do not think this is a proposal to amend UTF-8 to allow
    > invalid sequences. So we should get that off the table.
    At least until we all understand everything else about this issue.

    >
    > What I think this suggestion is is for adding 128 characters
    > to represent byte values in conversion to Unicode when the
    > byte values are uninterpretable as characters. Why 128 instead
    > of 256 I find a little mysterious, but presumably the intent
    > is to represent 0x80..0xFF as raw, uninterpreted byte values,
    > unconvertible to Unicode characters otherwise.
    Indeed, the full 256 codepoints could and perhaps even should be assigned
    for this purpose. The low 128 may in fact have a different purpose, and
    different handling. But I would delay this discussion also.

    >
    > This is suggested by Lars' use case of:
    >
    > > Storing UNIX filenames in a Windows database.
    >
    > ... since UNIX filenames are simply arrays of bytes, and cannot,
    > on interconnected systems, necessarily be interpreted in terms
    > of well-defined characters.
    >
    > Apparently Lars is currently using PUA U+E080..U+E0FF
    > (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping
    > of byte values uninterpretable as characters to be converted, and
    > is asking for standard Unicode values for this purpose, instead.
    Yes.
    And, yes, it's U+EE80..U+EEFF.
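
    To make that concrete: the conversion I currently use behaves roughly
    like the following sketch (Python only for brevity; the handler and
    function names are just for illustration):

        import codecs

        def _keep_byte(exc):
            # Every byte UTF-8 cannot interpret is >= 0x80; keep it as the
            # codepoint U+EE00 + byte, i.e. 0x93 becomes U+EE93.
            if isinstance(exc, UnicodeDecodeError):
                return chr(0xEE00 + exc.object[exc.start]), exc.start + 1
            raise exc

        codecs.register_error("keep-byte", _keep_byte)

        def bytes_to_text(data):
            # Valid UTF-8 decodes as usual; stray bytes land in U+EE80..U+EEFF.
            return data.decode("utf-8", errors="keep-byte")

        bytes_to_text(b"abc\x93def")    # -> 'abc\uee93def'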

    >
    > The other use case that Lars seems to be talking about are
    > existing documents containing data corruptions in them, which
    > can often happen when Latin-1 data gets dropped into UTF-8 data
    > or vice versa due to mislabeled email or whatever.
    Yes. One could argue that the need for the first use will gradually go
    away, which is why I also use this second example. Still, I think the
    first problem is underestimated: it is not limited to my example, it can
    have much more serious consequences, and it might not go away anytime
    soon.

    > And I am assuming this is referring primarily to the second case,
    > where the extreme scenario Lars is envisioning would be, for
    > example, where each point in a system was hyper-alert to
    > invalid sequences and simply tossed or otherwise sequestered
    > entire documents if they got these kinds of data corruptions
    > in them. And in such a case, I can understand the concern about
    > angry users. How many people on this list would be cursing if
    > every bit of email that had a character set conversion error in
    > it resulting in some bit hash or other, simply got tossed in the
    > bit bucket instead of being delivered with the glorious hash
    > intact, at least giving you the chance to see if you could
    > figure out what was intended?
    The two aspects of the problem are not always clearly distinct. But yes,
    let's say it's the second one.

    I had the need to solve the first problem, not the second one. So some of
    what I say about this second one is somewhat theoretical. But also
    realistic, I hope. Or fear.

    >
    > This is, I think the basic point at which people are talking past each
    > other.
    >
    > Notionally, Doug is correct that UTF-8 and UTF-16 are equivalent
    > encoding forms, and anything represented (correctly) in one can
    > be represented (correctly) in the other. In that sense, there is
    > no difference between representation of text in UTF-8 or UTF-16,
    > and no reason to postulate that a "UTF-8 based program" will have
    > any advantages or disadvantages over a "UTF-16 based program" when
    > it comes to dealing with corrupted data.
    >
    > What Lars is talking about is a broad class of UNIX-based software
    > which is written to handle strings essentially as
    > opaque bags of bytes, not caring what they contain for many
    > purposes. Such software generally keeps working just fine if you
    > pump UTF-8 at it, which is by design for UTF-8 -- precisely because
    > UTF-8 leaves untouched all the 0x00..0x7F byte values that may
    > have particular significance for those processes. Most of that
    > software treats 0x80..0xFF just as bit hash from the get-go, and
    > neither cares nor has any way of knowing if the particular
    > sequence of bit hash is valid UTF-8 or Shift-JIS or Latin-1 or
    > EUC-JIS or some mix or whatever.
    Yes. With a couple of additions.

    It is not true that most of that software doesn't care about the encoding.
    Copy or cat really don't need to, but 'more' does, in order to count lines
    properly (it needs to know how many glyphs, or whatever they are, will be
    output, so that it knows where the line breaks will occur). If it is told
    that the console will interpret the stream as UTF-8, then it must process
    the data accordingly.

    And it should do so as best it can. Dropping invalid sequences is
    questionable. You could say that they won't be printable anyway, so it
    doesn't matter whether you replace them with U+FFFD in the program or let
    the console do it before output. Perhaps that is true for 'more', since it
    makes little sense to use it for anything but output to a console. For
    other processing, dropping is not in line with UNIX concepts. And even for
    output to a console: if both 'more' and the console were enabled for
    invalid sequences, then it really doesn't matter much which of the two
    replaces them with the replacement codepoints (though if the pipe is
    8-bit, then 'more' really can't). And the output need not be just squares.
    It could be a square containing a tiny hex code, or a tiny Latin-1
    character (or even any other encoding, possibly even configurable, though
    this only works for SBCS).
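
    For example, a pager or console that understands the replacement
    codepoints could render them along these lines (just a sketch; the exact
    presentation is up to the implementation):

        def render(text):
            # Show each kept byte as a visible <hex> marker instead of a
            # bare U+FFFD box; a real console might draw a tiny hex square.
            return "".join(
                "<%02X>" % (ord(c) - 0xEE00) if 0xEE80 <= ord(c) <= 0xEEFF else c
                for c in text
            )

        render("abc\uee93def")      # -> 'abc<93>def'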

    >
    > > And I for the former. But will users know exactly what they've got?
    > > Will designers know exactly what they're gonna get? This is where all
    > > this started. I stated that there is an important difference between
    > > deciding for UTF-8 or for UTF-16 (or UTF-32).
    >
    > This is where this is all getting derailed. Whatever the solutions
    > for representation of corrupt data bytes or uninterpreted data
    > bytes on conversion to Unicode may be, that is irrelevant to the
    > concerns on whether an application is using UTF-8 or UTF-16
    > or UTF-32.
    The important fact is that if you have an 8-bit based program and you
    provide a locale to support UTF-8, you can keep things working (unless you
    prescribe validation). But you cannot achieve the same if you try to base
    your program on 16-bit or 32-bit strings. Or rather, you really cannot
    with 16-bit (UTF-16), and you sort of can with 32-bit (UTF-32), but then
    you must resort to values above 21 bits. Again, nothing is standardized
    there, nothing defines how functions like isspace should react, and so on.
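
    To illustrate what resorting to such values would mean in practice, here
    is a rough sketch, with one possible choice of values (0x110000 + byte,
    just outside the Unicode range); note that nothing like this is
    standardized, which is exactly the problem:

        def bytes_to_codes(data):
            # Decode UTF-8 into plain integers, pushing uninterpretable
            # bytes outside the Unicode code point range.
            codes, i = [], 0
            while i < len(data):
                for n in (1, 2, 3, 4):
                    try:
                        ch = data[i:i + n].decode("utf-8")
                    except UnicodeDecodeError:
                        continue
                    codes.append(ord(ch))
                    i += n
                    break
                else:
                    # No rules exist for these values: isspace() and friends
                    # have nothing defined to say about them.
                    codes.append(0x110000 + data[i])
                    i += 1
            return codes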

    It may seem at first that it can be achieved with UTF-16. Maybe, if that
    intermediate data never leaves the program. But that is rarely true. It is
    true for 'more', assuming pipes will remain 8-bit (and I have reasons to
    believe they will). But it is not true even for a simple text editor, unless
    it only allows saving in the same mode as it opened the file. I would expect
    a "save as UTF-16" capability, especially if it is based on UTF-16
    internally, right? Plus all the problems you would have with UTF-32.

    >
    > This has been like the Miðgarð (<== Latin-1 data corruption opportunity)
    > Serpent, where things go around and around because the cosmic snake is
    > holding its tail in its mouth. UTF-8 applications on UNIX are easy to
    > write because they don't care about UTF-8 data corruption -- they
    > keep working just fine. But then because such applications pass
    > corrupted UTF-8 data around all the time, we have a legacy problem
    > of ensuring the preservation of corrupted UTF-8 in documents. And
    > furthermore, because such applications may corrupt data in other
    > character encodings as well, we have to have means of preserving the
    > data corruptions on conversion to UTF-8, so that we can roundtrip
    > the data corruptions, as well as the data.
    >
    > That about it?
    I guess. And it's about the fact that it is far more likely that this
    happens to UTF-8 data (or that some legacy data is mistakenly labelled or
    assumed to be UTF-8).
    UTF-16 data is far cleaner than 8-bit data, basically because you had to
    know the encoding in order to store the data in UTF-16 in the first place.
    Now, you can say that this is precisely what I will break, so it will no
    longer be true. Not really: UTF-16 will not be mixed with arbitrary 16-bit
    data. All the chaos that would be introduced is limited to the 128 (or
    perhaps 256) codepoints. And that can be kept under control. That is my
    assumption, and I think one should prove otherwise, not simply state a
    fear that it cannot be done.

    >
    > > > Data stored in UTF-8 and UTF-16 and UTF-32 must remain completely
    > > > interchangeable, from one encoding form to another. That is not
    > > > negotiable.
    > > (smiles) It should be.
    >
    > And here we apparently have the clash of conflicting worldviews.
    >
    > Unicode encoding forms represent code points. They are completely
    > interconvertible, by *definition*, whether we are talking about
    > encoded characters or unassigned code points.
    Surrogates are not. There is a whole set of special rules for them, and
    the same would be needed for the codepoints I am proposing.

    >
    > The UNIX world sees strings as uninterpreted byte streams, and
    > sees the necessity of preserving the integrity of the byte
    > stream, no matter what crazy process or user may have stuck
    > some byte into the stream contrary to a charset definition.
    > And it sees UTF-8 as simply one interpretation slapped on top
    > of the fundamental construct of the byte stream.
    Is that good or bad? Even if one thinks it is bad, should we try to prevent
    it? Is it OK to obstruct it?

    The funny thing is, I am talking so much about UNIX that one would assume
    I am a UNIX developer, trying to solve problems I have on UNIX. But it's
    the other way around.

    UTF-8 is what solved the problems on UNIX. It allowed UNIX to process
    Windows data. Alongside its own.
    It is Windows that has problems now. And I think roundtripping is the
    solution that will allow Windows to process UNIX data. Without dropping data
    or raising exceptions. Alongside its own.

    >
    > There's more to it, of course, but this is, I believe, as the
    > bottom of the reason why, for 12 years now, people have been
    > fundamentally misunderstanding each other about UTF-8.
    Is it 12? I thought it was far less. Off topic: when was UTF-8 added to
    the Unicode Standard?

    >
    > > Besides, surrogates are not completely interchangeable. Frankly, they
    > > are, but do not need to be, right? Instead of using the PUA, I could
    > > have chosen unpaired surrogates. But would risk that a UTF-16
    > > validator drops them. The 128 codepoints I am proposing definitely
    > > need to have a special status, like the surrogates. And like I once
    > > said, UTF-16 got a big chunk of the BMP, and a lot of exceptions. The
    > > same can be done for UTF-8. With only 128 codepoints.
    >
    > As stated, this sounds like nonsense to a Unicode standardizer.
    Hence my pleas to think outside the box.

    >
    > So let me try to restate it, and see if this is what Lars is
    > actually after -- in a way that a Unicode standardizer could
    > interpret.
    >
    > Say a process gets handed a "UTF-8" string that contains the
    > byte sequence <61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94>.
    >                         ^^                               ^^
    >
    > The 93 and 94 are just corrupt data -- it cannot be interpreted
    > as UTF-8, and may have been introduced by some process that
    > screwed up smart quotes from Code Page 1252 and UTF-8, for
    > example. Interpreting the string, we have:
    >
    > <U+0061, U+0062, U+0063, ???, U+004D, U+0430, U+4E8C, U+10302, ???>
    >
    > Now *if* I am interpreting Lars correctly, he is using 128
    > PUA code points to *validly* contain any such byte, so that
    > it can be retained. If the range he is using is U+EE80..U+EEFF,
    > then the string would be reinterpreted as:
    >
    > <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C,
    > U+10302, U+EE94>
    >
    > which in UTF-8 would be the byte sequence:
    >
    > <61 62 63 EE BA 93 4D D0 B0 E4 BA 8C F0 90 8C 82 EE BA 94>
    >           ^^^^^^^^                               ^^^^^^^^
    >
    > This is now well-formed UTF-8, which anybody could deal with.
    > And if you interpret U+EE93 as meaning "a placeholder for the
    > uninterpreted or corrupt byte 0x93 in the original source",
    > and so on, you could use this representation to exactly
    > preserve the original information, including corruptions,
    > which you could feed back out, byte-for-byte, if you reversed
    > the conversion.
    >
    > Stated this way, at least I think the Unicode standardizers
    > can understand what the proposal is aiming at -- if I haven't
    > grossly misinterpreted it myself.

    Quite close. Except for the fact that:
    * U+EE93 is represented in UTF-32 as 0x0000EE93
    * U+EE93 is represented in UTF-16 as 0xEE93
    * U+EE93 is represented in UTF-8 as 0x93 (_NOT_ 0xEE 0xBA 0x93)

    Which could be understood as "a proposal to amend UTF-8 to allow invalid
    sequences".
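
    In code, the conversion back out that I am after would look roughly like
    this (a sketch, continuing the one earlier in this mail; the function name
    is again just for illustration):

        def text_to_bytes(text):
            out = bytearray()
            for ch in text:
                cp = ord(ch)
                if 0xEE80 <= cp <= 0xEEFF:
                    out.append(cp - 0xEE00)         # emit 0x93, not EE BA 93
                else:
                    out.extend(ch.encode("utf-8"))
            return bytes(out)

        # Caveat: a genuine U+EE93 that was already in the text collapses to
        # 0x93 as well -- which is why these codepoints need a special status.
        text_to_bytes("abc\uee93def")   # -> b'abc\x93def'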

    Suppose a range from the BMP is assigned for this purpose. Existing
    conversion functions will no longer be conformant. That's not very nice,
    but it can be solved. And it is not unlikely that further research will
    show that the two conversions CAN coexist: one for programs that want to
    validate, and one for programs that do not want to validate. And, yes,
    what you wrote can be understood in the same way, except that I favor
    undoing the conversion over applying the now-standard one.

    Suppose unpaired surrogates are in fact legalized for this purpose. That
    would probably clash with other quasi-standards outside the Unicode
    standard (I forget what that UCS-2 to UTF-8-like conversion is referred to
    as). Also, existing validators could reject the unpaired surrogates, and
    there is a risk that concatenation would produce valid-looking surrogate
    pairs. Although perhaps this could be solved by filtering out unpaired
    high surrogates and only allowing unpaired low ones.
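
    For comparison, the unpaired-low-surrogate variant would be a sketch like
    this, mapping to U+DC80..U+DCFF instead of the PUA:

        def byte_to_low_surrogate(b):
            return chr(0xDC00 + b)              # 0x93 -> U+DC93, unpaired low

        def low_surrogate_to_byte(ch):
            cp = ord(ch)
            if 0xDC80 <= cp <= 0xDCFF:
                return cp - 0xDC00
            raise ValueError("not a kept byte")

        # Note: chr(0xDC93).encode("utf-8") raises UnicodeEncodeError --
        # exactly the kind of rejection by validators mentioned above.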

    One advantage of using codepoints over using unpaired surrogates is that
    the replacement codepoints can easily be identified. One can alert the
    user, mark them red, add flyouts, replace them with U+FFFD, or reject such
    data unconditionally. There are use cases for each of those. And like I
    said, it can be done when the data is retrieved, while with the current
    approach it can only be done when the data is being stored, with fewer
    options to choose from.
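
    For instance, a retrieving application could locate the replacement
    codepoints with something as simple as this (sketch only) and then apply
    whichever policy it prefers:

        def find_kept_bytes(text):
            # Positions and values of the kept bytes; trivial to locate
            # because they occupy one well-known range of codepoints.
            return [(i, ord(c) - 0xEE00)
                    for i, c in enumerate(text) if 0xEE80 <= ord(c) <= 0xEEFF]

        # The caller can then warn, highlight, substitute U+FFFD, or reject.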

    >
    > Actually, what was considered before was a proposal to encode
    > characters for byte values 0x00..0xFF, and in a somewhat different
    > context than described here.
    >
    > Now moving from interpretation to critique, I think it unlikely
    > that the UTC would actually want to encode 128 such characters
    > to represent byte values -- and the reasons would be similar to
    > those adduced for rejecting the earlier proposal. Effectively,
    > in either case, these are proposals for enabling representation
    > of arbitrary, embedded binary data (byte streams) in plain text.
    > And that concept is pretty fundamentally antithetical to the
    > Unicode concept of plain text.
    From the perspective of plain text, yes, roundtripping invalid sequences
    in UTF-8 has nothing to do with it. It would be great if there were no
    need for it. Storing arbitrary binary data would then be just a proposal
    for a thing that doesn't belong in Unicode. The proposed codepoints for
    roundtripping can indeed be misused for (or misinterpreted as) storing
    binary data. But that fact does not constitute an argument against them.

    >
    > The response is likely to be to simply find another way around
    > the problem, without trying to define maintenance of roundtrip
    > integrity of unconvertible, corrupt string data as a *plain
    > text* requirement.
    If the purpose of Unicode is to define bricks for plain text, then what
    the hell are the surrogates doing in there? Why not tell developers to
    find another way around the problem?

    >
    > Storage of UNIX filenames on Windows databases, for example,
    > can be done with BINARY fields, which correctly capture the
    > identity of them as what they are: an unconvertible array of
    > byte values, not a convertible string in some particular
    > code page.
    Sigh. Storing is just a start. Windows filenames are also stored in the same
    database. And eventually, you need to have data from both of them in the
    same output. Or, for example, one might want to compare filenames from one
    platform with the filenames from the other. All this is impossible in
    UTF-16.

    Again, it IS possible with UTF-8, assuming no validation is enforced, and
    with some additional caution. So, I can do it in UTF-8 and I need all sorts
    of workarounds in UTF-16. From a viewpoint of a Unicode standardizer,
    they're equally suitable. From my viewpoint, they are not. Do we really live
    in two different worlds? Should we?
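
    A tiny example of what I mean, reusing the 'keep-byte' sketch from earlier
    in this mail (the file names are made up, of course):

        # A UNIX name with a stray CP1252 byte, and a Windows name that is
        # proper Unicode text.
        unix_name = b"report\x93final.txt".decode("utf-8", errors="keep-byte")
        windows_name = "report\u201cfinal.txt"
        # Both can now sit in the same text column, be sorted, displayed and
        # compared; they simply compare unequal, as they should.
        print(unix_name == windows_name)    # False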

    >
    > As for the data corruption problem, the issue is simply how
    > to deal with:
    >
    > <U+0061, U+0062, U+0063, ???, U+004D, U+0430, U+4E8C, U+10302, ???>
    >
    > in such a way as to preserve the source identity of the "???"
    > on conversion, rather than mapping everything to U+FFFD, which
    > loses information on roundtripping.
    >
    > Effectively, you just need a scheme for representing:
    >
    > "???(<0x93)" and "???(<0x94)"
    >
    > distinctly.
    >
    > In my opinion, trying to do that with a set of encoded characters
    > (these 128 or something else) is *less* likely to solve the
    > problem than using some visible markup convention instead.
    > After all, that is what the various hex conventions already
    > in use address, in part. In other words, I see little
    > advantage to:
    >
    > <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C,
    > U+10302, U+EE94>
    >
    > (PUA), or:
    >
    > <U+0061, U+0062, U+0063, U+XX93, U+004D, U+0430, U+4E8C,
    > U+10302, U+XX94>
    >
    > (standard on BMP), over
    >
    > <U+0061, U+0062, U+0063, "=93", U+004D, U+0430, U+4E8C,
    > U+10302, "=94">
    >
    > with whatever escape you need in place to deal with your escape
    > convention itself.
    First, I am glad you are not proposing this approach for my problem. There
    is a concern with size there, which is why I used the PUA in the BMP and
    not the other one (although that would perhaps be safer). And that is why
    I am speaking of defining these codepoints in the BMP.

    OK, let's take a look at escaping. It works fine if there are few errors
    and if the intent is to read a document. A self-descriptive escape would
    then be suitable.

    It wouldn't work well if there are many errors. The text would lose its
    original form: on your screen, instead of three paragraphs, you would get
    perhaps half of one. The thing becomes unreadable, and the readable parts
    get lost in the escapes (look at how much damage the line breaks and '>'
    marks cause when replying to mails).

    Once escaped, the text will probably remain escaped. This is in line with
    what you expected: that I convert to codepoints and then represent them
    'properly' in UTF-8. But that is not roundtripping.

    But un-escaping would be a dangerous thing, even if it were standardized.
    Any existing text can contain those escape sequences; the simpler the
    escape, the more likely. Using codepoints that were not assigned before,
    on the other hand, has an implicit advantage here.
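
    A trivial illustration of the collision (the '=hex' pattern is just an
    example):

        import re

        ESCAPE = re.compile(r"=([0-9A-Fa-f]{2})")

        # Nothing in this string was ever corrupted, yet a blanket un-escape
        # would still find two "corrupt bytes" in it.
        ESCAPE.findall("width=93 height=21")    # -> ['93', '21']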

    Last but not least: if a certain approach solves two problems at once,
    then one should raise an eyebrow rather than try to find ways to solve
    each problem separately. And the 'single approach' solution also has a
    better chance of being accepted, especially if most of it is contained in
    the conversion, rather than requiring a lot of design and development
    effort.

    > In either case, the essential problem is getting applications to
    > universally support the convention for maintaining and interpreting the
    > corrupt bytes. Simply
    > encoding 128 characters in the Unicode Standard ostensibly to
    > serve this purpose is no guarantee whatsoever that anyone would
    > actually implement and support them in the universal way you
    > envision, any more than they might a "=93", "=94" convention.
    Are you really saying that whatever is standardized has no better chance of
    being used than anything else? Can this really be used as a
    counter-argument?

    Lars


