Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Doug Ewell (
Date: Wed Dec 08 2004 - 01:39:15 CST

    Lars Kristan wrote:

    > I never said it doesn't violate any existing rules. Stating that it
    > does, doesn't help a bit. Rules can be changed. Assuming we understand
    > the consequences. And that is what we should be discussing. By stating
    > what should be allowed and what should be prohibited you are again
    > defending those rules. I agree, rules should be defended, but only up
    > to a certain point. Simply finding a rule that is offended is not
    > enough to prove something is bad or useless.

    In my opinion, these are rules that should not be broken or changed, NOT
    because changing the rules is inherently bad but because these
    particular changes would cause more problems than they would solve. In
    my opinion.

    > Defining Unicode as the world of codepoints is a complex task on its
    > own. It seems that you are afraid of stepping out of this world, since
    > you do not know what awaits you there. So, it is easier to find an
    > excuse within existing rules, especially if a proposed change
    > threatens to shake everything right down to the foundation. If I would
    > be dealing with Unicode (as we know it), I would probably be doing the
    > same thing. I ask you to step back and try to see the big picture.

    My objection to this has nothing to do with being some kind of
    conservative fuddy-duddy who is afraid to think outside the box.

    >> Do you have a use case for this?
    > Yes, I definitely have. I am the one accusing you of living in a
    > perfect world, remember?

    Yes, I remember. Thank you.

    > Do you think I would do that if I wasn't dealing with this problem in
    > real life?

    The problem seems to be that you have file names in a Unix or Unix-like
    file system, where names are stored as uninterpreted bytes (thanks to
    everyone who pointed this out; I have learned something), and these
    bytes need to remain valid if the locale specifies UTF-8 and the bytes
    don't make a valid UTF-8 sequence. Right?

    How do file names work when the user changes from one SBCS to another
    (let's ignore UTF-8 for now) where the interpretation is different? For
    example, byte C3 is U+00C3, A with tilde (Ã) in ISO 8859-1, but U+0102,
    A with breve (Ă) in ISO 8859-2. If a file name contains byte C3, is its
    name different depending on the current locale? Is it accessible in all
    locales? (Not every SBCS defines a character at every code point.
    There's no C3 in ISO 8859-3, for example.)
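
    To make the byte C3 example concrete, here is a minimal sketch. It
    assumes Python 3, whose built-in codec tables stand in for the
    locale's charset; the helper name interpret is hypothetical.

```python
# Decode the single byte C3 under three parts of ISO 8859.
# Assumption: Python 3's bundled codec tables stand in for "the locale".
RAW = b"\xc3"


def interpret(raw: bytes, charset: str) -> str:
    """Return the decoded text, or a marker if a byte has no mapping."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return "<undefined>"


latin1 = interpret(RAW, "iso-8859-1")  # U+00C3, A with tilde
latin2 = interpret(RAW, "iso-8859-2")  # U+0102, A with breve
latin3 = interpret(RAW, "iso-8859-3")  # no character assigned at C3

for part, text in (("8859-1", latin1), ("8859-2", latin2), ("8859-3", latin3)):
    print(part, text)
```

    The same stored byte yields two different characters and one decoding
    failure, which is the situation the questions above are asking about.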

    Does this work with MBCS other than UTF-8? I know you said other MBCS,
    like Shift-JIS, are not often used alongside other encodings except
    ASCII, but we can't guarantee that since we're not in a perfect world.
    :-) What if they were?

    If you have a UTF-8 locale, and file names that contain invalid UTF-8
    sequences, how would you address those files in a locale-aware way?
    This is similar to the question about the file with byte C3, which is Ã
    in one locale, Ă in another, and an unassigned code point in a third.

    > It is the current design that is unfair. A UTF-16 based program will
    > only be able to process valid UTF-8 data. A UTF-8 based program will
    > in many cases preserve invalid sequences even without any effort.

    I fear Ken is not correct when he says you are not arguing for the
    legalization of invalid UTF-8 sequences.

    > Let me guess, you will say it is a flaw in the UTF-8 based program.

    Good guess. Unicode and ISO/IEC 10646 say it is, and I say it is.

    > If validation is desired, yes. But then I think you would want all
    > UTF-8 based programs to do that. That will not happen. What will
    > happen is that UTF-8 based programs will be better text editors
    > (because they will not lose data or constantly complain), while UTF-16
    > based programs will produce cleaner data. You will opt for the latter.
    > And I for the former. But will users know exactly what they've got?
    > Will designers know exactly what they're gonna get? This is where all
    > this started. I stated that there is an important difference between
    > deciding for UTF-8 or for UTF-16 (or UTF-32).

    This isn't about UTF-8 versus other encoding forms. UTF-8-based
    programs will reject these invalid sequences because they don't map to
    code points, and because they are supposed to reject them.

    > BTW, you have mixed up source and target. Or I don't understand what
    > you're trying to say.

    You are right. I spoke of translating German to French, when the
    example was about going the other way. I made a mistake.

    > Besides, surrogates are not completely interchangeable. Frankly, they
    > are, but do not need to be, right?

    They are not completely interchangeable. In UTF-8 and UTF-32,
    surrogates are not allowed at all. In UTF-16, they may only occur in
    the proper context: a high surrogate may only occur immediately before
    a low surrogate, and a low surrogate may only occur immediately after a
    high surrogate. No other usage of surrogates is permitted, because if
    unpaired surrogates could be interpreted, the interpretation would be
    ambiguous.
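
    The pairing rule can be sketched as a small check over 16-bit code
    units. This is an illustration only, assuming Python 3; the function
    names are hypothetical and not taken from any library.

```python
def is_well_formed_utf16(units) -> bool:
    """Check a sequence of 16-bit code units for correct surrogate pairing."""
    expecting_low = False
    for unit in units:
        is_high = 0xD800 <= unit <= 0xDBFF
        is_low = 0xDC00 <= unit <= 0xDFFF
        if expecting_low:
            # A high surrogate must be followed at once by a low surrogate.
            if not is_low:
                return False
            expecting_low = False
        elif is_low:
            # A low surrogate with no high surrogate before it.
            return False
        elif is_high:
            expecting_low = True
    # A trailing high surrogate is also unpaired.
    return not expecting_low


def utf8_accepts(text: str) -> bool:
    """Report whether Python's strict UTF-8 encoder accepts the string."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


paired = is_well_formed_utf16([0xD835, 0xDD04])
lone_high = is_well_formed_utf16([0x0041, 0xD835])
reversed_pair = is_well_formed_utf16([0xDD04, 0xD835])
surrogate_in_utf8 = utf8_accepts("\ud800")
```

    Only the first sequence is well formed. The last call shows a strict
    UTF-8 encoder refusing a surrogate code point outright.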

    > Instead of using the PUA, I could have chosen unpaired surrogates. But
    > would risk that a UTF-16 validator drops them. The 128 codepoints I am
    > proposing definitely need to have a special status, like the
    > surrogates. And like I once said, UTF-16 got a big chunk of the BMP,
    > and a lot of exceptions. The same can be done for UTF-8. With only 128
    > codepoints.

    Whether you choose an illegal UTF-8 sequence or an illegal UTF-16
    sequence, the result is the same: a conformant process will not allow
    it.

    >> Well, yes. Doesn't the file system dictate what encoding it uses
    >> for file names?
    > No, it doesn't.

    I admit my error with regard to the handling of file names by Unix-style
    file systems, and I appreciate being set straight.

    >> Unicode is a standard for character encoding. It is not, *and should
    >> not be*, a standard for storing arbitrary binary data.
    > If you can guarantee that all data will be valid Unicode, then there
    > would be no need for the latter. And it's not arbitrary. It is about
    > small portions of data within otherwise valid UTF-8 data. Those can be
    > legacy encoded filenames, someone mistakenly inserting Latin 1 into a
    > UTF-8 document, transmission errors, whatever. I think preserving data
    > should be possible. Programs that explicitly need to have clean data
    > can validate, drop or whatever. It's about the choice. Currently there
    > isn't one.

    I think preserving transmission errors is carrying things too far. Your
    Unix file system already doesn't guarantee that; if a byte gets changed
    to 00 or 2F, you will have problems.

    As for "someone mistakenly inserting Latin 1 into a UTF-8 document,"
    this seems improbable. If the user is editing a UTF-8 document, she is
    probably using a UTF-8 editor, which loads and saves code points as
    UTF-8. The user doesn't enter the raw UTF-8 bytes <C9 99>; she enters
    the character <ə> and the editor takes care of the nuts and bolts of
    saving this as UTF-8.
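
    As a small illustration of that division of labor, assuming Python 3
    as a stand-in for the editor's save routine:

```python
# The user supplies a character; the encoder produces the bytes.
schwa = "\u0259"  # LATIN SMALL LETTER SCHWA, as typed by the user
saved = schwa.encode("utf-8")

# Show the bytes that end up on disk as uppercase hex pairs.
print(" ".join(f"{byte:02X}" for byte in saved))

# Loading the file reverses the step and returns the same character.
loaded = saved.decode("utf-8")
```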

    On the other hand, if the user is typing UTF-8 bytes directly into a
    non-UTF-8-aware editor, then of course anything is possible. But that
    seems like a bad way to live.

    > What data loss? Just a file with some Latin 1 characters. Anybody who
    > understands the language can quickly guess the encoding that must be
    > selected in order to display the file properly. Or convert it to
    > Unicode. What I am saying is that you need to assume an automated
    > process. And that you need to assume that nobody has the time to
    > supervise it.

    Now we're getting somewhere. We are no longer talking about a
    mysterious, unknown encoding in arbitrary text, but about file names
    known to be in Latin-1 instead of UTF-8. If the security risk is
    determined to be low, you *may* be able to get away with interpreting
    invalid UTF-8 as Latin-1. But in that case, the bytes need to be
    converted to real Unicode characters in the range U+0080..U+00FF, not to
    PUA characters, and they must not be written back as invalid UTF-8.
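
    A minimal sketch of that fallback, assuming Python 3 and assuming the
    names really are known to be Latin-1 whenever they fail UTF-8
    validation. The function names are hypothetical.

```python
def decode_name(raw: bytes) -> str:
    """Decode a file name as UTF-8, falling back to Latin-1 if invalid.

    Assumption: any name that is not valid UTF-8 is known to be Latin-1.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte 80..FF becomes the real character U+0080..U+00FF.
        return raw.decode("iso-8859-1")


def normalize_name(raw: bytes) -> bytes:
    """Return the name re-encoded as valid UTF-8, never as the raw bytes."""
    return decode_name(raw).encode("utf-8")


legacy = b"caf\xe9"          # "cafe" with e-acute, stored as Latin-1
already_utf8 = b"caf\xc3\xa9"  # the same name, stored as UTF-8

legacy_text = decode_name(legacy)
legacy_fixed = normalize_name(legacy)
unchanged = normalize_name(already_utf8)
```

    Both inputs end up as the same valid UTF-8 name, so the conversion is
    deliberately one-way: the original Latin-1 bytes are not written back.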

    >> You think the "invalid UTF-8" scheme has fewer security consequences
    >> than using the PUA?
    > I think you still don't understand my scheme.

    Maybe not. I think your scheme involves converting invalid UTF-8 to PUA
    code points in UTF-16, and back to invalid UTF-8. I'm saying the PUA
    part is sensible, and the invalid-UTF-8 part is not. (I know... only if
    I'm afraid to break some eggs...)

    >> If I convert two different invalid UTF-8 sequences to the same
    >> Unicode code point (U+FFFD), or otherwise raise the same error
    >> condition for both, as directed by conformance clause C12a, then
    >> this is a serious security issue with my design. Hmm, yes, I can
    >> see that.
    > Was that sarcastic or.....

    Yes, and I apologize for that. But I disagree that following the
    standard -- even if you think it is flawed -- constitutes a "serious
    security issue" in my design.
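
    For reference, the behavior under discussion looks like this with
    Python 3's built-in decoder, used here only as one example of a
    decoder that substitutes U+FFFD:

```python
# Two different ill-formed inputs: a lone continuation byte and a lead
# byte with nothing after it.
first = b"\x99".decode("utf-8", errors="replace")
second = b"\xc9".decode("utf-8", errors="replace")

# Both collapse to the same replacement character, U+FFFD.
same_result = first == second


def strict_rejects(raw: bytes) -> bool:
    """Report whether strict decoding raises the error condition."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


both_rejected = strict_rejects(b"\x99") and strict_rejects(b"\xc9")
```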

    > Programmers are struggling to _support_ Unicode. You can't
    > realistically expect that they will now also validate all data. They
    > won't even know where to start and where to end. Typically only
    > conversion performs some kind of validation (sometimes only
    > implicitly). Let me simply say that strict validation is the
    > difference between ideal world and real world. And this validation
    > will cut off a lot of existing data. And cannot be implemented
    > efficiently. And cannot be guaranteed.

    As a programmer, I will say:

    - Validating conversion is *part* of supporting Unicode, not a frill.
    - Validating conversion is one of the easiest parts of supporting
    Unicode, not a major source of struggle.
    - The standard is very clear about validation; there is no controversy
    over where to start and where to end.
    - Strict validation is required by the standard, and not that difficult.
    - Validation of conversion can be very efficient.
    - Validation of conversion from a well-defined charset is
    straightforward, and can easily be guaranteed.
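
    To give a sense of scale, here is one possible validator, written as
    a sketch in Python 3. It follows the well-formed byte ranges given in
    the Unicode Standard; the function name is hypothetical, and a
    production decoder would normally validate and convert in one pass.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Single pass over the bytes; no backtracking, no allocation."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:
            i += 1
            continue
        # Pick the sequence length and the allowed range of the second
        # byte, which excludes overlong forms, surrogates, and anything
        # above U+10FFFF.
        if 0xC2 <= lead <= 0xDF:
            length, low, high = 2, 0x80, 0xBF
        elif lead == 0xE0:
            length, low, high = 3, 0xA0, 0xBF
        elif lead == 0xED:
            length, low, high = 3, 0x80, 0x9F
        elif 0xE1 <= lead <= 0xEF:
            length, low, high = 3, 0x80, 0xBF
        elif lead == 0xF0:
            length, low, high = 4, 0x90, 0xBF
        elif 0xF1 <= lead <= 0xF3:
            length, low, high = 4, 0x80, 0xBF
        elif lead == 0xF4:
            length, low, high = 4, 0x80, 0x8F
        else:
            return False  # 80..C1 and F5..FF can never start a sequence
        if i + length > n:
            return False  # truncated sequence
        if not low <= data[i + 1] <= high:
            return False
        for k in range(i + 2, i + length):
            if not 0x80 <= data[k] <= 0xBF:
                return False
        i += length
    return True


samples = [b"\xc9\x99", b"\x99\xc9", b"\xc0\x80", b"\xed\xa0\x80",
           b"\xee\xba\x99", b"\xf4\x90\x80\x80", b"plain ASCII"]
results = [is_valid_utf8(sample) for sample in samples]
```

    The loop touches each byte once, so the cost is linear in the length
    of the input.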

    > Can you, please, provide a description of specific problems with my
    > design? I mean other than that it violates certain rules, clauses or
    > whatever.

    Well, there's that. That's not trivial, is it?

    > And why do you think <99 C9> would become U+E000 and U+E001?! It's
    > U+E099 and U+E0C9.

    OK, whatever. It was just an example.

    > And no, my solution does not interpret UTF-8 correctly. Why should
    > it. Codepoints used for the roundtrip area are not supposed to be
    > valid. They are again stored as invalid sequences.

    So your solution not only involves the use of invalid UTF-8 sequences,
    it also does not interpret valid UTF-8 correctly.

    Ken, I'm sorry, but your faith was not borne out.

    > And, it's not E0, it's EE, if anyone cares.

    Doesn't change anything. U+EE99 is a perfectly valid Unicode code
    point, whose UTF-8 representation is <EE BA 99>. Failure to convert
    between the two is a fundamental lack of conformance. (Of course, if
    you're willing to think outside the box...)
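
    The arithmetic behind <EE BA 99> can be checked with a short sketch,
    assuming Python 3. The bit manipulation is the standard three-byte
    UTF-8 form; the variable names are illustrative.

```python
code_point = 0xEE99  # a Private Use Area code point, U+E000..U+F8FF

# Three-byte form: 1110xxxx 10xxxxxx 10xxxxxx
byte1 = 0xE0 | (code_point >> 12)
byte2 = 0x80 | ((code_point >> 6) & 0x3F)
byte3 = 0x80 | (code_point & 0x3F)
by_hand = bytes([byte1, byte2, byte3])

# The built-in codec agrees, and converts back to the same code point.
by_codec = chr(code_point).encode("utf-8")
round_trip = ord(by_codec.decode("utf-8"))
```

    A conformant converter has to perform this mapping in both directions
    for every code point in the range, including the ones the scheme
    wants to reserve.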

    >> I assure you, nobody will reject this scheme on the basis that it had
    >> not been considered before.
    > I am not so sure. Although, I am afraid somebody would try to reject
    > it because IT HAS been considered before. But has not been explained
    > well enough.

    It will be accepted or rejected on the basis of its own merits or

    Why don't you write a proposal for this to the UTC? They may be able to
    provide you with a more satisfactory answer than I can. Be sure to be
    thorough in describing what you want.

    > And, yes, you could try to be a little bit less harsh and try to sound
    > a little bit less personal. I am trying myself.

    I apologize for my demeanor in this thread. It is not normally my

    -Doug Ewell
     Fullerton, California
