RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan
Date: Tue Dec 07 2004 - 12:43:13 CST

    Doug Ewell replied:

    > Actually the Unicode Technical Committee. But you are correct: it is
    > up to the UTC to decide whether they want to redefine UTF-8 to permit
    > invalid sequences, which are to be interpreted as unknown characters
    > from an unknown legacy coding standard, and to prohibit conversion
    > from this redefined UTF-8 to other encoding schemes, or directly to
    > Unicode code points. We will have to wait and see what UTC members
    > think of this.
    I never said it doesn't violate any existing rules. Stating that it does
    doesn't help a bit. Rules can be changed, assuming we understand the
    consequences. And that is what we should be discussing. By stating what
    should be allowed and what should be prohibited you are again defending
    those rules. I agree, rules should be defended, but only up to a certain
    point. Simply finding a rule that is violated is not enough to prove
    something is bad or useless.

    > > But this decision should not be based solely on theory and ideal
    > > worlds.
    > Right. Uh-huh.
    Defining Unicode as the world of codepoints is a complex task on its own. It
    seems that you are afraid of stepping out of this world, since you do not
    know what awaits you there. So, it is easier to find an excuse within
    existing rules, especially if a proposed change threatens to shake
    everything right down to the foundation. If I were dealing with Unicode
    (as we know it), I would probably be doing the same thing. I ask you to step
    back and try to see the big picture.

    > Of course not. That is not at all the same as INTENTIONALLY storing
    > invalid sequences in UTF-8 and expecting the decoding mechanism to
    > preserve the invalid bytes for posterity.
    So you would drop the data. There are only two options with current designs:
    dropping invalid sequences, or storing them separately (which probably means
    the whole document is dead until manually decoded). Dropping invalid
    sequences is actually a better choice. And it would even be justifiable (but
    still sometimes inconvenient) if we were living in a world where everything
    is in UTF-8. In a world trying to transition from legacy encodings to
    Unicode, there could be a lot of data lost and a lot of angry users.

    > And do what with it, Lars? Keep it on a shelf indefinitely in case
    > some archaeologist unearths a new legacy encoding that might unlock
    > the mystery data?
    > Is this really worth the effort of redefining UTF-8 and disallowing
    > free conversion between UTF-8 and Unicode code points?
    > Do you have a use case for this?
    Yes, I definitely have. I am the one accusing you of living in a perfect
    world, remember? Do you think I would do that if I weren't dealing with this
    problem in real life?

    > So with your plan, you have invalid sequence #1, invalid sequence #2,
    > and so forth. Now, what do the sequences mean? Is there any way to
    > interpret them? No, there isn't, because by definition these
    > sequences represent characters from an unknown coding standard.
    > Either (a) nobody has gone to the trouble to find out what characters
    > they truly represent, (b) the original standard is lost and we will
    > *never* know, or (c) we are waiting for the archaeologist to save the
    > day.
    > In the meantime, the UTF-8 data with invalid sequences must be kept
    > isolated from all processes that would interpret the sequences as code
    > points, and raise an exception on invalid sequences-- in other words,
    > all existing processes that handle UTF-8.
    On the contrary. If those invalid sequences can (well, may) be translated
    into codepoints, then you can stop worrying about them. Or at least all the
    worrying is done within the conversion. It is the current design that is
    unfair. A UTF-16 based program will only be able to process valid UTF-8
    data. A UTF-8 based program will in many cases preserve invalid sequences
    even without any effort. Let me guess, you will say it is a flaw in the
    UTF-8 based program. If validation is desired, yes. But then I think you
    would want all UTF-8 based programs to do that. That will not happen. What
    will happen is that UTF-8 based programs will be better text editors
    (because they will not lose data or constantly complain), while UTF-16 based
    programs will produce cleaner data. You will opt for the latter. And I for
    the former. But will users know exactly what they've got? Will designers
    know exactly what they're gonna get? This is where all this started. I
    stated that there is an important difference between deciding for UTF-8 or
    for UTF-16 (or UTF-32).

    > > Let's compare UTF-8 to UTF-16 conversion to an automated translation
    > > from German to French. What Unicode standard says can be interpreted
    > > as follows:
    > >
    > > * All input text must be valid German language.
    > > * All output text must be valid French language.
    > > * Any unknown words shall be replaced by a (single) 'unknown word'
    > > placeholder.
    > If you have French words that cannot be translated into German at all,
    > and nobody in the target audience is capable of understanding French,
    > then what you have is an inscrutable collection of mystery data,
    > perhaps suitable for research and examination by linguists, but not
    > something that the audience can make any sense of. In that case,
    > converting all the mystery data to a single "unknown word"
    > placeholder is no worse than any other solution, and in particular,
    > no worse than a solution that converts 100 different mystery words
    > into 100 different placeholders, *none* of which the audience can
    > decipher.
    Well, my solution does not involve mystery replacements. It's more like
    keeping the original word, with a small flyout saying "not translated". BTW,
    you have mixed up source and target. Or I don't understand what you're
    trying to say. And why call it mystery data? It just means that the word
    wasn't in the dictionary, not that a team of experts searched for it for two
    weeks. I'll keep the word as it is. Again, you can do that in UTF-8 (well,
    not currently proper UTF-8, but as invalid sequences), but cannot in UTF-16.

    > Data stored in UTF-8 and UTF-16 and UTF-32 must remain completely
    > interchangeable, from one encoding form to another. That is not
    > negotiable.
    (smiles) It should be.

    Besides, surrogates are not completely interchangeable. Frankly, they are,
    but do not need to be, right? Instead of using the PUA, I could have chosen
    unpaired surrogates, but I would risk a UTF-16 validator dropping them. The
    128 codepoints I am proposing definitely need to have a special status, like
    the surrogates. And like I once said, UTF-16 got a big chunk of the BMP, and
    a lot of exceptions. The same can be done for UTF-8. With only 128

    > > Now suppose you have a UNIX filesystem, containing filenames in a
    > > legacy encoding (possibly even more than one). If one wants to
    > > switch to UTF-8 filenames, what is one supposed to do? Convert all
    > > filenames to UTF-8?
    > Well, yes. Doesn't the file system dictate what encoding it uses for
    > file names?
    No, it doesn't.

    > How would it interpret file names with "unknown" characters from a
    > legacy encoding?
    Byte by byte.

    > How would they be handled in a directory search?
    Byte by byte.

    > > Who will do that? And when? Will all users agree?
    > Agree about what? The conversion of characters from a legacy
    > character set to Unicode? That's not up to the users; there are
    > well-defined conversion tables that take care of this, for virtually
    > every legacy character set you will ever run into. And for the ones
    > we don't know, we don't know.
    Now that you know there is no information about the encoding in the
    filesystem, you have a chance to reevaluate my question. Think biiiiig
    servers, with many users, many documents, some DOS encodings, some ISO
    encodings, a few others, then all this maybe somewhere where Latin 1 and
    Latin 2 overlap. Oh, and a delegation from Japan was also there for a couple
    of months. They'll be back in January.

    What shall we do? Convert everything to UTF-8? How? Keep everything as it is
    but start creating new files in UTF-8?

    > > Should all
    > > filenames that do not conform to UTF-8 be declared invalid?
    > If you have a UTF-8 file system, yes.
    > > And those files inaccessible?
    > Those files should never have gotten into the file system in the first
    > place if they have invalid names. As I said before, how would you go
    > about accessing them? If your OS has a UTF-8 file system, it is
    > probably Unicode-based in general, so you would be using databases or
    > editors or whatever that work with Unicode characters. Except, whoops,
    > this file name here contains an invalid UTF-8 sequence. That means it
    > contains one or more characters whose mapping to Unicode is unknown or
    > non-existent. Those files are ALREADY inaccessible.
    I'll just ignore this and wait for you to re-think, based on new evidence.

    > > If you keep all processing in UTF-8, then this is
    > > a decision you can postpone. But if you start using UTF-32
    > > applications for processing filenames, invalid sequences will be
    > > dropped and those files can in fact become inaccessible. And then
    > > you'll be wondering why users don't want to start using Unicode.
    > Right.
    > > I didn't encourage users to mix UTF-8 filenames and Latin 1
    > > filenames.
    > > Do you want to discourage them?
    > Absolutely.
    I'll just ignore this and wait for you to re-think, based on new evidence.

    > And you want to apply this strategy to file names, where unambiguous
    > matching is critically important?
    > (leans toward door) Security!
    Again, you have it all wrong. Providing round trip capability for UTF-8
    solves the problem. Dropping invalid sequences is what creates
    opportunities for false matches.
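
The false-match argument can be illustrated with a small Python sketch. The
filenames are made up for illustration, and the "replace" and "ignore" error
handlers of Python's UTF-8 decoder stand in for a converter that substitutes
U+FFFD or drops invalid bytes; this is not a model of any particular product.

```python
# Hypothetical Latin-1 filenames that differ in exactly one byte.
name_a = b"report\xe9.txt"   # 0xE9 is e-acute in Latin-1, invalid here as UTF-8
name_b = b"report\xe8.txt"   # 0xE8 is e-grave in Latin-1, invalid here as UTF-8
plain = b"report.txt"        # an unrelated, perfectly valid name


def lossy(name: bytes, errors: str) -> str:
    """Decode as UTF-8, treating invalid bytes according to `errors`."""
    return name.decode("utf-8", errors=errors)


# Compared byte by byte, the three names are all different.
distinct_as_bytes = len({name_a, name_b, plain}) == 3

# Replacing invalid bytes with U+FFFD makes name_a and name_b identical.
collide_on_replace = lossy(name_a, "replace") == lossy(name_b, "replace")

# Dropping invalid bytes makes both identical to the valid name as well.
collide_on_drop = (
    lossy(name_a, "ignore") == lossy(name_b, "ignore") == lossy(plain, "ignore")
)

print(distinct_as_bytes, collide_on_replace, collide_on_drop)
```

A decoder that keeps each invalid byte distinguishable would leave the three
names distinct after conversion.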

    > > But actually data has not been lost in this case. It is however lost
    > > in all other cases, for example in NESTLÉ followed by a space. That
    > > is, if for whatever reason this byte sequence happens to be in a
    > > UTF-8 encoded text. Or if you named your file NESTLÉ.rtf and then
    > > switched your locale to UTF-8.
    > Why am I switching from locale A to locale B and expecting the files I
    > named under locale A to remain unchanged under locale B? Do I expect
    > that for other text? If so, why have locales?
    I was talking about when data is lost and when it isn't. I never proposed
    anything like what you are describing here.
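
As a rough illustration of the data-loss case, the following Python sketch
decodes the Latin-1 bytes of "NESTLÉ" followed by a space as if they were
UTF-8. It assumes a decoder that either substitutes U+FFFD or drops the
offending byte, modelled here with Python's "replace" and "ignore" error
handlers.

```python
# Latin-1 bytes for "NESTLE-acute" plus a space. In UTF-8, 0xC9 is a lead
# byte that must be followed by a continuation byte, and a space is not one.
raw = "NESTL\u00c9 ".encode("latin-1")

replaced = raw.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD
dropped = raw.decode("utf-8", errors="ignore")    # invalid byte is removed

# Converting back to UTF-8 does not restore the original bytes in either case.
replaced_back = replaced.encode("utf-8")
dropped_back = dropped.encode("utf-8")

print(repr(replaced), replaced_back == raw)
print(repr(dropped), dropped_back == raw)
```

In both cases the original byte 0xC9 cannot be recovered from the converted
text.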

    > There are plenty of tables available for converting data from other
    > standards to Unicode. ICU has hundreds of them. That is how Unicode
    > supports the processing of non-Unicode data, by defining how it can be
    > converted to Unicode.
    > Unicode is a standard for character encoding. It is not, *and should
    > not be*, a standard for storing arbitrary binary data.
    If you can guarantee that all data will be valid Unicode, then there would
    be no need for the latter. And it's not arbitrary. It is about small
    portions of data within otherwise valid UTF-8 data. Those can be legacy
    encoded filenames, someone mistakenly inserting Latin 1 into a UTF-8
    document, transmission errors, whatever. I think preserving data should be
    possible. Programs that explicitly need to have clean data can validate,
    drop or whatever. It's about the choice. Currently there isn't one.

    > >> What sort of "exception" is to be raised? What sort of "additional
    > >> caution" should the user take? What if this process is not
    > >> interactive, and contains no user intervention?
    > >
    > > Precisely the same questions I need to ask you for the process of
    > > converting the data. You are saying that data that cannot be
    > > converted should not be converted.
    > Go back and read that last sentence again, please.
    > Done? Yes, that is what I am saying.
    Yes, I know you are. The next step is an exception and a manual
    intervention. Now go back, read both original paragraphs, and try to
    understand what I said. Triumphing over a sentence torn out of context
    will not help.

    > > Then what if THIS process is not interactive? What if there is no
    > > manpower that could ever make it interactive? Simply resort to
    > > conversion with data loss?
    > You ALREADY HAVE data loss. You have character data that cannot be
    > converted to Unicode, which you are trying to fit into a Unicode
    > environment. Where is the practical difference between converting the
    > mystery "characters" to U+FEFF and converting them to invalid UTF-8
    > sequences that still cannot be interpreted?
    What data loss? Just a file with some Latin 1 characters. Anybody who
    understands the language can quickly guess the encoding that must be
    selected in order to display the file properly. Or convert it to Unicode.
    What I am saying is that you need to assume an automated process. And that
    you need to assume that nobody has the time to supervise it.

    > I'm just waiting for UTC members to join this thread and provide the
    > "proof" you need.

    > How do you propose to "index" data that you cannot identify?
    Identify? Anyway, I index data the way it indexes. What you put in is what
    you get. Much better than "some of what you put in you also get out". And
    the first couple of characters may be perfectly valid. You are assuming that
    it doesn't work. I know that it does. And indexing is not the only problem.
    It costs a lot to separate convertible data from non-convertible data. A lot
    of code, a lot of possibility for errors.

    > > It is the best (I hope) solution given the current limitations.
    > The "current limitations" are the way Unicode *and every other
    > character encoding standard* are designed. It is how Unicode will
    > remain, unless, as I said, UTC comes up with a real shocker for us
    > all.
    It won't be the end of the world. The next day will be just like any other.

    > Again, I ask for a realistic use case which will demonstrate that
    > "this problem" of uninterpretable legacy data is something that
    > Unicode should assist in solving.
    Storing UNIX filenames in a Windows database.
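
One way to picture this use case is with Python's existing "surrogateescape"
and "surrogatepass" error handlers. Note the assumptions: these handlers map
each undecodable byte to an unpaired low surrogate (U+DC80..U+DCFF), which is
the unpaired-surrogate alternative rather than the PUA scheme discussed in
this message, and the two function names are hypothetical. The sketch only
shows that a byte-exact round trip through UTF-16 storage is possible when the
store does not validate.

```python
def unix_name_to_utf16(name: bytes) -> bytes:
    """Convert a raw UNIX filename to UTF-16LE bytes for a UTF-16 text field."""
    # Invalid UTF-8 bytes become unpaired low surrogates instead of being lost.
    text = name.decode("utf-8", errors="surrogateescape")
    # "surrogatepass" lets the UTF-16 encoder emit those unpaired surrogates.
    return text.encode("utf-16-le", errors="surrogatepass")


def utf16_to_unix_name(stored: bytes) -> bytes:
    """Recover the original filename bytes from the stored UTF-16LE value."""
    text = stored.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")


names = [
    b"caf\xc3\xa9.txt",  # valid UTF-8
    b"caf\xe9.txt",      # Latin-1, invalid as UTF-8
    b"\x99\xc9",         # two invalid bytes in a row
]
for name in names:
    print(name, utf16_to_unix_name(unix_name_to_utf16(name)) == name)
```

A strict UTF-16 validator on the storage side would reject the unpaired
surrogates, which is the risk mentioned earlier for that approach.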

    > > When that is realized, the logical next step is to get rid of the
    > > remaining problems. That is, to move from PUA to BMP.
    > Characters don't get moved from PUA to BMP unless UTC assigns them
    > there.
    Yes, that is what I meant.

    > > And bite the sour apple of security consequences.
    > You think the "invalid UTF-8" scheme has fewer security consequences
    > than using the PUA?
    I think you still don't understand my scheme.

    > > Though, you have a serious security issue even with your design. Two
    > > strings may compare different in UTF-8, but same if converted to
    > > UTF-16. Yes, assuming they contained invalid sequences.
    > If I convert two different invalid UTF-8 sequences to the same Unicode
    > code point (U+FFFD), or otherwise raise the same error condition for
    > both, as directed by conformance clause C12a, then this is a serious
    > security issue with my design. Hmm, yes, I can see that.
    Was that sarcastic, or...?

    > > So, you should validate any UTF-8 data before processing it. A nasty
    > > prerequisite. Thought all UTFs were equal...
    > They are. You have to validate any UTF sequence before processing it.
    > Data in UTF-16 may not contain surrogate code points, except where a
    > high surrogate is followed by a low surrogate (Definition D35). Data
    > in UTF-32 may not contain surrogates, or any value above 0x10FFFF
    > (Definition D31). Nothing special about UTF-8 here.
    Programmers are struggling to _support_ Unicode. You can't realistically
    expect that they will now also validate all data. They won't even know where
    to start and where to end. Typically only conversion performs some kind of
    validation (sometimes only implicitly). Let me simply say that strict
    validation is the difference between the ideal world and the real world. And
    this validation will cut off a lot of existing data. And cannot be
    implemented efficiently. And cannot be guaranteed.
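
For reference, the validation rules quoted above for UTF-16 and UTF-32 can be
sketched in Python as follows. This is only an illustration of those two
rules as stated in the quote; the function names are hypothetical, code units
are modelled as plain integers, and nothing here addresses where in a program
such checks would have to be applied.

```python
def is_valid_utf32(units) -> bool:
    """Each unit must be at most 0x10FFFF and must not be a surrogate."""
    return all(
        0 <= u <= 0x10FFFF and not (0xD800 <= u <= 0xDFFF) for u in units
    )


def is_valid_utf16(units) -> bool:
    """Surrogates may appear only as a high surrogate followed by a low one."""
    i = 0
    while i < len(units):
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            return False                      # not a 16-bit code unit
        if 0xD800 <= u <= 0xDBFF:             # high surrogate
            nxt = units[i + 1] if i + 1 < len(units) else None
            if nxt is None or not (0xDC00 <= nxt <= 0xDFFF):
                return False                  # unpaired high surrogate
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            return False                      # low surrogate without a high one
        else:
            i += 1
    return True


print(is_valid_utf16([0x41, 0xD83D, 0xDE00]))  # letter plus a surrogate pair
print(is_valid_utf16([0x41, 0xDC99]))          # unpaired low surrogate
print(is_valid_utf32([0x41, 0x110000]))        # value above 0x10FFFF
```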

    > It provides matching for characters declared not to be convertible to
    > Unicode. Please provide a use case that demonstrates that this
    > problem occurs in the real world, and needs to be solved within
    > Unicode.
    I think I have. Can you please provide a description of specific problems
    with my design? I mean other than that it violates certain rules, clauses or

    > So if you have, say, the invalid UTF-8 sequence <99 C9>, you treat it
    > as two mystery characters and map them to U+E000 and U+E001. Then if
    > you have the valid UTF-8 sequences <EE 80 80> and <EE 80 81>, you also
    > map those to U+E000 and U+E001, because of course your solution still
    > interprets valid UTF-8 correctly.
    And why do you think <99 C9> would become U+E000 and U+E001?! It's U+E099
    and U+E0C9.
    And no, my solution does not interpret UTF-8 correctly. Why should it?
    Codepoints used for the roundtrip area are not supposed to be valid. They
    are again stored as invalid sequences.

    And, it's not E0, it's EE, if anyone cares.
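
A minimal Python sketch of the mapping described here, under stated
assumptions: the roundtrip area is taken to be U+EE80..U+EEFF (inferred from
the "it's EE" correction and the 128 codepoints mentioned earlier), byte 0xNN
maps to U+EENN, and a well-formed UTF-8 encoding of a codepoint inside that
area is deliberately treated as invalid bytes. The function names are
hypothetical and this is not a reference implementation of the proposal.

```python
ESC_BASE = 0xEE00                    # assumed: byte 0xNN <-> U+EENN
ESC_LO, ESC_HI = 0xEE80, 0xEEFF      # assumed 128-codepoint roundtrip area


def decode_roundtrip(data: bytes) -> str:
    """Decode UTF-8, mapping every byte of an invalid sequence to U+EENN."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # ASCII passes through unchanged
            out.append(chr(b))
            i += 1
            continue
        for n in (2, 3, 4):          # try a well-formed multi-byte sequence
            try:
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            # Accept one character, unless it lies in the roundtrip area.
            if len(ch) == 1 and not (ESC_LO <= ord(ch) <= ESC_HI):
                out.append(ch)
                i += n
                break
        else:                        # nothing acceptable: escape this byte
            out.append(chr(ESC_BASE + b))
            i += 1
    return "".join(out)


def encode_roundtrip(text: str) -> bytes:
    """Encode to UTF-8, turning roundtrip-area codepoints back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_LO <= cp <= ESC_HI:
            out.append(cp - ESC_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


print([hex(ord(c)) for c in decode_roundtrip(b"\x99\xc9")])
print(encode_roundtrip(decode_roundtrip(b"\xee\xba\x99")) == b"\xee\xba\x99")
```

With these assumptions any byte string survives bytes => text => bytes
unchanged. The opposite direction is not guaranteed: the two codepoints
U+EEC3 U+EEA9 encode to the bytes <C3 A9>, which decode as a single U+00E9.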

    > Then if you want to convert these two examples back to UTF-8, you
    > convert both them to... what?
    > (stands up and looks down hall) SECURITY!!
    > > The other name for this is roundtripping. Currently, Unicode allows
    > > a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are
    > > several reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more
    > > valuable, even if it means that the other roundtrip is no longer
    > > guaranteed:
    > Oh, good, now we have "UTF-16(32)". What is that?
    Oh, please.

    > Never mind Unicode. You have a lot to learn about data structures,
    > data conversion, security, and standardization in general.
    No comment.

    > You are not really talking about UTF-32 so much as the whole idea of
    > code points. You think UTF-8 byte sequences *themselves* should be
    > the representation of the character. This is, very simply, not the
    > way it is.
    You have long since lost track of what I am talking about. I hope that
    not all readers have.

    > >> This is completely wrong-headed. The whole purpose of UTF-8 is to
    > >> represent valid Unicode code points, convertible to any other
    > >> encoding form.
    > >
    > > I agree. Again, this is a question of requirements. Current
    > > requirements have been met, no doubt about that. I am simply saying
    > > that new requirements should be added. Or at least considered. But
    > > fairly, not rejected simply because they were not considered before.
    > I assure you, nobody will reject this scheme on the basis that it had
    > not been considered before.
    I am not so sure. Although, I am afraid somebody would try to reject it
    because IT HAS been considered before. But has not been explained well

    > >> Sorry to be so harsh and sound so personal. Frankly, I am amazed
    > >> that in the 39 hours since your message was posted, this mailing
    > >> list has not become absolutely flooded with responses explaining
    > >> what a truly bad idea this is.
    > >
    > > I hope it is because it is not as bad as you still think.
    > It is far, far worse than you think.
    Maybe they're playing divide'n'conquer.

    And, yes, you could try to be a little bit less harsh and try to sound a
    little bit less personal. I am trying myself.


    This archive was generated by hypermail 2.1.5 : Tue Dec 07 2004 - 12:45:19 CST