RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan
Date: Sat Dec 11 2004 - 09:49:13 CST


    Kenneth Whistler wrote:
    > Lars responded:
    > > > ... Whatever the solutions
    > > > for representation of corrupt data bytes or uninterpreted data
    > > > bytes on conversion to Unicode may be, that is irrelevant to the
    > > > concerns on whether an application is using UTF-8 or UTF-16
    > > > or UTF-32.
    > > The important fact is that if you have an 8-bit based
    > program, and you
    > > provide a locale to support UTF-8, you can keep things
    > working (unless you
    > ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    > You can keep *some* things *sorta* working.
    I didn't say that this is all that needs to be done. But the way you say it
    makes one think that this is not even the right track.

    > > prescribe validation). But you cannot achieve the same if
    > you try to base
    > > your program on 16 or 32 bit strings.
    > Of course you can. You just have to rewrite the program to handle
    > 16-bit or 32-bit strings correctly. You can't pump them through
    > 8-bit pipes or char* API's, but it's just silly to try that, because
    > they are different animals to begin with.
    Correctly? Strings? There are no strings and no encodings in a UNIX
    filesystem. Please clarify.
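    [A sketch of the point being argued here, not part of the original mail: on a POSIX filesystem a name is just a sequence of bytes — anything except NUL and '/' is legal, and no encoding is recorded anywhere, so a filename need not be valid UTF-8 or valid in any other encoding.]

```python
import os
import tempfile

# Sketch (illustration only): POSIX filenames are raw byte strings;
# nothing requires them to be valid in any encoding.
d = tempfile.mkdtemp().encode()
name = b'\xc3\x28'            # 0xC3 expects a continuation byte, so this
                              # is invalid UTF-8 -- yet a legal filename
open(os.path.join(d, name), 'wb').close()

entries = os.listdir(d)       # bytes in, bytes out -- no decoding happens
print(name in entries)        # True

try:
    name.decode('utf-8')
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)              # False: the name is not UTF-8 text
```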

    > By the way, I participated as an engineer in a multi-year project
    > that shifted an advanced, distributed data analysis system
    > from an 8-bit character set to 16-bit Unicode. *All*
    > user-visible string
    > processing was converted over -- and that included proprietary
    > file servers, comm servers, database gateways, networking code,
    > a proprietary 32-bit workstation GUI implementation, and a suite
    > of object-oriented application tools, including a spreadsheet,
    > plotting tool, query and database reporting tools, and much more.
    > It worked cross-platform, too.
    > It was completed, running, and *delivered* to customers in 1994,
    > a decade ago.
    OK, was this a fresh development, or was it an upgrade of an existing system?
    Did the existing system contain user data that needed to be converted?
    Was this data all in ASCII?
    Was this data all in a single code page?
    Latin 1 perhaps?
    How much of that data was in UTF-8?

    > You can't bamboozle me with any of this "it can't be done with
    > 16-bit strings" BS.

    BS? Bamboozle? One learns all sorts of new words here on this mailing list.
    Frankly, I find it interesting to read the many historical and cultural facts
    in off-topic discussions, but I have a feeling I am not the only one, and that
    many people prefer to engage in those discussions while the original questions
    remain unanswered. And interesting ideas unexplored.

    I know it is hard to follow someone else's ideas, spread over many mails,
    already sidetracked by those who think they understand what is being
    discussed and by those who can't distinguish between following a standard
    and changing or extending it. In the end, statements torn out of context do
    in fact look as if they're nonsense.

    Much of your response (in this particular mail, not in general) is just that.
    One misinterpretation after another. And detailed explanations of things
    that are not even being discussed. Non-conformances being pointed out, where
    consequences of proposed changes should in fact be discussed. I am
    disappointed by this attitude, even more so because it comes from one of the
    most respected people on this mailing list.

    > Yes you can.
    > No, you need not -- that is non-conformant, besides.
    > Utterly non-conformant.
    > Also utterly nonconformant.

    I suppose surrogates were also non-conformant at the time they were
    proposed. Can I interpret your responses as meaning that surrogates should
    never have been accepted into the Unicode standard?

    > I just don't understand these assertions at all.
    I have given plenty of examples.

    > First of all it isn't "UNIX data" or "Windows data" -- it is
    > end user's data, which happens to be processed in software
    > systems which in turn are running on a UNIX or Windows OS.
    This is resorting to a philosophical answer, picking on words.

    > I work for a company that *routinely* runs applications that
    > cross the platform barriers in all sorts of ways. It works
    > because character sets are handled conformantly, and conversions
    > are done carefully at platform boundaries -- not because some
    > hack has been added to UTF-8 to preserve data corruptions.
    Sybase, yes. A very controlled environment. The fact that validity of data
    *can* be guaranteed in your particular environment gives you not more, but
    less right to make judgements about other environments and claim the
    problems can be solved 'by doing things correctly'.

    > > If the purpose of Unicode is to define bricks for plain
    > text, then what
    > > the hell are the surrogates doing in there?
    > This seems to represent a complete misunderstanding of the Unicode
    > character encoding forms.
    > This is completely equivalent to examining all the UTF-8 bytes
    > and then asking "what the hell are 0x80..0xF4 doing in there?"
    > And if you don't understand the analogy, then I submit that
    > you don't understand Unicode character encoding forms. Sorry.
    I was talking about the codepoints for surrogates, not their incarnation as
    'unsigned short' values in UTF-16. There are no codepoints for 0x80..0xF4.
    There is no analogy. Sorry.
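    [The distinction can be checked directly — a sketch, not from the original mail, assuming Python's behavior, which follows the standard on this point: U+D800..U+DFFF are assigned positions in Unicode's code space, whereas 0x80..0xF4 are merely byte values of the UTF-8 encoding form, not code points. A lone surrogate code point exists, but a conformant UTF-8 encoder must refuse to serialize it.]

```python
# Sketch: surrogates are real code points, but no conformant encoding
# form can represent one in isolation.
s = '\ud800'                  # a lone surrogate code point, U+D800
print(hex(ord(s)))            # 0xd800 -- it exists in the code space

try:
    s.encode('utf-8')         # a conformant encoder must reject it
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)              # False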

    > > > Storage of UNIX filenames on Windows databases, for example,
    > > > can be done with BINARY fields, which correctly capture the
    > > > identity of them as what they are: an unconvertible array of
    > > > byte values, not a convertible string in some particular
    > > > code page.
    > > Sigh. Storing is just a start. Windows filenames are also
    > stored in the same
    > > database. And eventually, you need to have data from both
    > of them in the
    > > same output.
    > Then you need an application architecture that is sophisticated
    > enough to maintain character set state and deal with it correctly.
    > You can't just use 8-bit pipes, wave your hands, and assume that
    > it will all work out in the end.
    > > Or, for example, one might want to compare filenames from one
    > > platform with the filenames from the other. All this is
    > impossible in
    > > UTF-16.
    > Nonsense.
    You say UNIX filenames are binary data and cannot be mixed with text. Yes, I
    can understand that UNIX filenames shouldn't come anywhere near Windows
    filenames, this is indeed nonsense. Utter nonsense.
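    [For what it's worth, the round-tripping under discussion is essentially what Python later standardized as the 'surrogateescape' error handler (PEP 383) — a sketch, offered only as an illustration of the idea: each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN on the way in, and restored losslessly on the way out.]

```python
# Sketch: round-tripping arbitrary bytes through a Unicode string via
# the 'surrogateescape' error handler (PEP 383).
raw = b'caf\xe9'              # Latin-1 bytes; 0xE9 is invalid as UTF-8
s = raw.decode('utf-8', 'surrogateescape')
print(ascii(s))               # 'caf\udce9' -- the bad byte became U+DCE9

back = s.encode('utf-8', 'surrogateescape')
print(back == raw)            # True: the original bytes are recovered
```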


    This archive was generated by hypermail 2.1.5 : Sat Dec 11 2004 - 09:57:59 CST