Re: Nicest UTF

From: Doug Ewell
Date: Sun Dec 05 2004 - 01:50:03 CST


    Lars Kristan wrote:

    >> I think UTF8 would be the nicest UTF.
    > I agree. But not for reasons you mentioned. There is one other
    > important advantage: UTF-8 is stored in a way that permits storing
    > invalid sequences. I will need to elaborate that, of course.

    I could not disagree more with the basic premise of Lars' post. It is a
    fundamental and critical mistake to try to "extend" Unicode with
    non-standard code unit sequences to handle data that cannot be, or has
    not been, converted to Unicode from a legacy standard. This is not what
    any character encoding standard is for.

    > 1.2 - Any data for which encoding is not known can only be stored in a
    > UTF-16 database if it is converted. One needs to choose a conversion
    > (say Latin-1, since it is trivial). When a user finds out that the
    > result is not appealing, the data needs to be converted back to the
    > original 8-bit sequence and then the user (or an algorithm) can try
    > various encodings until the result is appealing.

    This is simply what you have to do. You cannot convert the data into
    Unicode in a way that says "I don't know how to convert this data into
    Unicode." You must either convert it properly, or leave the data in its
    original encoding (properly marked, preferably).
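    In code, the rule amounts to strict decoding with an explicit, marked fallback. A Python sketch (the function name and the "unknown-8bit" label are illustrative, not from this post):

```python
def ingest(raw: bytes):
    """Accept text properly converted to Unicode, or keep it as
    marked, undecoded bytes -- never smuggle junk into the store."""
    try:
        # Convert it properly: strict UTF-8 rejects invalid sequences.
        return ("utf-8", raw.decode("utf-8"))
    except UnicodeDecodeError:
        # Leave the data in its original encoding, explicitly marked
        # as not-yet-deciphered rather than disguised as Unicode.
        return ("unknown-8bit", raw)

print(ingest(b"caf\xc3\xa9"))   # well-formed UTF-8, decoded
print(ingest(b"caf\xe9"))       # legacy bytes stay bytes, marked
```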

    It is just as if a German speaker wanted to communicate a word or phrase
    in French that she did not understand. She could find the correct
    German translation and use that, or she could use the French word or
    phrase directly (moving the translation burden onto the listener). What
    she cannot do is "extend" German by creating special words that are
    placeholders for French words whose meaning she does not know.

    > 2.2 - Any data for which encoding is not known can simply be stored
    > as-is.

    NO. Do not do this, and do not encourage others to do this. It is not
    valid UTF-8.

    Among other things, you run the risk that the mystery data happens to
    form a valid UTF-8 sequence, by sheer coincidence. The example of
    "NESTLÉ™" in Windows CP1252 is applicable here. The last two bytes are
    C9 99, a valid UTF-8 sequence for U+0259. By applying the concept of
    "adaptive UTF-8" (as Dan Oscarsson called it in 1998), this sequence
    would be interpreted as valid UTF-8, and data loss would occur.
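    The coincidence is easy to demonstrate (a Python sketch):

```python
data = "NESTL\u00c9\u2122".encode("cp1252")   # "NESTLÉ™"
assert data == b"NESTL\xc9\x99"               # É = C9, ™ = 99 in CP1252

# By coincidence, <C9 99> is a well-formed two-byte UTF-8 sequence:
assert b"\xc9\x99".decode("utf-8") == "\u0259"  # LATIN SMALL LETTER SCHWA

# An "adaptive" decoder that passes through anything that looks like
# UTF-8 would therefore read the É and ™ as a single ə, losing data.
```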

    > 2.4 - Any data that was stored as-is may contain invalid sequences,
    > but these are stored as such, in their original form. Therefore, it is
    > possible to raise an exception (alert) when the data is retrieved.
    > This warns the user that additional caution is needed. That was not
    > possible in 1.4.

    This is where the fatal mistake is made. No matter what Unicode
    encoding form is used, its entire purpose is to encode *Unicode code
    points*, not to implement a two-level scheme that supports both Unicode
    and non-Unicode data. What sort of "exception" is to be raised? What
    sort of "additional caution" should the user take? What if this process
    is not interactive, and contains no user intervention?

    > 3.1 - Unfortunately we don't live in either of the two perfect worlds,
    > which makes it even worse. A database on UNIX will typically be (or
    > can be made to be) 8-bit. Therefore perfectly able to handle UTF-8
    > data. On Windows however, there is a lot of support for UTF-16, but
    > trying to work in UTF-8 could prove to be a handicap, if not close to
    > impossible.

    UTF-8 and UTF-16, used correctly, are perfectly interchangeable. It is
    not in any way a fault of UTF-16 that it cannot be used to store
    arbitrary binary data.
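    A quick round trip shows the interchangeability (a Python sketch):

```python
s = "NESTL\u00c9\u2122 \u0259 \U0001D11E"  # BMP and supplementary chars

utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")

# Either form converts losslessly to the other, because both encode
# exactly the same thing: Unicode code points.
assert utf8.decode("utf-8") == utf16.decode("utf-16-le") == s
assert utf8.decode("utf-8").encode("utf-16-le") == utf16
```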

    > 3.3 - For the record: other UTF formats CAN be made equally useful to
    > UTF-8. It requires 128 codepoints. Back in 2002, I have tried to
    > convince people on the Unicode mailing list that this should be done,
    > but have failed.

    Because it is an incredibly bad idea.

    > I am now using the PUA for this purpose. And I am even tempted to hope
    > nobody will ever realize the need for these 128 codepoints, because
    > then all my data will be non-standard.

    You *should* use the PUA for this purpose. It is an excellent
    application of the PUA. But do not be surprised if someone else,
    somewhere, decides to use the same 128 PUA code points for some other
    purpose. That does not make your data "non-standard," because all PUA
    data, by definition, is "non-standard." What you are doing with the PUA
    is far more standard, and far more interoperable, than writing invalid
    UTF-8 sequences and expecting parsers to interpret them as "undeciphered
    8-bit legacy text of some sort."
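    One way such a PUA scheme might look, as a Python sketch. The range U+E080..U+E0FF is my own hypothetical choice for the 128 byte values 0x80-0xFF; the post does not say which private-use code points Lars actually chose:

```python
import codecs

# Hypothetical PUA base: 128 slots for the undecodable bytes 0x80-0xFF.
PUA_BASE = 0xE080

def pua_escape(err):
    # Map each undecodable byte to one private-use code point.
    repl = "".join(chr(PUA_BASE + b - 0x80)
                   for b in err.object[err.start:err.end])
    return repl, err.end

codecs.register_error("pua-escape", pua_escape)

def pua_unescape(s: str) -> bytes:
    # Reverse the mapping when writing the data back out.
    out = bytearray()
    for c in s:
        cp = ord(c)
        if PUA_BASE <= cp < PUA_BASE + 0x80:
            out.append(cp - PUA_BASE + 0x80)
        else:
            out += c.encode("utf-8")
    return bytes(out)

text = b"caf\xe9".decode("utf-8", errors="pua-escape")
assert text == "caf\ue0e9"               # 0xE9 carried as U+E0E9
assert pua_unescape(text) == b"caf\xe9"  # original bytes recoverable
```

    The result is valid Unicode throughout: any conforming UTF-8, UTF-16, or UTF-32 process will carry the PUA characters intact, which is exactly the interoperability that invalid byte sequences forfeit.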

    > 4.1 - UTF-32 is probably very useful for certain string operations.
    > Changing case for example. You can do it in-place, like you could
    > with ASCII. Perhaps it can even be done in UTF-8, I am not sure. But
    > even if it is possible today, it is definitely not guaranteed that it
    > will always remain so, so one shouldn't rely on it.

    Not only is this not 100% true, as others have pointed out, but it is
    completely irrelevant to your other points.
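    For one thing, full case mapping is not even length-preserving, so "in-place" conversion fails regardless of encoding form. A Python illustration:

```python
# One code point can uppercase to two, so the string grows even in
# fixed-width UTF-32 -- in-place case conversion is not safe there.
assert "stra\u00dfe".upper() == "STRASSE"   # German ß uppercases to SS
assert len("stra\u00dfe") == 6
assert len("STRASSE") == 7
```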

    > 4.2 - But UTF-8 is superior. You can make UTF-8 functions ignore
    > invalid sequences and preserve them. But as soon as you convert UTF-8
    > to anything else, problems begin. You cannot preserve invalid
    > sequences if you convert to UTF-16 (except by using unpaired
    > surrogates). You can preserve invalid sequences when converting to
    > UTF-32, but this again means you need to use undefined values (above
    > 21 bits) in addition to modifying the functions so they do not modify
    > these values. But then again, if one is to use these values, then they
    > should be standardized. If so, why use the hyper-values, why not have
    > them in Unicode?

    This is completely wrong-headed. The whole purpose of UTF-8 is to
    represent valid Unicode code points, convertible to any other Unicode
    encoding form without loss.

    > 5.1 - One could say that UTF-8 is inferior, because it has invalid
    > sequences to start with. But UTF-16 and UTF-32 also have invalid
    > sequences and/or values. The beauty of UTF-8 is that it can coexist
    > with legacy 8-bit data. One is tempted to think that all we need is to
    > know what is old and what is new and that this is also a benefit on
    > its own. But this assumption is wrong. You will always come across
    > chunks of data without any external attributes. And isn't that what
    > 'plain text' is all about? To be plain and self contained. Stateless.
    > Is UTF-16 stateless, if it needs the BOM? Is UTF-32LE stateless if we
    > need to know that it is UTF-32LE? Unfortunately we won't be able to
    > get rid of them. But I think they should not be used in data exchange.
    > And not even for storage, wherever possible. That is what I see as a
    > long term goal.

    I don't even know where to begin with this rant.

    Is it really "stateless" to store textual data in some mystery encoding,
    such that we don't know what it is, but we insist on storing it anyway,
    undeciphered, like some relic from an archaeological dig? What is the
    use of such data? Can it really be considered plain text at all?

    Is it "stateless" to redefine UTF-8 so that <C9 99> is a two-byte
    sequence that means U+0259, but <99 C9> is two characters in an unknown
    legacy encoding? How is this any more "stateless" than the UTF-32LE
    example above?

    Sorry to be so harsh and sound so personal. Frankly, I am amazed that
    in the 39 hours since your message was posted, this mailing list has not
    become absolutely flooded with responses explaining what a truly bad
    idea this is.

    -Doug Ewell
     Fullerton, California

    This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 01:53:55 CST