RE: Nicest UTF

From: Lars Kristan (lars.kristan@hermes.si)
Date: Mon Dec 06 2004 - 08:10:06 CST

    Doug Ewell wrote:
    > Lars Kristan wrote:
    >
    > >> I think UTF8 would be the nicest UTF.
    > >
    > > I agree. But not for reasons you mentioned. There is one other
    > > important advantage: UTF-8 is stored in a way that permits storing
    > > invalid sequences. I will need to elaborate that, of course.
    >
    > I could not disagree more with the basic premise of Lars' post. It
    > is a fundamental and critical mistake to try to "extend" Unicode
    > with non-standard code unit sequences to handle data that cannot be,
    > or has not been, converted to Unicode from a legacy standard. This
    > is not what any character encoding standard is for.
    What a standard is or is not for is a decision, and the Unicode
    Consortium is definitely the body that makes that decision in this case.
    But the decision should not be based solely on theory and ideal worlds.

    >
    > > 1.2 - Any data for which encoding is not known can only be stored
    > > in a UTF-16 database if it is converted. One needs to choose a
    > > conversion (say Latin-1, since it is trivial). When a user finds
    > > out that the result is not appealing, the data needs to be
    > > converted back to the original 8-bit sequence and then the user
    > > (or an algorithm) can try various encodings until the result is
    > > appealing.
    >
    > This is simply what you have to do. You cannot convert the data into
    > Unicode in a way that says "I don't know how to convert this data
    > into Unicode." You must either convert it properly, or leave the
    > data in its original encoding (properly marked, preferably).
    Here lies the problem. Suppose you have a document in UTF-8 which
    somehow got corrupted and now contains a single invalid sequence. Are
    you proposing that this document needs to be stored separately?
    Everything else in the database would be stored in UTF-16, but now one
    must add the capability to store this document separately, and probably
    not index it, regardless of any useful data in it. But if you use UTF-8
    storage instead, you can put it in with the rest (if you can mark it,
    even better, but you only need to do that if it is a requirement).

    >
    > It is just as if a German speaker wanted to communicate a word or
    > phrase in French that she did not understand. She could find the
    > correct German translation and use that, or she could use the French
    > word or phrase directly (moving the translation burden onto the
    > listener). What she cannot do is "extend" German by creating special
    > words that are placeholders for French words whose meaning she does
    > not know.
    I can reinterpret your example. Using the French word directly is
    exactly the solution I am proposing, and as I see it, your solution is
    to replace the word with a placeholder which says "a word that does not
    exist in German". Even worse, you want to use the same placeholder for
    all the unknown words. Numbering them would be better, but awkward,
    since you don't know how to assign the numbers. Fortunately, with bytes
    in invalid sequences, the numbering is trivial and has a meaning.

    Let's compare UTF-8 to UTF-16 conversion to an automated translation
    from German to French. What the Unicode standard says can be interpreted
    as follows:
    * All input text must be valid German.
    * All output text must be valid French.
    * Any unknown words shall be replaced by a (single) 'unknown word'
    placeholder.

    And that last statement goes for German words missing from your
    dictionary, misspelled words, Spanish words, proper nouns...

    >
    > > 2.2 - Any data for which encoding is not known can simply be stored
    > > as-is.
    >
    > NO. Do not do this, and do not encourage others to do this. It is
    > not valid UTF-8.
    I never said it is valid UTF-8. The fact remains that I can store legacy
    data in the same store as UTF-8 data, but cannot do that if the storage
    is UTF-16 based.

    Now suppose you have a UNIX filesystem containing filenames in a legacy
    encoding (possibly even more than one). If one wants to switch to UTF-8
    filenames, what is one supposed to do? Convert all filenames to UTF-8?
    Who will do that? And when? Will all users agree? Should all filenames
    that do not conform to UTF-8 be declared invalid? And those files
    inaccessible? If you keep all processing in UTF-8, then this is a
    decision you can postpone. But if you start using UTF-32 applications
    for processing filenames, invalid sequences will be dropped and those
    files can in fact become inaccessible. And then you'll be wondering why
    users don't want to start using Unicode.

    I didn't encourage users to mix UTF-8 filenames and Latin-1 filenames.
    Do you want to discourage them?
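
    To make the failure mode concrete, here is a minimal Python sketch (the
    filename is a made-up Latin-1 example; the os.listdir call is only shown
    to contrast byte-oriented access):

        import os

        # A file created under a Latin-1 locale: "NESTLÉ.rtf" stores É as
        # the single byte 0xC9, an invalid sequence under UTF-8 rules.
        legacy_name = b'NESTL\xc9.rtf'

        try:
            legacy_name.decode('utf-8')          # strict UTF-8 validation
        except UnicodeDecodeError as err:
            # The file still exists on disk, but a strict UTF-8 tool
            # cannot even render its name.
            print('not valid UTF-8:', err)

        # Byte-oriented (UTF-8 style) processing keeps every file
        # reachable: os.listdir(b'.') returns raw bytes and drops nothing.
        names = os.listdir(b'.')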

    >
    > Among other things, you run the risk that the mystery data happens
    > to form a valid UTF-8 sequence, by sheer coincidence. The example of
    > "NESTLÉ™" in Windows CP1252 is applicable here. The last two bytes
    > are C9 99, a valid UTF-8 sequence for U+0259. By applying the
    > concept of "adaptive UTF-8" (as Dan Oscarsson called it in 1998),
    > this sequence would be interpreted as valid UTF-8, and data loss
    > would occur.
    I am well aware of that risk. But all you risk is losing the 'late
    detection' of what happened; the data itself has not actually been lost
    in this case. It is, however, lost in all other cases, for example in
    NESTLÉ followed by a space. That is, if for whatever reason this byte
    sequence happens to be in a UTF-8 encoded text, or if you named your
    file NESTLÉ.rtf and then switched your locale to UTF-8.
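
    Both cases can be checked in a couple of lines of Python (the byte
    values come straight from the CP1252 example above):

        # CP1252 bytes for "NESTLÉ™": É = 0xC9, ™ = 0x99.
        print(b'NESTL\xc9\x99'.decode('utf-8'))
        # -> 'NESTLə': C9 99 happens to be valid UTF-8 for U+0259,
        #    so the misreading goes completely undetected.

        # CP1252 bytes for "NESTLÉ ": É = 0xC9 followed by a space.
        print(b'NESTL\xc9 '.decode('utf-8', errors='replace'))
        # -> 'NESTL\ufffd ': the É becomes U+FFFD and the original
        #    byte is gone for good.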

    >
    > > 2.4 - Any data that was stored as-is may contain invalid
    > > sequences, but these are stored as such, in their original form.
    > > Therefore, it is possible to raise an exception (alert) when the
    > > data is retrieved. This warns the user that additional caution is
    > > needed. That was not possible in 1.4.
    >
    > This is where the fatal mistake is made. No matter what Unicode
    > encoding form is used, its entire purpose is to encode *Unicode code
    > points*, not to implement a two-level scheme that supports both
    > Unicode
    The ability to store code points (and to define all the code points
    that we need) is one thing. Detecting corrupted data, and trying to get
    rid of legacy text for which the encoding is not known, is another. The
    two goals should not be tied to one another. Especially not since we
    cannot be sure that goal #2 can be achieved at all, nor that everybody
    agrees it is a goal in the first place. IMHO, preserving data is more
    important, but so far it seems it is not a goal at all, with a simple
    argument: that Unicode only defines how to process Unicode data.
    Understandably so, but this doesn't mean it needs to remain so.

    > and non-Unicode data. What sort of "exception" is to be raised? What
    > sort of "additional caution" should the user take? What if this
    > process is not interactive, and contains no user intervention?
    Precisely the same questions I need to ask you about the process of
    converting the data. You are saying that data that cannot be converted
    should not be converted. Then what if THIS process is not interactive?
    What if there is no manpower that could ever make it interactive? Do we
    simply resort to conversion with data loss?
    So, it depends on the application. If you are compiling articles to
    create a book, then you probably should do it on the input side and
    provide reliable output. But in many cases it is more likely that the
    retrieval process is interactive and that the storing process is not.
    Think search engines, archiving software and such.

    >
    > > 3.1 - Unfortunately we don't live in either of the two perfect
    > > worlds, which makes it even worse. A database on UNIX will
    > > typically be (or can be made to be) 8-bit. Therefore perfectly
    > > able to handle UTF-8 data. On Windows however, there is a lot of
    > > support for UTF-16, but trying to work in UTF-8 could prove to be
    > > a handicap, if not close to impossible.
    >
    > UTF-8 and UTF-16, used correctly, are perfectly interchangeable. It
    > is not in any way a fault of UTF-16 that it cannot be used to store
    > arbitrary binary data.
    According to the requirements that were taken into account when
    designing UTF-16, no, it is not a fault. The fact remains that UTF-8
    can coexist with binary data. If anyone finds that to be an advantage,
    then they might well see this property of UTF-16 as a fault.

    >
    > > 3.3 - For the record: other UTF formats CAN be made equally useful
    > > to UTF-8. It requires 128 codepoints. Back in 2002, I have tried
    > > to convince people on the Unicode mailing list that this should be
    > > done, but have failed.
    >
    > Because it is an incredibly bad idea.
    Until proven otherwise.

    >
    > > I am now using the PUA for this purpose. And I am even tempted to
    > > hope nobody will ever realize the need for these 128 codepoints,
    > > because then all my data will be non-standard.
    >
    > You *should* use the PUA for this purpose. It is an excellent
    > application of the PUA. But do not be surprised if someone else,
    > somewhere, decides to use the same 128 PUA code points for some other
    > purpose. That does not make your data "non-standard," because all PUA
    > data, by definition, is "non-standard."
    So we agree it is "non-standard", even now, and that there are risks
    involved. Hence, this is not a good solution. Unfortunately, no other
    solution keeps the data in the same store, allowing it to be indexed as
    a whole. It is the best (I hope) solution given the current
    limitations, and it is simple and efficient. Why look for other
    solutions that will ultimately yield the same result, but will take a
    lot of effort to get around the consequences of the fact that someone
    once said that Unicode is not supposed to assist in solving this
    problem? Once that is realized, the logical next step is to get rid of
    the remaining problems. That is, to move from the PUA to the BMP. And
    bite the sour apple of the security consequences.

    Though, you have a serious security issue even with your design. Two
    strings may compare different in UTF-8, but the same once converted to
    UTF-16 (assuming they contained invalid sequences). So, you should
    validate any UTF-8 data before processing it. A nasty prerequisite. I
    thought all UTFs were equal... If you adopt the 128 codepoints I am
    proposing, the two strings would compare different in UTF-16 as well. I
    am not saying that my approach doesn't introduce any new security
    considerations, but it is interesting that it also solves some.
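
    A small Python illustration of that collision, using errors='replace'
    to stand in for a converter that substitutes U+FFFD for invalid
    sequences:

        a = b'user\xc0admin'   # two byte strings differing only in an
        b = b'user\xc1admin'   # invalid byte (C0 vs C1)

        print(a == b)                          # False: distinct in UTF-8
        print(a.decode('utf-8', 'replace') ==
              b.decode('utf-8', 'replace'))    # True: both collapse to
                                               # 'user\ufffdadmin' after
                                               # conversion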

    > What you are doing with the PUA is far more standard, and far more
    > interoperable, than writing invalid UTF-8 sequences and expecting
    > parsers to interpret them as "undeciphered 8-bit legacy text of some
    > sort."
    Well, you may have a wrong assumption here. You probably think that I
    convert invalid sequences into PUA characters and keep them as such in
    UTF-8. That is not the case. Any invalid sequences in UTF-8 are left as
    they are. If they need to be converted to UTF-16, then the PUA is used.
    If they are then converted back to UTF-8, they are converted back to
    their original bytes, hence the invalid sequences are re-created.
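
    A minimal Python sketch of such a roundtrip. The PUA range
    U+E080..U+E0FF used here is my own illustrative assumption, not an
    official assignment, and a real scheme would also have to handle PUA
    characters already present in the input:

        PUA_BASE = 0xE080   # hypothetical: byte 0x80 -> U+E080, ..., 0xFF -> U+E0FF

        def decode_preserving(data: bytes) -> str:
            """UTF-8 decode that maps bytes of invalid sequences to the PUA."""
            out, i = [], 0
            while i < len(data):
                # Try to decode a chunk starting at i; shrink until valid.
                for n in (4, 3, 2, 1):
                    try:
                        out.append(data[i:i + n].decode('utf-8'))
                        i += n
                        break
                    except UnicodeDecodeError:
                        pass
                else:
                    # No valid sequence starts here: keep the raw byte as PUA.
                    out.append(chr(PUA_BASE + (data[i] - 0x80)))
                    i += 1
            return ''.join(out)

        def encode_preserving(text: str) -> bytes:
            """Inverse: turn the PUA stand-ins back into their original bytes."""
            out = bytearray()
            for ch in text:
                cp = ord(ch)
                if PUA_BASE <= cp <= PUA_BASE + 0x7F:
                    out.append(0x80 + (cp - PUA_BASE))   # restore raw byte
                else:
                    out.extend(ch.encode('utf-8'))
            return bytes(out)

        raw = b'NESTL\xc9 .rtf'   # a Latin-1 É inside otherwise ASCII text
        assert encode_preserving(decode_preserving(raw)) == raw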

    The other name for this is roundtripping. Currently, Unicode allows a
    roundtrip UTF-16=>UTF-8=>UTF-16 for any data. But there are several
    reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more valuable, even
    if it means that the other roundtrip is no longer guaranteed:
    * Legacy data is more likely to creep into UTF-8 data than into any
    other UTF.
    * It is increasingly evident that UTF-8 is the most likely candidate
    for any future storage and data exchange.

    One more example of data loss that arises from your approach. If a
    single bit is changed in UTF-16 or UTF-32, that is all that will happen
    (in more than 99% of the cases). If a single bit changes in UTF-8, you
    risk that the entire character will be dropped or replaced with U+FFFD.
    But, funnily enough, only if it ever gets converted to UTF-16 or
    UTF-32. Not that this is a major problem on its own, but it indicates
    that there is something fishy in there.
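
    A quick Python check of the effect (the flipped bit position is
    arbitrary):

        text = 'naïve'
        utf8 = bytearray(text.encode('utf-8'))   # ï is the two bytes C3 AF

        utf8[2] ^= 0x40   # one bit error in the lead byte: C3 -> 83,
                          # a stray continuation byte
        print(bytes(utf8).decode('utf-8', 'replace'))
        # -> 'na\ufffd\ufffdve': a single flipped bit, and the character
        #    survives only as replacement characters once the data is
        #    converted.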

    There was a discussion on nul characters not so long ago. Many text
    editors do not properly preserve nul characters in text files, but it
    is definitely a nice thing if they do. While preserving nul characters
    only has limited value, preserving invalid sequences in text files
    could be crucial. A UTF-8 based editor can easily do this; a UTF-16
    based editor cannot do it at all. If you say that UTF-16 is not
    intended for such a purpose, then so be it. But this also means that
    UTF-8 is superior. You can say that this is no longer UTF-8, since it
    is misused. But that doesn't make a difference: if it is superior
    because it can be misused, it is still superior.

    >
    > > 4.1 - UTF-32 is probably very useful for certain string
    > > operations. Changing case for example. You can do it in-place,
    > > like you could with ASCII. Perhaps it can even be done in UTF-8, I
    > > am not sure. But even if it is possible today, it is definitely
    > > not guaranteed that it will always remain so, so one shouldn't
    > > rely on it.
    >
    > Not only is this not 100% true, as others have pointed out, but it is
    > completely irrelevant to your other points.
    Yes, it is not related much, except that I was trying to see whether
    UTF-32 is needed at all. If one can do everything in UTF-8, then
    invalid sequences can be preserved. If, however, certain things can
    only be done by converting to UTF-32, then the application is again
    risking data loss. That can be avoided, but extra care needs to be
    taken. And if the way to do it were standardized (and hopefully also
    simplified), it would help a lot.
    >
    > > 4.2 - But UTF-8 is superior. You can make UTF-8 functions ignore
    > > invalid sequences and preserve them. But as soon as you convert
    > > UTF-8 to anything else, problems begin. You cannot preserve
    > > invalid sequences if you convert to UTF-16 (except by using
    > > unpaired surrogates). You can preserve invalid sequences when
    > > converting to UTF-32, but this again means you need to use
    > > undefined values (above 21 bits) in addition to modifying the
    > > functions so they do not modify these values. But then again, if
    > > one is to use these values, then they should be standardized. If
    > > so, why use the hyper-values, why not have them in Unicode?
    >
    > This is completely wrong-headed. The whole purpose of UTF-8 is to
    > represent valid Unicode code points, convertible to any other
    > encoding form.
    I agree. Again, this is a question of requirements. Current
    requirements have been met, no doubt about that. I am simply saying
    that new requirements should be added, or at least considered. But
    fairly, not rejected simply because they were not considered before.

    >
    > > 5.1 - One could say that UTF-8 is inferior, because it has invalid
    > > sequences to start with. But UTF-16 and UTF-32 also have invalid
    > > sequences and/or values. The beauty of UTF-8 is that it can
    > > coexist with legacy 8-bit data. One is tempted to think that all
    > > we need is to know what is old and what is new and that this is
    > > also a benefit on its own. But this assumption is wrong. You will
    > > always come across chunks of data without any external attributes.
    > > And isn't that what 'plain text' is all about? To be plain and
    > > self contained. Stateless. Is UTF-16 stateless, if it needs the
    > > BOM? Is UTF-32LE stateless if we need to know that it is UTF-32LE?
    > > Unfortunately we won't be able to get rid of them. But I think
    > > they should not be used in data exchange. And not even for
    > > storage, wherever possible. That is what I see as a long term
    > > goal.
    >
    > I don't even know where to begin with this rant.
    >
    > Is it really "stateless" to store textual data in some
    > mystery encoding,
    > such that we don't know what it is, but we insist on storing
    > it anyway,
    > undeciphered, like some relic from an archaeological dig? What is the
    > use of such data? Can it really be considered plain text at all?
    Still much better than rejecting the data. Not that rejecting the data
    isn't a valid option sometimes. But currently it is not merely an
    option; it is the only thing that can be done.

    >
    > Is it "stateless" to redefine UTF-8 so that <C9 99> is a two-byte
    > sequence that means U+0259, but <99 C9> is two characters in
    > an unknown
    > legacy encoding? How is this any more "stateless" than the UTF-32LE
    > example above?
    >
    > Sorry to be so harsh and sound so personal. Frankly, I am amazed
    > that in the 39 hours since your message was posted, this mailing
    > list has not become absolutely flooded with responses explaining
    > what a truly bad idea this is.
    I hope it is because it is not as bad as you still think.

    >
    > -Doug Ewell
    > Fullerton, California
    > http://users.adelphia.net/~dewell/
    >
    >

    Lars


