RE: Nicest UTF

From: Lars Kristan (
Date: Fri Dec 03 2004 - 07:45:37 CST

    Theodore H. Smith wrote:

    > What would be the nicest UTF to use?
    > I think UTF8 would be the nicest UTF.

    I agree. But not for the reasons you mentioned. There is one other important
    advantage: data stored as UTF-8 can also hold invalid sequences unchanged.
    I will need to elaborate on that, of course.

    1.1 - Let's suppose a perfect world where we decided to have only UTF-16
    (perfect in its simplicity, not strategy). You have various 8-bit data from
    the non-perfect past. Any data for which the encoding is known is converted
    to Unicode. Any errors (invalid sequences, unmappable values) are replaced
    with U+FFFD and logged or reported.
    1.2 - Any data for which encoding is not known can only be stored in a
    UTF-16 database if it is converted. One needs to choose a conversion (say
    Latin-1, since it is trivial). When a user finds out that the result is not
    appealing, the data needs to be converted back to the original 8-bit
    sequence and then the user (or an algorithm) can try various encodings until
    the result is appealing.
    1.3 - One is tempted to use a heuristic algorithm right from the start. But
    if it makes a wrong decision, you will first have to guess what it chose in
    order to undo it, and only then can you start searching for the correct
    conversion.
    1.4 - I am assuming that storing the history of what was done is not
    possible or is impractical. There are cases where this assumption is more
    than valid. In addition to 1.3, there is an even more general problem. You
    don't know which data was converted using a good hint and which was
    converted using the default conversion. Once converted, this latter data may
    seem correct at first if the conversion affected only a few characters.
    1.5 - A better choice for the default conversion would be to use the UTF-8
    to UTF-16 conversion. If the data is really UTF-8, then we've got what we
    wanted. For anything else a lot of data will be lost (converted to many
    useless U+FFFD characters).
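
    To illustrate 1.2 and 1.5, here is a minimal Python sketch. The sample
    bytes are an assumption (text in some unknown 8-bit encoding); the codec
    names are those of the Python standard library.

```python
# Hypothetical legacy data in an unknown 8-bit encoding (assumed sample).
raw = b"caf\xe9 na\xefve"

# 1.2: Latin-1 maps every byte 0x00-0xFF to U+0000-U+00FF, so the conversion
# never fails and can always be undone exactly.
as_latin1 = raw.decode("latin-1")
restored = as_latin1.encode("latin-1")

# 1.5: a UTF-8 conversion with replacement turns each invalid byte into
# U+FFFD, and the original byte values are no longer recoverable.
as_utf8 = raw.decode("utf-8", errors="replace")
lost = as_utf8.count("\ufffd")
reencoded = as_utf8.encode("utf-8")

print(restored == raw)   # Latin-1 round trip is lossless
print(lost)              # number of bytes replaced by U+FFFD
print(reencoded == raw)  # UTF-8 with replacement is not reversible
```

    The Latin-1 path keeps every byte but gives no hint that the text may be
    wrong; the UTF-8 path flags the problem but destroys the data.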

    2.1 - In a better perfect world, we decide to have only UTF-8. Any data for
    which the encoding is known is converted to Unicode. Any errors (invalid
    sequences, unmappable values) are marked with U+FFFD and logged or reported.
    This is the same as in the first world, except that UTF-8 is used to store
    the Unicode data.
    2.2 - Any data for which encoding is not known can simply be stored as-is.
    2.3 - Again, it is not advisable to attempt to determine the encoding,
    unless this process is made very reliable. Typically, this can be achieved
    with larger chunks of data, but may be impossible on small chunks, even if
    the process is human-assisted.
    2.4 - Any data that was stored as-is may contain invalid sequences, but
    these are stored as such, in their original form. Therefore, it is possible
    to raise an exception (alert) when the data is retrieved. This warns the
    user that additional caution is needed. That was not possible in 1.4.
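
    A minimal sketch of 2.2 and 2.4 in Python, assuming the database simply
    hands back the stored bytes. The function name retrieve() is hypothetical
    and only serves the illustration.

```python
def retrieve(stored: bytes):
    """Return (text, is_valid) for bytes that were stored as-is.

    The stored bytes are never modified; validation happens on retrieval,
    so the caller can be alerted when the data is not valid UTF-8.
    """
    try:
        return stored.decode("utf-8"), True
    except UnicodeDecodeError:
        # Only the displayed text uses U+FFFD; the stored bytes are untouched.
        return stored.decode("utf-8", errors="replace"), False


good = "caf\u00e9".encode("utf-8")  # valid UTF-8
bad = b"caf\xe9"                    # legacy byte, invalid as UTF-8

print(retrieve(good))
print(retrieve(bad))
```

    Because the original bytes are still in storage, a different encoding
    can be tried later without first undoing an earlier guess.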

    3.1 - Unfortunately we don't live in either of the two perfect worlds, which
    makes it even worse. A database on UNIX will typically be (or can be made to
    be) 8-bit, and is therefore perfectly able to handle UTF-8 data. On Windows,
    however, there is a lot of support for UTF-16, but trying to work in UTF-8
    could prove to be a handicap, if not close to impossible.
    3.2 - Adding more UTF-8 support to Windows is of course the right thing to
    do. But that takes time. And it just opens the possibility for everyone to
    make use of the superior UTF-8 format.
    3.3 - For the record: other UTF formats CAN be made as useful as UTF-8.
    It requires 128 codepoints. Back in 2002, I tried to convince people on
    the Unicode mailing list that this should be done, but failed. I am now
    using the PUA for this purpose. And I am even tempted to hope nobody will
    ever realize the need for these 128 codepoints, because then all my data
    will be non-standard.
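
    The idea in 3.3 can be sketched in Python as follows. This is only an
    illustration: the base value PUA_BASE and all function names are
    hypothetical, and the actual PUA range used is not stated here. Each
    invalid byte 0x80-0xFF is mapped to one of 128 codepoints and mapped
    back when encoding.

```python
import codecs

# Hypothetical base of the 128-codepoint block; any PUA range would do.
PUA_BASE = 0xE000


def _escape_to_pua(exc):
    """Decode error handler: map each invalid byte to PUA_BASE + (byte - 0x80)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]  # invalid UTF-8 bytes are always >= 0x80
    return "".join(chr(PUA_BASE + b - 0x80) for b in bad), exc.end


codecs.register_error("pua_escape", _escape_to_pua)


def decode_preserving(data: bytes) -> str:
    """UTF-8 to text, keeping invalid bytes as PUA codepoints."""
    return data.decode("utf-8", errors="pua_escape")


def encode_restoring(text: str) -> bytes:
    """Text to UTF-8, turning the 128 PUA codepoints back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE <= cp < PUA_BASE + 128:
            out.append(cp - PUA_BASE + 0x80)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"ok \xe9\xff \xc3\xa9"
text = decode_preserving(raw)
print(encode_restoring(text) == raw)
```

    The sketch also shows why the PUA is only a workaround: if the input
    contains a validly encoded character from the same range, it is
    indistinguishable from an escaped byte and will not round-trip.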

    4.1 - UTF-32 is probably very useful for certain string operations. Changing
    case for example. You can do it in-place, like you could with ASCII. Perhaps
    it can even be done in UTF-8, I am not sure. But even if it is possible
    today, it is definitely not guaranteed that it will always remain so, so one
    shouldn't rely on it.
    4.2 - But UTF-8 is superior. You can make UTF-8 functions ignore invalid
    sequences and preserve them. But as soon as you convert UTF-8 to anything
    else, problems begin. You cannot preserve invalid sequences if you convert
    to UTF-16 (except by using unpaired surrogates). You can preserve invalid
    sequences when converting to UTF-32, but this again means you need to use
    undefined values (above 21 bits) in addition to modifying the functions so
    they do not modify these values. But then again, if one is to use these
    values, then they should be standardized. If so, why use the hyper-values,
    why not have them in Unicode?
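
    For 4.2, Python happens to offer a ready-made example of the "unpaired
    surrogates" approach: the standard "surrogateescape" error handler maps
    each invalid byte 0x80-0xFF to U+DC80-U+DCFF. A minimal sketch, with
    assumed sample data:

```python
raw = b"valid \xc3\xa9, invalid \xe9"

# The invalid byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# String functions pass the escaped value through unchanged.
upper = text.upper()
back = upper.encode("utf-8", errors="surrogateescape")

# A strict UTF-16 conversion refuses the lone surrogate ...
try:
    text.encode("utf-16-le")
    strict_utf16_ok = True
except UnicodeEncodeError:
    strict_utf16_ok = False

# ... and it survives only as an unpaired surrogate code unit.
utf16 = text.encode("utf-16-le", errors="surrogatepass")

print(back)
print(strict_utf16_ok)
```

    The result is not well-formed UTF-16, which is exactly the problem:
    the invalid sequence is preserved only by leaving the standard.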

    5.1 - One could say that UTF-8 is inferior, because it has invalid sequences
    to start with. But UTF-16 and UTF-32 also have invalid sequences and/or
    values. The beauty of UTF-8 is that it can coexist with legacy 8-bit data.
    One is tempted to think that all we need is to know what is old and what is
    new and that this is also a benefit on its own. But this assumption is
    wrong. You will always come across chunks of data without any external
    attributes. And isn't that what 'plain text' is all about? To be plain and
    self-contained. Stateless. Is UTF-16 stateless, if it needs the BOM? Is
    UTF-32LE stateless if we need to know that it is UTF-32LE? Unfortunately we
    won't be able to get rid of them. But I think they should not be used in
    data exchange. And not even for storage, wherever possible. That is what I
    see as a long term goal.
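
    The statelessness point in 5.1 can be illustrated with a short Python
    sketch (the sample string is an assumption). The same text has one
    UTF-8 byte sequence, but two UTF-16 byte sequences depending on byte
    order, and reading with the wrong order silently yields different text.

```python
text = "A\u00e9"

utf8 = text.encode("utf-8")           # one byte sequence, no byte order
utf16_le = text.encode("utf-16-le")   # little-endian code units
utf16_be = text.encode("utf-16-be")   # big-endian code units
utf16_bom = text.encode("utf-16")     # BOM prepended, native byte order

# Without external knowledge (or a BOM) the wrong guess does not fail,
# it just produces other characters.
misread = utf16_le.decode("utf-16-be")

print(utf8)
print(utf16_le, utf16_be)
print(utf16_bom[:2])
print(misread == text)
```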

    > Its too bad MicroSoft and Apple didn't realise the same, before they
    > made their silly UCS-2 APIs.

    I think UTF-8 didn't exist at the time they were making the decisions. Or am
    I wrong?
