From: Lars Kristan (firstname.lastname@example.org)
Date: Fri Dec 03 2004 - 07:45:37 CST
Theodore H. Smith wrote:
> What would be the nicest UTF to use?
> I think UTF8 would be the nicest UTF.
I agree. But not for the reasons you mentioned. There is one other important
advantage: UTF-8 is stored in a form that permits keeping invalid sequences intact.
I will need to elaborate on that, of course.
1.1 - Let's suppose a perfect world where we decided to have only UTF-16
(perfect in its simplicity, not strategy). You have various 8-bit data from
the non-perfect past. Any data for which the encoding is known is converted
to Unicode. Any errors (invalid sequences, unmappable values) are replaced
with U+FFFD and logged or reported.
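In Python terms, such a conversion could look roughly like this (the function
name and the reporting are mine, purely for illustration):

    def to_unicode_with_report(raw, encoding):
        text = raw.decode(encoding, errors="replace")   # anything invalid becomes U+FFFD
        bad = text.count("\ufffd")
        if bad:
            print("warning: %d replacement character(s) U+FFFD inserted" % bad)
        return text

    to_unicode_with_report(b"caf\x81", "utf-8")   # stray 0x81 -> 'caf\ufffd', plus a warning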
1.2 - Any data for which encoding is not known can only be stored in a
UTF-16 database if it is converted. One needs to choose a conversion (say
Latin-1, since it is trivial). When a user finds out that the result is not
appealing, the data needs to be converted back to the original 8-bit
sequence and then the user (or an algorithm) can try various encodings until
the result is appealing.
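The reason Latin-1 works as the trivial default is that it maps every byte
0x00-0xFF straight to U+0000-U+00FF, so the conversion can always be undone and
another encoding tried. A small Python illustration (the byte values are just an
example):

    raw = b"\xc4\x8d\xc5\xbe"              # unknown 8-bit data
    stored = raw.decode("latin-1")          # what goes into the UTF-16 database
    recovered = stored.encode("latin-1")    # back to the original bytes, losslessly
    assert recovered == raw
    print(recovered.decode("utf-8"))        # second guess: it was UTF-8 after all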
1.3 - One is tempted to use a heuristic algorithm right from the start. But
if it makes a wrong decision, you first have to guess which conversion it
chose in order to undo it, and only then can you start searching for the
correct one.
1.4 - I am assuming that storing the history of what was done is not
possible or is impractical. There are cases where this assumption is more
than valid. In addition to 1.3, there is an even more general problem. You
don't know which data was converted using a good hint and which was
converted using the default conversion. Once converted, this latter data may
seem correct at first if the conversion affected only a few characters.
1.5 - A better choice for the default conversion would be to use the UTF-8
to UTF-16 conversion. If the data is really UTF-8, then we've got what we
wanted. For anything else a lot of data will be lost (converted to many
useless U+FFFD characters).
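In Python terms (the sample bytes are mine; any ISO-8859-2 text would do):

    utf8_bytes   = b"\xc4\x8da"             # "ca" with a caron, in UTF-8
    latin2_bytes = b"\xe8a\xbee"            # accented text in ISO-8859-2
    print(utf8_bytes.decode("utf-8", "replace"))    # recovered as intended
    print(latin2_bytes.decode("utf-8", "replace"))  # '\ufffda\ufffde' - the accents are gone for good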
2.1 - In a better perfect world, we decide to have only UTF-8. Any data for
which the encoding is known is converted to Unicode. Any errors (invalid
sequences, unmappable values) are marked with U+FFFD and logged or reported.
This is the same as in the first world, except that UTF-8 is used to store
the Unicode data.
2.2 - Any data for which encoding is not known can simply be stored as-is.
2.3 - Again, it is not advisable to attempt to determine the encoding,
unless this process is made very reliable. Typically, this can be achieved
with larger chunks of data, but may be impossible on small chunks, even if
the process is human-assisted.
2.4 - Any data that was stored as-is may contain invalid sequences, but
these are stored as such, in their original form. Therefore, it is possible
to raise an exception (alert) when the data is retrieved. This warns the
user that additional caution is needed. That was not possible in 1.4.
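A rough sketch of such a retrieval-time alert (the function name is mine, for
illustration only):

    def fetch_with_alert(raw):
        try:
            raw.decode("utf-8")             # validation only; the result is discarded
        except UnicodeDecodeError as e:
            print("alert: stored data contains invalid UTF-8 (%s)" % e)
        return raw                          # the original bytes, invalid sequences and all

    fetch_with_alert(b"ok \xfe\xff")        # prints the alert, returns the bytes untouched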
3.1 - Unfortunately we don't live in either of the two perfect worlds, which
makes things even worse. A database on UNIX will typically be (or can be made
to be) 8-bit, and therefore perfectly able to handle UTF-8 data. On Windows,
however, there is a lot of support for UTF-16, but trying to work in UTF-8
can prove to be a handicap, if not close to impossible.
3.2 - Adding more UTF-8 support to Windows is of course the right thing to
do. But that takes time. And it just opens the possibility for everyone to
make use of the superior UTF-8 format.
3.3 - For the record: other UTF formats CAN be made as useful as UTF-8. It
requires 128 codepoints. Back in 2002 I tried to convince people on the
Unicode mailing list that this should be done, but failed. I am now using
the PUA for this purpose. And I am even tempted to hope that nobody will
ever realize the need for these 128 codepoints, because if they do, all my
data will become non-standard.
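To show what I mean, here is a rough Python sketch of the 128-codepoint idea.
The U+E080..U+E0FF block is only a placeholder I picked for this example (it
lies in the PUA, but it is not a proposal for specific codepoints):

    BASE = 0xE080                                        # placeholder PUA block, 128 codepoints

    def decode_preserving(raw):
        out, i = [], 0
        while i < len(raw):
            for n in (4, 3, 2, 1):                       # try the longest valid UTF-8 chunk first
                try:
                    out.append(raw[i:i + n].decode("utf-8"))
                    i += n
                    break
                except UnicodeDecodeError:
                    continue
            else:                                        # nothing valid here: preserve the byte
                out.append(chr(BASE + (raw[i] - 0x80)))
                i += 1
        return "".join(out)

    def encode_preserving(text):
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if BASE <= cp < BASE + 0x80:
                out.append(0x80 + (cp - BASE))           # a preserved byte goes back as-is
            else:
                out.extend(ch.encode("utf-8"))
        return bytes(out)

    data = b"mixed \xc4\x8d and broken \xff bytes"
    assert encode_preserving(decode_preserving(data)) == data

Genuine PUA characters from that same block would of course collide with the
preserved bytes, which is exactly why I would prefer standardized codepoints.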
4.1 - UTF-32 is probably very useful for certain string operations. Changing
case for example. You can do it in-place, like you could with ASCII. Perhaps
it can even be done in UTF-8, I am not sure. But even if it is possible
today, it is definitely not guaranteed that it will always remain so, so one
shouldn't rely on it.
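As a fixed-width illustration (ASCII-only case change; real Unicode case mapping
can change the number of characters, so take this only as a sketch of the
in-place property):

    from array import array
    buf = array("I", (ord(c) for c in "hello, world"))   # UTF-32 code units
    for i, cp in enumerate(buf):
        if 0x61 <= cp <= 0x7A:                           # ASCII a-z
            buf[i] = cp - 0x20                           # overwrite in place
    print("".join(map(chr, buf)))                        # HELLO, WORLD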
4.2 - But UTF-8 is superior. You can make UTF-8 functions ignore invalid
sequences and preserve them. But as soon as you convert UTF-8 to anything
else, problems begin. You cannot preserve invalid sequences if you convert
to UTF-16 (except by using unpaired surrogates). You can preserve invalid
sequences when converting to UTF-32, but this again means you need to use
undefined values (above 21 bits), in addition to modifying the functions so
that they leave these values untouched. But then again, if one is to use these
values, then they should be standardized. If so, why use the hyper-values,
why not have them in Unicode?
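Incidentally, the unpaired-surrogate trick can be seen in practice in Python's
"surrogateescape" error handler, which maps each undecodable byte 0x80-0xFF to
a lone surrogate U+DC80-U+DCFF so that the exact bytes survive the round trip
(shown here only as an illustration, not as a recommendation):

    raw = b"valid \xc4\x8d plus junk \xff\xfe"
    text = raw.decode("utf-8", errors="surrogateescape")
    assert text.encode("utf-8", errors="surrogateescape") == raw
    print([hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF])   # the two junk bytes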
5.1 - One could say that UTF-8 is inferior, because it has invalid sequences
to start with. But UTF-16 and UTF-32 also have invalid sequences and/or
values. The beauty of UTF-8 is that it can coexist with legacy 8-bit data.
One is tempted to think that all we need is to know what is old and what is
new and that this is also a benefit on its own. But this assumption is
wrong. You will always come across chunks of data without any external
attributes. And isn't that what 'plain text' is all about? To be plain and
self-contained. Stateless. Is UTF-16 stateless, if it needs the BOM? Is
UTF-32LE stateless if we need to know that it is UTF-32LE? Unfortunately we
won't be able to get rid of them. But I think they should not be used in
data exchange. And not even for storage, wherever possible. That is what I
see as a long-term goal.
> Its too bad MicroSoft and Apple didn't realise the same, before they
> made their silly UCS-2 APIs.
I think UTF-8 didn't exist at the time they were making the decisions. Or am
I wrong?