RE: CESU-8 marches on

From: Lars Kristan (lars.kristan@hermes.si)
Date: Mon Dec 24 2001 - 14:03:51 EST


DougEwell2@cs.com wrote:
>
> There is also a concern that CESU-8 is really just a variation of UTF-8,
> allowing (nay, requiring) sequences that are illegal in UTF-8 but otherwise
> looking just like UTF-8. This could open security holes that the UTC has
> worked hard to close, and is continuing to close in Unicode 3.2.
>

First, I would like to thank Doug for alerting us to what I think is a
dangerous move - the continued promotion of UTF-8S, a.k.a. CESU-8.

For some time now, my focus has been on another UTF-8 variation, namely
UTF-8B. What scares me is that the two UTF-8 mutations seem mutually
exclusive to me. They both break existing rules by using illegal or
irregular sequences, and once one of the two transformations is accepted,
the other one is doomed.
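
To make the "illegal or irregular sequences" concrete on the CESU-8 side:
CESU-8 writes a supplementary character as its two UTF-16 surrogates, each
in a three-byte UTF-8-style sequence, and a strict UTF-8 decoder rejects
exactly those bytes. The following is only a minimal Python sketch under
that description, not a reference implementation; the names cesu8_encode
and is_strict_utf8 are made up for illustration.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text the CESU-8 way (illustrative helper, not a library API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary character: split into a UTF-16 surrogate pair,
            # then write each surrogate as a 3-byte UTF-8-style sequence.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


def is_strict_utf8(raw: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


supplementary = "\U00010000"
utf8_bytes = supplementary.encode("utf-8")
cesu8_bytes = cesu8_encode(supplementary)
print(utf8_bytes.hex(), is_strict_utf8(utf8_bytes))
print(cesu8_bytes.hex(), is_strict_utf8(cesu8_bytes))
```

For text confined to the BMP the two encodings produce the same bytes; they
diverge only for supplementary characters.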

I have to admit one thing though: UTF-8S only has one possible
implementation, while UTF-8B could also be achieved in some other way,
probably by reserving 128 code points, hopefully in the BMP (hear me
laughing). Unfortunately, the exceptions it would create would then no
longer be in the domain of irregular sequences, which was one of the
beauties of its original design.

Now, for those who are not familiar with UTF-8B, the intent of UTF-8B is to
guarantee the roundtrip from 8-bit data to UTF-16 (or UCS-4) and back. I
think it addresses a problem that will become more and more evident, even in
the near future. This is the so-called 'problem of illegal UTF-8 sequences'.
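
As an illustration of that round trip, the sketch below uses Python's
built-in "surrogateescape" error handler, which maps each undecodable byte
0x80-0xFF to a lone low surrogate U+DC80-U+DCFF and reverses the mapping on
encoding. It is assumed here to be close enough to the UTF-8B idea to show
the principle; it is not offered as the exact UTF-8B definition, and the
helper names are hypothetical.

```python
def bytes_to_unicode(raw: bytes) -> str:
    """Decode as UTF-8, escaping every invalid byte as U+DC80..U+DCFF."""
    return raw.decode("utf-8", "surrogateescape")


def unicode_to_bytes(text: str) -> bytes:
    """Inverse mapping: escaped surrogates become the original raw bytes."""
    return text.encode("utf-8", "surrogateescape")


valid_utf8 = "caf\u00e9".encode("utf-8")    # well-formed UTF-8
legacy_8bit = "caf\u00e9".encode("latin-1")  # b'caf\xe9', illegal as UTF-8

for raw in (valid_utf8, legacy_8bit):
    text = bytes_to_unicode(raw)
    # Show the code points and confirm that the bytes come back unchanged.
    print([hex(ord(c)) for c in text], unicode_to_bytes(text) == raw)
```

Valid UTF-8 decodes as usual, while the stray byte 0xE9 becomes U+DCE9 and
is restored on the way back.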

Why is this important? Well, UTF-16 is simple - except for a little mess
with UTF-16 vs. UCS-2, you pretty much know what you have - it's Unicode,
and if a program fails to realize that, the results are catastrophic,
immediately. Which is good.

On the other hand, 8-bit data is very tricky. It can be UTF-8 or it can be
encoded in any SBCS or MBCS codeset there is. From an armchair point of
view, it may look pretty trivial - if your editor encounters an illegal
sequence in the text file, ask the user if the file is to be interpreted as
codeset-based, right? Well, how about searching or indexing? Who can the
program ask then? How about presenting a Unix filesystem on the web (in an
html file, marked as UTF-8) or to a UTF-16 based OS, like Windows? And
filenames are very tricky indeed, because you have no way of embedding any
codeset information.

It is my belief that UTF-8 will become more and more popular on Unix. Some
day in the near future (ok, 5 or 10 years?) I expect to see 90% of the
filenames in UTF-8. Everybody will use UTF-8 as their codeset. Can somebody
explain to me how the remaining 10% will be treated? By producing errors on
open, maybe even on ls/dir?!
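
The situation can be sketched with a hypothetical directory listing in
which one name predates the switch to UTF-8. The byte strings and the
helper name below are invented for illustration; the point is only that a
strict decoder has nobody to ask and can merely fail.

```python
# Hypothetical raw filenames as a Unix filesystem would return them.
names = [
    b"report.txt",                       # plain ASCII
    "caf\u00e9.txt".encode("utf-8"),     # created under a UTF-8 locale
    "caf\u00e9.txt".encode("latin-1"),   # created under an 8-bit codeset
]


def try_strict_utf8(raw: bytes):
    """Return the decoded name, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


results = [try_strict_utf8(raw) for raw in names]
for raw, decoded in zip(names, results):
    print(raw, "->", decoded if decoded is not None else "not valid UTF-8")
```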

The only way that can be avoided is to standardize a transformation that
will guarantee that any zero-terminated sequence of 8-bit characters can be
transformed to Unicode code points and back without any data loss. And any
promotion of CESU-8 will take us a step further away from solving this
problem, which I believe is far more important than anything CESU-8 is
addressing.

Let me just finish by saying that I am not making up this problem or just
foreseeing it - I have an actual requirement to store Unix filenames into a
UTF-16 database. Since CESU-8 is not helping me there, I cannot but urge
everyone - if there is going to be a mutation of UTF-8, it should be UTF-8B
and *not* CESU-8.
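
For the requirement mentioned above, a round trip of a Unix filename through
UTF-16 code units might look like the sketch below. It relies on Python's
"surrogateescape" and "surrogatepass" error handlers as a stand-in for a
UTF-8B-style mapping, which is an assumption made for illustration; the
function names are hypothetical and no particular database is implied.

```python
def filename_to_utf16(raw: bytes) -> bytes:
    """Turn raw filename bytes into UTF-16LE code units for storage."""
    # Invalid bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
    text = raw.decode("utf-8", "surrogateescape")
    # 'surrogatepass' lets those lone surrogates through as 16-bit units.
    return text.encode("utf-16-le", "surrogatepass")


def utf16_to_filename(stored: bytes) -> bytes:
    """Recover the exact original filename bytes from the stored form."""
    text = stored.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogateescape")


legacy_name = b"caf\xe9.txt"  # 8-bit codeset name, not valid UTF-8
stored = filename_to_utf16(legacy_name)
print(stored.hex(), utf16_to_filename(stored) == legacy_name)
```

The stored value contains an unpaired surrogate, so it is not well-formed
UTF-16; whether a given store accepts that would need to be checked.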

Merry Christmas to all,

Lars Kristan


