Re: Sync/Seek-robust UTF-7

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Jun 18 2002 - 18:03:25 EDT


Shlomi Tal <shlompi at hotmail dot com> wrote, and Markus Scherer
<markus dot scherer at jtcsv dot com> responded, regarding Shlomi's
experimental UTF.

Please note, before anyone gets the wrong idea, that these experimental
UTFs are *not* intended as candidates to replace the official ones. As
far as I am concerned, they are for fun. Not all are "jokes" in the
sense of being ridiculous or absurd, however. Many are meant as
intellectual exercises, to help understand the thought process behind
designing a good UTF (e.g. what made UTF-8 so much more successful than
UTF-1, which was superior in some ways?)

For a good sense of what goes on in the mind of someone like me,
Shlomi, Markus, or Marco Cimarosti when we invent these things, see
the Jargon File entry on "hacker humor" (example 2 in particular):

http://www.tuxedo.org/~esr/jargon/html/entry/hacker-humor.html

Now on to the discussion.

[Shlomi]
> If you think 7-bit issues are totally obsolete, then sorry for
> bothering...

[Markus]
> Personally, I think they are, but I do find encoding schemes
> entertaining :-)

I agree with Markus. I might have to drag my vestigial "UTF-Fieldata"
concept out of the closet...

[Shlomi]
> UTF-7 is both stateful and fragile. Stateful it has to be, because
> any attempt to encode a large charset AND maintain compatibility to
> ASCII has to be stateful. However, it is also fragile in that there
> is no self-sync or seek coherence (that's the advantage of UTF-8, as
> we all know).

[Markus]
> Fragile. You assume lossy transport instead of trusting the error
> correction of the lower layers.

But people do continue to design file formats with CRCs and other
validity checks. This was a very important feature in the days when our
300-baud modems had lousy error checking. I don't know how valuable it
is today.

[Markus]
> ... if you also care to stay within 7 bits.

Which was the original intent.

[Shlomi]
> Borrowing from the idea of ISO-2022-JP extended into EUC, but the
> other way round, I had the following "Gedankenexperiment":
>
> 00..A0 stay the same
> FF not used
> C2..FE leadbytes (1 leadbyte)
> A1..C1 trailbytes (2 trailbytes)
>
> allowing 61 x 33 x 33 codepoints - a little more than 65536.

Where do the 3-byte sequences begin? Does the sequence C2 A1 A1
represent U+0000 or U+00A1?

In the first case (like UTF-8), you have the possibility of non-shortest
sequences, which you can either allow or forbid. If you allow them as
alternatives to the single-byte form, any search operations that operate
on undecoded data (for whatever reason) must recognize the two
equivalent forms. If you forbid them (again like UTF-8), then every
decoder must check for them and reject them.

In the second case (like UTF-16), you have no duplicate sequences, but
now you have an additive offset of 0x00A1, much as UTF-16 has its
offset of 0x10000. Some people find this annoying about UTF-16.

I don't think either solution is "right" or "wrong"; it's just something
you have to think about.
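
To make the second case concrete, here is a rough Python sketch of the
8-bit form with the 0x00A1 offset, so that C2 A1 A1 means U+00A1. The
names (encode_unit, decode, OFFSET and so on) are mine, not Shlomi's,
and the byte order of the two trail bytes is an assumption, since the
proposal as quoted does not spell it out.

```python
# Sketch of the 8-bit scheme under the "second case" (offset) reading.
# All names are illustrative; none come from Shlomi's proposal.
LEAD_MIN, LEAD_MAX = 0xC2, 0xFE      # 61 lead bytes
TRAIL_MIN, TRAIL_MAX = 0xA1, 0xC1    # 33 trail bytes
N_TRAIL = TRAIL_MAX - TRAIL_MIN + 1
OFFSET = 0xA1                        # first value that needs three bytes


def encode_unit(u):
    """Encode one 16-bit value as one byte or as lead + two trail bytes."""
    if not 0 <= u <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    if u < OFFSET:
        return bytes([u])
    lead, rest = divmod(u - OFFSET, N_TRAIL * N_TRAIL)
    t1, t2 = divmod(rest, N_TRAIL)
    return bytes([LEAD_MIN + lead, TRAIL_MIN + t1, TRAIL_MIN + t2])


def decode(data):
    """Decode bytes into a list of 16-bit values, rejecting bad input."""
    units, i = [], 0
    while i < len(data):
        b = data[i]
        if b < OFFSET:                       # 00..A0 stand for themselves
            units.append(b)
            i += 1
            continue
        if not LEAD_MIN <= b <= LEAD_MAX:    # A1..C1 or FF where a lead belongs
            raise ValueError("stray trail byte or FF at offset %d" % i)
        trails = data[i + 1:i + 3]
        if len(trails) != 2 or not all(
                TRAIL_MIN <= t <= TRAIL_MAX for t in trails):
            raise ValueError("bad trail bytes after offset %d" % i)
        u = ((b - LEAD_MIN) * N_TRAIL + trails[0] - TRAIL_MIN) * N_TRAIL
        u += trails[1] - TRAIL_MIN + OFFSET
        if u > 0xFFFF:                       # 66429 slots, only 65375 used
            raise ValueError("value out of 16-bit range")
        units.append(u)
        i += 3
    return units


print(encode_unit(0xA1).hex())   # c2a1a1
```

Under the first reading you would drop OFFSET from the three-byte
branch, and the decoder would then have to decide what to do with a
three-byte sequence that decodes to a value below 0x00A1.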

[Markus]
> What about the other 1M code points? Would this encode UTF-16 code
> units?

In private communication, Shlomi indicated that yes, you would need to
apply this algorithm to UTF-16 code units rather than Unicode scalar
values. This is like UTF-7 and CESU-8. (Yuck.)
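
For anyone who wants to see what "UTF-16 code units" means in practice,
here is a small Python sketch. The helper name utf16_code_units is
made up for illustration. It shows a supplementary character arriving
as two surrogate code units, each of which the scheme would then
encode separately.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")           # big-endian, no BOM
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


# 'A', U+00E9, and U+10400 (outside the BMP, so a surrogate pair)
print([hex(u) for u in utf16_code_units("A\u00e9\U00010400")])
# ['0x41', '0xe9', '0xd801', '0xdc00']
```

Since each surrogate is above U+00A0, a supplementary character would
cost two three-byte sequences, or six bytes.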

[Shlomi]
> And now, with an ISO-2022 sequence for state, reduce to 7-bit:
>
> 42..7E leadbytes (1 leadbyte)
> 21..41 trailbytes (2 trailbytes)

[Markus]
> What about 80..9f which would collide with C0 control codes?
>
> What about U+00a0 which would become 20 (space) which might be
> removed/replaced by emailers in ways that you would not expect for
> U+00a0?

Good questions. These would have to be resolved before the 7-bit
variant could work. Personally I place ISO 2022 code page switching in
the same "yuck" category as piggybacking an encoding scheme on top of
UTF-16.
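
Markus's two objections are easy to check mechanically. The sketch
below assumes the 7-bit form is obtained simply by clearing the high
bit of each byte of the 8-bit form, which matches the ranges Shlomi
gave but is my assumption, not something he stated. The function name
is hypothetical.

```python
def strip_high_bit(b):
    """Assumed 8-bit to 7-bit mapping: clear bit 7."""
    return b & 0x7F


# Lead and trail ranges land where Shlomi said they would.
lead_7bit = (strip_high_bit(0xC2), strip_high_bit(0xFE))
trail_7bit = (strip_high_bit(0xA1), strip_high_bit(0xC1))
print([hex(b) for b in lead_7bit + trail_7bit])

# The single bytes 80..A0 are the problem: they land on C0 controls and space.
problem_bytes = {b: strip_high_bit(b) for b in range(0x80, 0xA1)}
print(hex(problem_bytes[0x80]), hex(problem_bytes[0x9F]), hex(problem_bytes[0xA0]))
```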

[Markus]
> What about users' complaint of the high byte-per-code point ratio in
> Unicode encodings?
>
> For everything but ASCII (U+0000..U+007f), UTF-7 uses 2.67 B/cp,
> while this uses 3 B/cp.

Another good point. But at least this is easier to encode and decode
than UTF-7.
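
The 2.67 figure is 16 bits divided by the 6 bits each base64 character
carries. Here is a back-of-the-envelope Python sketch comparing the two
for a run of n 16-bit units. The function names are mine, and I am
assuming that a UTF-7 run always pays for both the "+" and the closing
"-", which a real encoder can sometimes omit. Any ISO 2022 shift
sequences for the 7-bit variant are not counted.

```python
def utf7_run_bytes(n):
    """Bytes for one shifted UTF-7 run of n 16-bit units."""
    base64_chars = (16 * n + 5) // 6     # ceiling of 16n/6
    return 1 + base64_chars + 1          # '+' ... '-'


def three_byte_run_bytes(n):
    """Bytes for n units above U+00A0 in Shlomi's scheme."""
    return 3 * n


for n in (1, 3, 30):
    print(n, utf7_run_bytes(n), three_byte_run_bytes(n))
# 1 5 3
# 3 10 9
# 30 82 90
```

So under these assumptions UTF-7 only pulls ahead on longer runs; for a
stray non-ASCII character or two, the fixed three bytes are no worse.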

[Shlomi]
> Stateful, yes... fragile, no! Any relevance, or is this just an
> amusing experiment to be kept among geeks privately?

[Markus]
> Time will tell. You could ask Doug to add it to his collection :-)

Oh goody, I'm famous for something. Oh well, no such thing as bad
publicity, right?

-Doug Ewell
 Fullerton, California


