Re: Sync/Seek-robust UTF-7

From: Markus Scherer
Date: Tue Jun 18 2002 - 11:53:16 EDT

Shlomi Tal wrote:

> If you think 7-bit issues are totally obsolete, then sorry for bothering...

Personally, I think they are, but I do find encoding schemes entertaining :-)

> UTF-7 is both stateful and fragile. Stateful it has to be, because any

Fragile. You assume lossy transport instead of trusting the error correction of the lower layers.

> attempt to encode a large charset AND maintain compatibility with ASCII has
> to be stateful.

... if you also care to stay within 7 bits.

> However, it is also fragile in that there is no
> self-sync or seek coherence (that's the advantage of UTF-8, as we all
> know).
> Borrowing from the idea of ISO-2022-JP extended into EUC, but the other
> way round, I had the following "Gedankenexperiment":
> 00..A0 stay the same
> FF not used
> C2..FE leadbytes (1 leadbyte)
> A1..C1 trailbytes (2 trailbytes)
> allowing 61 x 33 x 33 codepoints - a little more than 65536.

What about the other 1M code points? Would this encode UTF-16 code units?
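
Taken at face value, the quoted 8-bit layout can be sketched as below. This is only an illustration: the proposal gives the byte ranges but not how values map onto them, so the base-33 mapping and the assumption that each encoded value is a 16-bit unit (for example a UTF-16 code unit) are mine, as are the names encode_unit, decode_units and resync.

```python
# Byte ranges taken from the quoted proposal.
LEAD_FIRST, LEAD_LAST = 0xC2, 0xFE      # 61 lead bytes
TRAIL_FIRST, TRAIL_LAST = 0xA1, 0xC1    # 33 trail bytes
N_LEAD = LEAD_LAST - LEAD_FIRST + 1
N_TRAIL = TRAIL_LAST - TRAIL_FIRST + 1
CAPACITY = N_LEAD * N_TRAIL * N_TRAIL   # 61 x 33 x 33


def encode_unit(unit):
    """Encode one 16-bit value (ASSUMED to be a UTF-16 code unit)."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("only 16-bit values fit in this sketch")
    if unit <= 0xA0:
        return bytes([unit])            # 00..A0 stay the same
    # ASSUMED mapping: offset from 0xA1, written as lead + two base-33 digits.
    lead, rest = divmod(unit - 0xA1, N_TRAIL * N_TRAIL)
    t1, t2 = divmod(rest, N_TRAIL)
    return bytes([LEAD_FIRST + lead, TRAIL_FIRST + t1, TRAIL_FIRST + t2])


def decode_units(data):
    """Decode a byte string back into a list of 16-bit values."""
    units, i = [], 0
    while i < len(data):
        b = data[i]
        if b <= 0xA0:
            units.append(b)
            i += 1
            continue
        trails = data[i + 1:i + 3]
        if not (LEAD_FIRST <= b <= LEAD_LAST and len(trails) == 2
                and all(TRAIL_FIRST <= t <= TRAIL_LAST for t in trails)):
            raise ValueError(f"malformed sequence at offset {i}")
        n = (b - LEAD_FIRST) * N_TRAIL + (trails[0] - TRAIL_FIRST)
        n = n * N_TRAIL + (trails[1] - TRAIL_FIRST)
        units.append(n + 0xA1)
        i += 3
    return units


def resync(data, pos):
    """Return the first offset >= pos that starts a sequence.

    Lead, trail and single bytes occupy disjoint ranges, so skipping
    trail bytes is enough to find the next boundary.
    """
    while pos < len(data) and TRAIL_FIRST <= data[pos] <= TRAIL_LAST:
        pos += 1
    return pos
```

Because the three byte classes never overlap, a reader dropped into the middle of a stream skips at most two trail bytes before it is back on a boundary. The sketch says nothing about code points above U+FFFF, which is exactly the open question.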

> And now, with an ISO-2022 sequence for state, reduce to 7-bit:

You seem to imply simply switching between "lower bytes" (00..7f) and "upper bytes" (80..ff), which you can do with just SI/SO, without the rest of the ISO 2022 apparatus.

> 42..7E leadbytes (1 leadbyte)
> 21..41 trailbytes (2 trailbytes)

What about 80..9f, which would become 00..1f and collide with C0 control codes?

What about U+00a0, which would become 20 (space), which emailers might remove or replace in ways that you would not expect for U+00a0?
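
The two collisions above can be checked with a few lines. This is a sketch that assumes the 7-bit form is obtained by simply clearing the high bit of each "upper" byte after a shift; the proposal does not spell that out, and to_7bit is a hypothetical helper name.

```python
def to_7bit(b):
    """ASSUMED reduction: clear the high bit of an "upper" byte."""
    return b & 0x7F


# The lead and trail ranges land where the proposal says they do.
lead_7bit = (to_7bit(0xC2), to_7bit(0xFE))      # 42..7E
trail_7bit = (to_7bit(0xA1), to_7bit(0xC1))     # 21..41

# Single bytes 80..9F would map onto the C0 control range 00..1F.
c0_collisions = [b for b in range(0x80, 0xA0) if to_7bit(b) < 0x20]

# A0 would map onto 20, the ASCII space.
nbsp_7bit = to_7bit(0xA0)
```

Under that assumption all 32 bytes in 80..9f end up as C0 controls, and a0 ends up as a plain space.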

What about users' complaints about the high byte-per-code-point ratio in Unicode encodings?

For everything but ASCII (U+0000..U+007f), UTF-7 uses about 2.67 B/cp (16 bits per code unit at 6 bits per base64 character), while this uses 3 B/cp.
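
The arithmetic behind those two figures, as a rough sketch. It assumes BMP text encoded in one long shifted run, so the shift characters that open and close the run are negligible, and it uses Python's built-in "utf-7" codec only as a sanity check.

```python
from fractions import Fraction

# UTF-7 base64-encodes UTF-16: 16 bits per code unit, 6 bits per output byte.
utf7_ratio = Fraction(16, 6)        # 8/3, about 2.67 B/cp
proposed_ratio = 3                  # one lead byte plus two trail bytes

# Sanity check on a long run of one non-ASCII BMP character.
text = "\u4e2d" * 300
measured = len(text.encode("utf-7")) / len(text)
```

For short runs the shift characters push the UTF-7 figure up, so the gap between the two schemes is narrower than 2.67 versus 3 suggests.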

> Stateful, yes... fragile, no! Any relevance, or is this just an amusing
> experiment to be kept among geeks privately?

Time will tell. You could ask Doug to add it to his collection :-)

