Re: Sync/Seek-robust UTF-7

From: David Starner
Date: Tue Jun 18 2002 - 16:44:03 EDT

At 10:21 AM 6/18/02 +0000, Shlomi Tal wrote:
>Stateful, yes... fragile, no! Any relevance, or is this just an amusing experiment to be kept among geeks privately?

There's a huge number of features to be traded off when making a UTF:
complexity, encoding/decoding speed, uniqueness, statefulness, and
seekability (being able to start at an arbitrary point in the stream
and find character boundaries either (a) right away or (b) after a
short seek backward or forward). The current UTFs made certain choices,
and it's generally thought better to stick with the UTFs we have,
for simplicity's sake, than add more that don't make radically new
choices. Given that a 7-bit UTF is not a major need, and UTF-8 is
more often used even in UTF-7's home field of email, I don't see
why a new UTF would be more than an amusing experiment. UTF-7
works well enough for what it does.
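
To make case (b) of seekability concrete, here is a minimal sketch in
Python using UTF-8, where continuation bytes always have the bit
pattern 10xxxxxx. The function name next_boundary is my own, purely
for illustration.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that starts a UTF-8 character."""
    # Continuation bytes look like 10xxxxxx; skip forward past them.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# "a", EURO SIGN, "b" encodes as 61 E2 82 AC 62.
sample = "a\u20acb".encode("utf-8")
```

Landing in the middle of the three-byte euro sign (index 2 or 3) needs
a seek of at most two bytes to reach the next boundary at index 4. A
stateful encoding cannot do this without finding the last escape
sequence first.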

That said, I've been working on my own UTF, privately dubbed
ISO-2022-UTF. It does end up mapping 96-character planes to
G0, but ISO-2022-JP-3 does it, and that's a MIME-legal charset.

Code points              Designation
U+0000-U+007F (ASCII)    ESC 2/8 4/2
U+0000-U+23FF            ESC 2/14 3/1
U+2400-U+47FF            ESC 2/14 3/2
U+4800-U+6BFF            ESC 2/14 3/3
U+6C00-U+8FFF            ESC 2/14 3/4
U+9000-U+B3FF            ESC 2/14 3/5
U+B400-U+D7FF            ESC 2/14 3/6
U+D800-U+FBFF            ESC 2/14 3/7
U+FC00-U+11FFF           ESC 2/14 3/8
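
Each non-ASCII row covers 0x2400 = 96 * 96 code points, so the final
byte of the escape sequence can be computed rather than looked up. A
rough sketch in Python, where the names designation and PLANE_SIZE are
mine and not part of any spec, and where the ASCII row is preferred
for U+0000-U+007F:

```python
ESC = b"\x1b"
ASCII_ESCAPE = ESC + b"\x28\x42"   # ESC 2/8 4/2
PLANE_SIZE = 96 * 96               # 0x2400 code points per row of the table


def designation(cp: int) -> bytes:
    """Return the escape sequence from the table for code point cp."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return ASCII_ESCAPE
    if cp > 0x11FFF:
        raise ValueError("outside the ranges listed in the table")
    plane = cp // PLANE_SIZE       # 0 for U+0000-U+23FF ... 7 for U+FC00-U+11FFF
    # x/y notation is column/row, so 2/14 is 0x2E and 3/1 is 0x31.
    return ESC + bytes([0x2E, 0x31 + plane])
```

For instance, designation(0x4E00) gives ESC 2/14 3/3, matching the
U+4800-U+6BFF row.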

ISO-2022-UTF starts with ASCII in G0 and the normal C0 set in C0. It's
invalid to use ESC 2/14 3/1 for characters in ASCII. For characters
above U+11FFF, surrogate characters are used. For characters between
U+10000 and U+11FFF, ESC 2/14 3/8 should be used. When used as a MIME
charset, it is suggested that every line end with a return to ASCII,
for compatibility with ISO-2022-JP-*. Also, when used as a MIME
charset, CRLF must be used as the line ending.
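
A toy encoder following these rules might look like the Python sketch
below. Several details are assumptions of mine, since nothing above
pins them down: each character in a 96x96 set takes two bytes, the
bytes are 0x20 + offset // 96 and 0x20 + offset % 96, and characters
above U+11FFF are split into UTF-16 surrogates that are then encoded
from the U+D800-U+FBFF set. The sketch switches back to ASCII for
every ASCII character, CR and LF included, so each line ends in ASCII;
it does not convert line endings, so the caller must supply CRLF.

```python
ESC = b"\x1b"
TO_ASCII = ESC + b"(B"      # ESC 2/8 4/2
PLANE_SIZE = 96 * 96        # 0x2400 code points per designated set


def encode(text: str) -> bytes:
    """Encode text as ISO-2022-UTF (byte layout within a set is assumed)."""
    out = bytearray()
    state = None            # None means ASCII is in G0 (the initial state)
    for ch in text:
        cp = ord(ch)
        if cp > 0x11FFF:
            # Above U+11FFF: use a UTF-16 surrogate pair.
            v = cp - 0x10000
            units = [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
        else:
            units = [cp]
        for u in units:
            if u < 0x80:
                # ASCII must not be sent through ESC 2/14 3/1.
                if state is not None:
                    out += TO_ASCII
                    state = None
                out.append(u)
                continue
            plane = u // PLANE_SIZE
            if state != plane:
                out += ESC + bytes([0x2E, 0x31 + plane])
                state = plane
            offset = u - plane * PLANE_SIZE
            # Assumed layout: two bytes, each in 0x20-0x7F.
            out += bytes([0x20 + offset // 96, 0x20 + offset % 96])
    if state is not None:
        out += TO_ASCII     # always finish in ASCII
    return bytes(out)
```

Note that under this assumed layout the bytes 0x20 and 0x7F show up as
data inside a designated set, which is one of the things that would
need formalizing.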

I don't see a real use for it, and as it stands it could use some
formalization before actual use. But it seems like a workable enough
design for an ISO-2022 UTF.
