From: Frank Ellermann (firstname.lastname@example.org)
Date: Mon Jan 22 2007 - 12:49:30 CST
Mark Davis wrote:
> This has the very significant problem of ASCII incompatibility: the
> key advantage of UTF-8 is that values of 0..127 are never part of a
> multibyte character. That is one of the reasons why the simple
> approach of just using 7 bits of content with a bit to say "has
> continuation", while considered, never got any traction.
Yes, "get a 1:1 correspondence for the 128 ASCII octets" was another
goal, in addition to "find something working for 31 bits". And let
a single error destroy only one code point.
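The ASCII transparency both messages rely on is easy to check directly: in a UTF-8 byte stream, every octet below 0x80 is a complete ASCII character in its own right, never a lead or trail byte of a multibyte sequence. A minimal check in Python:

```python
# UTF-8's key property: octets 0..127 never occur inside a multibyte
# sequence, so any byte below 0x80 always stands for itself.
text = "naïve café: 5€"          # mixes ASCII with multibyte characters
encoded = text.encode("utf-8")

for byte in encoded:
    if byte < 0x80:
        # every ASCII-range byte is literally a character of the text
        assert chr(byte) in text
    else:
        # all lead and continuation octets of multibyte sequences
        # fall in the range 0x80..0xFF
        assert byte >= 0x80
```

This is also why a single corrupted octet destroys at most one code point: a decoder can always resynchronize at the next byte below 0x80 or at the next lead byte.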
For UTF-1 a goal was to protect the 64 control characters, which is also
fine, but unfortunately not what actually counts for some legacy
protocols. And the modulo 192 in UTF-1 is stranger than the modulo 64 in
UTF-8. Modulo 243 in BOCU-1 is the oddest, protecting only
256-243 = 13 important octets.
> IMO, the whole discussion of "UTF-24" is of only academic interest
ACK, the field of compression is explored in almost all directions.
My own experiments go in the opposite direction, expansion: protect
224 Latin-1 characters (C0, G0, G1) instead of only ASCII (C0 + G0),
and use the remaining 32 octets to encode any code point outside of
the "visible Latin-1 or C0" set. Only legacy text applications can
really use this for documents mostly in Latin-1. With that I got a
modulo 16 (hex.) "UTF-4" scheme, otherwise the same design as UTF-8.
But a decent escape mechanism with hex. XML NCRs is good enough, and
so "UTF-4" is also only academic. At least it convinced that it's
impossible to "improve" UTF-8 without giving up one or more of its
This archive was generated by hypermail 2.1.5 : Mon Jan 22 2007 - 12:55:46 CST