Re: Proposing UTF-21/24

From: Frank Ellermann (nobody@xyzzy.claranet.de)
Date: Mon Jan 22 2007 - 12:49:30 CST

In reply to: Mark Davis: "Re: Proposing UTF-21/24"

Mark Davis wrote:

> This has the very significant problem of ASCII incompatibility: the
> key advantage of UTF-8 is that values of 0..127 are never part of a
> multibyte character. That is one of the reasons why the simple
> approach of just using 7 bits of content with a bit to say "has
> continuation", while considered, never got any traction.

Yes, "get a 1:1 correspondence for the 128 ASCII octets" was another
goal, in addition to "find something working for 31 bits". So was
letting a single error destroy only one code point.
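
To make the ASCII-incompatibility point concrete, here is a minimal
sketch in Python. The function varint_encode is a hypothetical stand-in
for the "7 bits of content plus a continuation bit" idea; it is not taken
from any proposal in this thread. It shows that the last octet of a
multi-octet sequence falls in 0..127, while UTF-8 keeps every octet of a
multi-octet sequence at 0x80 or above.

```python
def varint_encode(cp: int) -> bytes:
    """Hypothetical scheme: 7 content bits per octet, high bit = 'more follows'."""
    groups = []
    while True:
        groups.append(cp & 0x7F)
        cp >>= 7
        if cp == 0:
            break
    groups.reverse()  # most significant group first
    # Every octet except the last one carries the continuation bit.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def has_ascii_octet(data: bytes) -> bool:
    """True if any octet is in the ASCII range 0..127."""
    return any(b < 0x80 for b in data)


euro = 0x20AC                            # U+20AC EURO SIGN
as_varint = varint_encode(euro)          # C1 2C: the final octet is ASCII ","
as_utf8 = chr(euro).encode("utf-8")      # E2 82 AC: no ASCII octet at all

print(as_varint.hex(" "), has_ascii_octet(as_varint))
print(as_utf8.hex(" "), has_ascii_octet(as_utf8))
```

With this scheme a byte-oriented search for "," would hit the tail of the
euro sign, which is the kind of breakage the UTF-8 design rules out.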

For UTF-1 a goal was to protect the 64 control characters (plus space
and DEL), also fine, but unfortunately not what actually counts for
some legacy protocols. And the modulo 190 in UTF-1 is stranger than
the modulo 64 in UTF-8. Modulo 243 in BOCU-1 is the oddest, protecting
256-243=13 important ASCII characters.

> IMO, the whole discussion of "UTF-24" is of only academic interest

ACK, the field of compression is explored in almost all directions.
My own experiments go in the opposite direction, expansion: protect
224 Latin-1 characters (C0, G0, G1) instead of only ASCII (C0 + G0),
and use the remaining 32 octets to encode any code point outside of
the "visible Latin-1 or C0" set. This is really only useful for legacy
text applications handling documents that are mostly Latin-1. With that
I got a modulo 16 (hex.) "UTF-4" scheme, otherwise the same design as
UTF-8.
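
The exact octet layout of "UTF-4" is not given in this message, so the
following Python sketch is only one possible reading of it. The layout is
an assumption: the 32 C1 octets are split into lead octets 0x90..0x9F
(low nibble = number of trail octets) and trail octets 0x80..0x8F (4 bits
of content each). The names utf4_encode, utf4_decode, LEAD and TRAIL are
made up for illustration.

```python
LEAD, TRAIL = 0x90, 0x80  # assumed split of the 32 C1 octets


def is_direct(cp: int) -> bool:
    """C0, G0 and G1 code points map 1:1 to their Latin-1 octet."""
    return cp <= 0x7F or 0xA0 <= cp <= 0xFF


def utf4_encode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if is_direct(cp):
            out.append(cp)
            continue
        nibbles = []
        while cp:                      # cp is never 0 here, U+0000 is direct
            nibbles.append(cp & 0xF)
            cp >>= 4
        out.append(LEAD | len(nibbles))              # lead: nibble count
        out.extend(TRAIL | n for n in reversed(nibbles))
    return bytes(out)


def utf4_decode(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b < 0x80 or b >= 0xA0:      # protected Latin-1 octet
            out.append(chr(b))
            continue
        if b < LEAD:
            raise ValueError("unexpected trail octet")
        count = b & 0xF
        trail = data[i:i + count]
        if not 1 <= count <= 6 or len(trail) != count:
            raise ValueError("bad lead octet or truncated sequence")
        if any((t & 0xF0) != TRAIL for t in trail):
            raise ValueError("malformed trail octet")
        cp = 0
        for t in trail:
            cp = (cp << 4) | (t & 0xF)
        out.append(chr(cp))            # overlong forms are not rejected here
        i += count
    return "".join(out)


sample = "na\u00efve \u20ac"           # i-diaeresis is direct, euro is escaped
encoded = utf4_encode(sample)
print(encoded.hex(" "))
print(utf4_decode(encoded) == sample)
```

Under this assumed layout the euro sign takes five octets instead of the
three it needs in UTF-8, which illustrates why this is expansion rather
than compression.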

But a decent escape mechanism with hex. XML NCRs is good enough, and
so "UTF-4" is also only academic. At least it convinced me that it's
impossible to "improve" UTF-8 without giving up one or more of its
design goals.
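
For comparison, the NCR escape approach can be sketched in a few lines
of Python. The helper name to_latin1_with_ncrs is hypothetical, and the
choice to escape the C1 range as well is an assumption made to mirror
the "visible Latin-1 or C0" set described above.

```python
def to_latin1_with_ncrs(text: str) -> bytes:
    """Keep C0, G0 and G1 as Latin-1 octets, escape the rest as hex NCRs."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x7F or 0xA0 <= cp <= 0xFF:
            parts.append(ch)
        else:
            parts.append(f"&#x{cp:X};")   # hexadecimal XML character reference
    # Only Latin-1 code points remain, so this encode step cannot fail.
    return "".join(parts).encode("latin-1")


print(to_latin1_with_ncrs("na\u00efve \u20ac"))
```

This sketch does not escape a literal "&" or "<", which a real XML
serializer would also have to do. Python's built-in
text.encode("latin-1", "xmlcharrefreplace") is similar, but it emits
decimal references and passes C1 code points through unescaped.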

Frank

