From: Doug Ewell (firstname.lastname@example.org)
Date: Sun Jan 21 2007 - 21:59:30 CST
Ruszlan Gaszanov <ruszlan at ather dot net> wrote:
>> Some of your arguments like "won't need a BOM anymore" don't make
>> sense for me...
> Well, since conversion between UTF-21/24 and UTF-32 (and UTF-16 for
> BMP characters) is trivial - much more so than with UTF-8 - some
> application designers might prefer to use the same byte order for
> UTF-21/24 as they are using for UTF-16/32 in order to make processing
> faster. Hence we might get BE/LE varieties of UTF-21/24 and have to
> deal with the BOM issue. Therefore, the error detection mechanisms I
> proposed for UTF-24 varieties also allow automatic byte order
Conversion to and from UTF-8 is really quite simple. It may look like a
lot of lines of code, but most of it is conditional -- only one of the
branches runs for each lead byte.
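
To illustrate the point about branching, here is a minimal sketch of a UTF-8 decoder in Python. The function name decode_utf8 and the error messages are illustrative assumptions, not anything from the original thread; the byte ranges follow the standard UTF-8 definition. Note that exactly one branch of the if/elif chain runs for each lead byte.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into a list of code points (illustrative sketch)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Exactly one of these branches runs for each lead byte.
        if lead < 0x80:                 # 0xxxxxxx: ASCII, no trail bytes
            cp, extra, minimum = lead, 0, 0
        elif 0xC2 <= lead <= 0xDF:      # 110xxxxx: one trail byte
            cp, extra, minimum = lead & 0x1F, 1, 0x80
        elif 0xE0 <= lead <= 0xEF:      # 1110xxxx: two trail bytes
            cp, extra, minimum = lead & 0x0F, 2, 0x800
        elif 0xF0 <= lead <= 0xF4:      # 11110xxx: three trail bytes
            cp, extra, minimum = lead & 0x07, 3, 0x10000
        else:                           # stray trail byte or invalid lead
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")

        if i + extra >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")

        # Each trail byte must look like 10xxxxxx and contributes 6 bits.
        for trail in data[i + 1 : i + 1 + extra]:
            if trail & 0xC0 != 0x80:
                raise ValueError(f"invalid trail byte 0x{trail:02X}")
            cp = (cp << 6) | (trail & 0x3F)

        # Reject overlong forms, surrogates, and values past U+10FFFF.
        if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"ill-formed sequence at offset {i}")

        code_points.append(cp)
        i += 1 + extra
    return code_points
```

Most of the length comes from validation; the decoding itself is a mask on the lead byte plus one shift-and-or per trail byte.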
Ruszlán, take it from me: I was a well-known inventor of alternative
UTFs several years ago, and as far back as 1998 I came up with a
compression scheme that vaguely resembled SCSU window switching
(simpler, but less efficient). Gradually and patiently, I was persuaded
(and saw for myself) that these alternative schemes had no chance of
widespread adoption. Even if they were better, they were not "better
enough." Eventually, after learning quite a bit about encoding
strategies and Unicode policy, I stopped inventing and learned to embrace
the existing encoding schemes.
Your ideas reminded me of the variable-length scheme Frank mentioned.
(I thought I had invented that one too, based on Mark Crispin's
mostly-whimsical UTF-9 RFC.) I actually do use that scheme for some
internal purposes, not all of which have to do with Unicode code points.
For example, I'm working on a Unicode-enabled Huffman encoder that uses
the variable-length scheme to store character frequencies. It has the
advantage of not being limited to 0x10FFFF or any other number, and for
this purpose the "false ASCII find" problem is not a problem at all.
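
The message does not spell out the variable-length scheme, so the following Python sketch rests on an assumption: that it resembles an octet analogue of UTF-9, with 7 payload bits per byte and the high bit acting as a "more bytes follow" flag. The names encode_varint and decode_varints are hypothetical. The sketch shows why such a format has no upper limit, and why its last byte can look like ASCII.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, most significant first.

    Assumed format: every byte except the last has its high bit set.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    groups = [value & 0x7F]            # final byte: high bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)   # earlier bytes: high bit set
        value >>= 7
    return bytes(reversed(groups))


def decode_varints(data: bytes) -> list[int]:
    """Decode a concatenation of values written by encode_varint."""
    values = []
    current = 0
    pending = False                    # True while inside a multi-byte value
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:
            pending = True
        else:
            values.append(current)
            current = 0
            pending = False
    if pending:
        raise ValueError("truncated value at end of input")
    return values
```

Under this assumed layout, the value 300 encodes as the bytes 0x82 0x2C, and 0x2C on its own is an ASCII comma. That is the "false ASCII find" effect: harmless for a table of frequencies, awkward for text that will be searched byte by byte.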
But for storage purposes, you don't want to use 3 bytes for each
character -- not with the overwhelming prevalence of BMP characters in
almost all text. There's a reason why almost nobody uses UTF-32;
cutting the storage from four bytes to three won't change that. And for
interchange, you don't want the overhead of calculating or checking
parity for each 3-byte series. It's not as computationally cheap as it
seems, compared to decoding UTF-8 or even SCSU. (The complexity of
decoding SCSU is vastly overstated, as I wrote in Unicode Technical Note
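
The storage argument above can be checked with a few lines of Python. This is only a rough sketch: storage_sizes is a hypothetical helper, "fixed 3-byte" stands in for any encoding that spends three bytes on every code point, and parity or other per-character overhead is not modeled.

```python
def storage_sizes(text: str) -> dict[str, int]:
    """Return the byte count of text under several encodings (no BOM)."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        # len() on a Python str counts code points, so this is 3 bytes each.
        "fixed 3-byte": 3 * len(text),
        "UTF-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for sample in ("hello", "h\u00e9llo \U0001F600"):
        print(repr(sample), storage_sizes(sample))
```

For text made up entirely of BMP characters, UTF-16 needs two bytes per character, so a fixed three-byte form is always larger; it only matches or beats UTF-8 and UTF-16 when supplementary characters dominate.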
It's true that you don't need a Byte Order Mark per se with a byte-based
encoding such as this, but you might still want to be able to use U+FEFF
as an encoding signature. All Unicode encodings have this defined. The
problem with U+FEFF is not so much its use as a byte order mark or
signature, but rather its parallel and conflicting use as a zero-width
no-break space (which was never widely used and which is now
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages