From: David Starner (firstname.lastname@example.org)
Date: Sun Jan 21 2007 - 07:26:19 CST
On 1/20/07, Ruszlan Gaszanov <email@example.com> wrote:
> Why would we need a new UTF? Well, of all currently available encoding schemed for Unicode, only UTF-32 is fixed-length. However, while it might be convenient for internal processing on 32/64bit platforms, 11 spare bits per code unit is much too wasteful for long-term storage and interchange. Again if we have spare bits, why not just as well make them useful for, let+IBk-s say, error detection or avoiding undesired sequences (like NUL).
You don't need fixed-length for long-term storage and interchange.
Frankly, any long-term storage and interchange that doesn't use a
general purpose compression scheme is wasteful; bzip compression runs
about 3 bits per character for alphabetic text and less than 7 bits
per character for ideographic text. Bzip also includes some degree of
error detection, but there are many better tools for serious
error detection.
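A rough way to check figures like these on your own text is sketched below in Python, using the standard bz2 module. The function name bits_per_char and the sample string are made up for illustration, and the repeated sample compresses far better than real prose would, so treat the printed number only as a demonstration of the measurement, not as a benchmark.

```python
import bz2


def bits_per_char(text: str, encoding: str = "utf-8") -> float:
    """Compressed size in bits divided by the number of characters.

    Hypothetical helper: encodes the text, compresses it with bzip2,
    and reports how many bits each character costs on average.
    """
    if not text:
        raise ValueError("text must be non-empty")
    compressed = bz2.compress(text.encode(encoding), compresslevel=9)
    return 8 * len(compressed) / len(text)


# Illustrative sample only; a meaningful measurement needs a large,
# realistic corpus rather than one repeated sentence.
sample = "The quick brown fox jumps over the lazy dog. " * 500
print(f"{bits_per_char(sample):.2f} bits per character")
```

Passing a different encoding (for example "utf-16-le") lets you compare how much the choice of UTF matters once a general purpose compressor is applied afterwards.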
For avoiding undesired sequences, UTF-8 does that quite well. Many
tools that need undesired sequences avoided tend to also assume that
0x00-0x7f is ASCII, which UTF-8 supports. I think it notable that
UTF-7, which was designed to avoid undesired sequences for email, tends
to be poorly supported; for example, Google mail seems to have mangled
the UTF-7 in your post. Instead, a general purpose encoding, usually
Base64, is used to encode both the text and the attachments without
concern for the details of the contents.
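The points above can be illustrated with a short Python sketch using only the standard library. The helper names (utf8_is_ascii_transparent, decode_utf7, to_base64) are hypothetical, chosen for this example. It checks that UTF-8 only emits bytes below 0x80 for ASCII characters themselves, decodes the UTF-7 fragment from the quoted text, and wraps UTF-8 text in Base64 the way a mail transport might.

```python
import base64


def utf8_is_ascii_transparent(text: str) -> bool:
    """True if bytes below 0x80 appear only for ASCII characters.

    This is the property that lets tools assuming 0x00-0x7f is ASCII
    (and treating NUL specially) handle UTF-8 safely.
    """
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            # ASCII characters must map to the identical single byte.
            if encoded != bytes([ord(ch)]):
                return False
        elif any(b < 0x80 for b in encoded):
            # Non-ASCII characters must never produce an ASCII-range byte.
            return False
    return True


def decode_utf7(raw: bytes) -> str:
    """Decode a UTF-7 byte string with Python's built-in codec."""
    return raw.decode("utf-7")


def to_base64(text: str) -> bytes:
    """Encode text as UTF-8, then wrap it in Base64 (pure ASCII output)."""
    return base64.b64encode(text.encode("utf-8"))


print(utf8_is_ascii_transparent("na\u00efve \u65e5\u672c\u8a9e"))
print(decode_utf7(b"let+IBk-s"))  # the fragment from the quoted message
print(to_base64(decode_utf7(b"let+IBk-s")))
```

The Base64 step does not care what the payload is, which is why it works equally well for text and attachments.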
To call for a new UTF requires evidence that someone will actually use
it. As pointed out above, UTF-7, which avoids non-mail-safe characters, is
rarely used. Likewise, current encodings designed with an extreme
concern for size, like SCSU and BOCU, frequently aren't used, because
UTF-8 or UTF-16 combined with a general purpose compression scheme
works much better for any long text. As for fixed length encodings,
again, the existing UTF-32 tends to play second fiddle to UTF-8 and
UTF-16. I don't see enough demand for the existing fixed-length
encoding to justify introducing a second one.