Re: Proposing UTF-21/24

From: David Starner (prosfilaes@gmail.com)
Date: Sun Jan 21 2007 - 07:26:19 CST


    On 1/20/07, Ruszlan Gaszanov <ruszlan@ather.net> wrote:
    > Why would we need a new UTF? Well, of all currently available encoding schemes for Unicode, only UTF-32 is fixed-length. However, while it might be convenient for internal processing on 32/64-bit platforms, 11 spare bits per code unit is much too wasteful for long-term storage and interchange. Again, if we have spare bits, why not just as well make them useful for, let+IBk-s say, error detection or avoiding undesired sequences (like NUL).

    You don't need a fixed-length encoding for long-term storage and
    interchange. Frankly, any long-term storage and interchange that
    doesn't use a general-purpose compression scheme is wasteful; bzip
    compression runs about 3 bits per character for alphabetic text and
    less than 7 bits per character for ideographic text. Bzip also
    includes some degree of error detection, but there are many better
    tools for serious error detection.
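
    As a rough way to check figures like these on your own data, the
    sketch below measures bits per character after bzip2 compression and
    shows a damaged stream being rejected. It assumes Python's standard
    bz2 module; the function names and the sample string are
    hypothetical, and a repetitive sample like this one compresses far
    better than real prose, so substitute a real corpus for meaningful
    numbers.

```python
import bz2


def bits_per_character(text: str, encoding: str = "utf-8") -> float:
    """Compressed size in bits divided by the number of code points."""
    compressed = bz2.compress(text.encode(encoding))
    return 8 * len(compressed) / len(text)


def corruption_detected(text: str) -> bool:
    """Flip one byte in the middle of the compressed stream and report
    whether decompression rejects it."""
    damaged = bytearray(bz2.compress(text.encode("utf-8")))
    damaged[len(damaged) // 2] ^= 0xFF
    try:
        bz2.decompress(bytes(damaged))
    except (OSError, ValueError, EOFError):
        # Invalid data, a block checksum mismatch, or a stream that no
        # longer terminates properly all end up here.
        return True
    return False


# Hypothetical sample text; not representative of natural language.
sample = "The quick brown fox jumps over the lazy dog. " * 2000

print(f"{bits_per_character(sample):.3f} bits per character")
print("corruption detected:", corruption_detected(sample))
```

    The check only says that something is wrong; it cannot locate or
    repair the damage, which is where dedicated tools come in.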

    For avoiding undesired sequences, UTF-8 does that quite well. Many
    tools that need undesired sequences avoided also tend to assume that
    0x00-0x7f is ASCII, which UTF-8 supports. I think it notable that
    UTF-7, which was designed to avoid undesired sequences for email,
    tends to be poorly supported; for example, Google mail seems to have
    mangled the UTF-7 in your post. Instead, a general-purpose encoding,
    usually Base64, is used to encode both the text and the attachments
    without concern for the details of the contents.
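
    The properties mentioned here can be checked with a short sketch,
    assuming Python's built-in codecs and base64 module. The sample
    string is an assumption: it is what the "+IBk-" run in the quoted
    message presumably stood for (U+2019, a right single quotation
    mark).

```python
import base64

# Assumed original text behind the mangled run in the quoted message.
text = "let\u2019s say"

utf8 = text.encode("utf-8")

# In UTF-8 every byte of a multi-byte sequence is >= 0x80, so any byte
# below 0x80 is exactly the ASCII character it appears to be.
ascii_view = bytes(b for b in utf8 if b < 0x80).decode("ascii")
print(ascii_view)

# UTF-8 emits a NUL byte only for U+0000 itself; UTF-16 embeds NUL
# bytes in ordinary Latin text.
print(0x00 in utf8, 0x00 in text.encode("utf-16-le"))

# Decoding the UTF-7 run as it appears in the quoted message.
decoded = b"let+IBk-s say".decode("utf-7")
print(decoded == text)

# Base64 is content-agnostic: arbitrary bytes in, mail-safe ASCII out.
wire = base64.b64encode(utf8)
print(wire.isascii(), base64.b64decode(wire).decode("utf-8") == text)
```

    A reader that does not understand UTF-7 shows the raw "+IBk-" run,
    which is the kind of mangling described above.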

    To call for a new UTF requires evidence that someone will actually use
    it. As pointed out above, UTF-7, which avoids non-mail-safe characters,
    is rarely used. Likewise, current encodings designed with an extreme
    concern for size, like SCSU and BOCU, frequently aren't used, because
    UTF-8 or UTF-16 combined with a general-purpose compression scheme
    works much better for any long text. As for fixed-length encodings,
    again, the existing UTF-32 tends to play second fiddle to UTF-8 and
    UTF-16. I don't see enough demand for the existing fixed-length
    encoding to justify introducing a second one.
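
    To see how the existing UTFs compare once a general-purpose
    compressor is applied, one could run something like the sketch
    below. It assumes Python's standard bz2 module; the function name
    and the mixed Latin/CJK sample are hypothetical, and SCSU and BOCU
    are left out because the standard library has no codec for them.

```python
import bz2


def encoded_sizes(text: str) -> dict[str, tuple[int, int]]:
    """Map each encoding name to (raw bytes, bzip2-compressed bytes)."""
    sizes = {}
    # The "-le" variants are used so no byte order mark is counted.
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        raw = text.encode(encoding)
        sizes[encoding] = (len(raw), len(bz2.compress(raw)))
    return sizes


# Hypothetical sample mixing alphabetic and ideographic text.
sample = ("Unicode text sample. " + "統一碼文字範例。") * 500

for name, (raw, packed) in encoded_sizes(sample).items():
    print(f"{name:10} raw={raw:7d} bytes  bzip2={packed:6d} bytes")
```

    The raw sizes differ by fixed factors, while the compressed sizes
    depend heavily on the input, so real corpora are needed before
    drawing conclusions.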


