Re: Proposing UTF-21/24

From: Mark Davis (mark.davis@icu-project.org)
Date: Sun Jan 21 2007 - 18:47:10 CST


    This has the very significant problem of ASCII incompatibility: the key
    advantage of UTF-8 is that byte values 0..127 never occur inside a
    multibyte character. That is one of the reasons why the simple approach of
    using 7 bits of content plus one bit to say "has continuation", while
    considered, never got any traction. (That mechanism is, on the other hand,
    fairly common for compressing integers or arrays of them.)
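
    To make the clash concrete, here is a minimal Python sketch of the
    7-bits-plus-continuation scheme quoted below. The names encode_7bit and
    decode_7bit are hypothetical, and putting the most significant group
    first is an assumption, not something the thread specifies.

```python
def encode_7bit(cp: int) -> bytes:
    """Encode one code point as 7-bit groups, most significant first.

    Every octet except the last has its high bit set ("has continuation").
    Assumption: shortest form only, so a leading 0x80 octet never appears.
    """
    if not 0 <= cp < (1 << 21):
        raise ValueError("code point does not fit in 21 bits")
    groups = [cp & 0x7F]
    cp >>= 7
    while cp:
        groups.append(cp & 0x7F)
        cp >>= 7
    groups.reverse()
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def decode_7bit(data: bytes) -> list:
    """Decode a run of octets produced by encode_7bit into code points."""
    out, value = [], 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
        if b < 0x80:          # high bit clear: this octet ends the sequence
            out.append(value)
            value = 0
    return out


# U+00AF ends in the octet 0x2F, which is also ASCII "/".
print(encode_7bit(0xAF))            # b'\x81/'
print("\u00af".encode("utf-8"))     # b'\xc2\xaf'
# U+0080 ends in a NUL octet.
print(encode_7bit(0x80))            # b'\x81\x00'
```

    Under these assumptions a byte-wise search for "/" or NUL can match in
    the middle of a multi-octet sequence, which cannot happen in UTF-8.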

    IMO, the whole discussion of "UTF-24" is of only academic interest; both
    UTF-8 and UTF-16 have better storage characteristics (remember that 4-byte
    characters have, and will have, extremely low frequency of usage), and for
    in-memory handling "UTF-24" doesn't buy much.
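
    The storage point can be checked with a few lines of Python. This is a
    rough sketch that assumes "UTF-24" means a fixed three octets per code
    point (the thread also uses the name for the variable-length form); the
    function name storage_sizes and the sample strings are illustrative only.

```python
def storage_sizes(text: str) -> dict:
    """Return the octet count of text under three encodings.

    "fixed-3" is the assumed fixed-width reading of "UTF-24":
    three octets per code point, no byte order mark.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # "-le" avoids adding a BOM
        "fixed-3": 3 * len(text),                 # len() counts code points
    }


print(storage_sizes("hello"))
# {'utf-8': 5, 'utf-16': 10, 'fixed-3': 15}
print(storage_sizes("\u65e5\u672c\u8a9e"))   # three BMP (CJK) characters
# {'utf-8': 9, 'utf-16': 6, 'fixed-3': 9}
print(storage_sizes("\U0001D11E"))           # one supplementary character
# {'utf-8': 4, 'utf-16': 4, 'fixed-3': 3}
```

    In these samples the fixed three-octet form is smaller only for the
    supplementary character, the case described above as extremely rare.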

    Mark

    On 1/21/07, Frank Ellermann <nobody@xyzzy.claranet.de> wrote:
    >
    > David Starner wrote:
    >
    > > current encodings designed with an extreme concern for size, like
    > > SCSU and BOCU, frequently aren't used, because UTF-8 or UTF-16
    > > combined with a general purpose compression scheme works much
    > > better for any long text.
    >
    > Yes, but the 3*7 approach is still fascinating because it's so
    > simple. When UTF-8 was invented they couldn't do this, they
    > needed something for 31 bits.
    >
    > With 3*7 it's (in theory) possible to replace UTF-8 by "UTF-24"
    > using the "self delimiting numeric values" (SDNV) proposed in
    > <http://tools.ietf.org/html/draft-eddy-dtn-sdnv>
    >
    > Each octet transports 7 bits ?1234567. If the most significant
    > bit is a 0 it's the terminating octet, otherwise another octet
    > follows. With that you'd get:
    >
    > 1x 1y 0z => 21 bits (for 1x different from 1000 0000)
    > 1x 0y => 14 bits (for 1x different from 1000 0000)
    > 0x => 7 bits (the ASCII range)
    >
    > Of course the 0y or 0z in multibyte sequences could cause havoc,
    > especially for 0000 0000, but in theory it's simpler than UTF-8.
    >
    > Frank


