Re: Proposing UTF-21/24

From: Frank Ellermann
Date: Sun Jan 21 2007 - 16:30:24 CST

    David Starner wrote:

    > current encodings designed with an extreme concern for size, like
    > SCSU and BOCU, frequently aren't used, because UTF-8 or UTF-16
    > combined with a general purpose compression scheme works much
    > better for any long text.

    Yes, but the 3*7 approach is still fascinating because it's so
    simple. When UTF-8 was invented they couldn't do this, because
    they needed something for 31 bits.

    With 3*7 it's (in theory) possible to replace UTF-8 by "UTF-24"
    using the "self delimiting numeric values" (SDNV) proposed in

    Each octet transports 7 bits ?1234567. If the most significant
    bit is a 0 it's the terminating octet, otherwise another octet
    follows. With that you'd get:

    1x 1y 0z => 21 bits (for 1x different from 1000 0000)
    1x 0y => 14 bits (for 1x different from 1000 0000)
    0x => 7 bits (the ASCII range)
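
    To make the three cases above concrete, here is a minimal Python
    sketch of such a 3*7 scheme. The names encode_sdnv and decode_sdnv
    are made up for illustration, and the big-endian group order is an
    assumption read off the 1x 1y 0z notation; nothing here is taken
    from an actual "UTF-24" specification.

```python
def encode_sdnv(cp):
    """Encode one code point as 1 to 3 octets carrying 7 payload bits each."""
    if not 0 <= cp < (1 << 21):
        raise ValueError("value does not fit in 21 bits")
    groups = [cp & 0x7F]                   # terminating octet, high bit 0
    cp >>= 7
    while cp:
        groups.append(0x80 | (cp & 0x7F))  # leading octet(s), high bit 1
        cp >>= 7
    return bytes(reversed(groups))         # most significant group first


def decode_sdnv(data):
    """Decode a byte string into a list of code points."""
    out = []
    value = 0
    length = 0
    for octet in data:
        if length == 0 and octet == 0x80:
            # "1x different from 1000 0000": reject non-shortest forms
            raise ValueError("leading octet 1000 0000 is not allowed")
        value = (value << 7) | (octet & 0x7F)
        length += 1
        if octet & 0x80 == 0:              # high bit 0 ends the sequence
            out.append(value)
            value, length = 0, 0
        elif length == 3:
            raise ValueError("sequence longer than 3 octets")
    if length:
        raise ValueError("truncated sequence")
    return out


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x10FFFF):
        print(f"U+{cp:04X} -> {encode_sdnv(cp).hex(' ')}")
```

    Skipping empty leading groups in the encoder is what guarantees
    that the first octet of a multibyte sequence is never 1000 0000.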

    Of course the 0y or 0z in multibyte sequences could cause havoc,
    since a trailing octet is indistinguishable from a plain ASCII
    octet, especially for 0000 0000, but in theory it's simpler than
    UTF-8.
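
    The havoc is easy to show with a short Python sketch. The helper
    encode_sdnv is a hypothetical name, repeated here so the example
    stands alone, and it assumes the most significant 7-bit group
    comes first. A terminating octet can be any value from 00 to 7F,
    so a byte-level search for NUL or "/" hits inside multibyte
    sequences, which cannot happen in UTF-8.

```python
def encode_sdnv(cp):
    """Hypothetical 3*7 encoder: 7 payload bits per octet, high bit 0 ends."""
    if not 0 <= cp < (1 << 21):
        raise ValueError("value does not fit in 21 bits")
    groups = [cp & 0x7F]                   # terminating octet, high bit 0
    cp >>= 7
    while cp:
        groups.append(0x80 | (cp & 0x7F))  # leading octet(s), high bit 1
        cp >>= 7
    return bytes(reversed(groups))


nul_case = encode_sdnv(0x80)    # 81 00: ends in a NUL octet
slash_case = encode_sdnv(0xAF)  # 81 2F: ends in the ASCII code for "/"
utf8_case = "\u00af".encode("utf-8")  # c2 af: no octet below 80

if __name__ == "__main__":
    print(nul_case.hex(" "), b"\x00" in nul_case)
    print(slash_case.hex(" "), b"/" in slash_case)
    print(utf8_case.hex(" "), b"/" in utf8_case)
```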

