Re: Proposing UTF-21/24

From: Frank Ellermann (nobody@xyzzy.claranet.de)
Date: Sun Jan 21 2007 - 16:30:24 CST

Next message: Mike: "Re: Regulating PUA."

Previous message: Michael Maxwell: "RE: Proposing UTF-21/24"
In reply to: David Starner: "Re: Proposing UTF-21/24"
Next in thread: Mike: "Re: Proposing UTF-21/24"
Reply: Mike: "Re: Proposing UTF-21/24"
Reply: Mark Davis: "Re: Proposing UTF-21/24"

David Starner wrote:

> current encodings designed with an extreme concern for size, like
> SCSU and BOCU, frequently aren't used, because UTF-8 or UTF-16
> combined with a general purpose compression scheme works much
> better for any long text.

Yes, but the 3*7 approach is still fascinating because it's so
simple. When UTF-8 was invented they couldn't do this, because
they needed something that could cover 31 bits.

With 3*7 it's (in theory) possible to replace UTF-8 by "UTF-24"
using the "self delimiting numeric values" (SDNV) proposed in
<http://tools.ietf.org/html/draft-eddy-dtn-sdnv>

Each octet transports 7 payload bits, ?1234567, where ? is the
most significant bit. If that bit is a 0 it's the terminating
octet, otherwise another octet follows. With that you'd get:

1x 1y 0z => 21 bits (for 1x different from 1000 0000)
1x 0y    => 14 bits (for 1x different from 1000 0000)
0x       =>  7 bits (the ASCII range)
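
As a rough illustration of the three forms above, here is a minimal
Python sketch of such a codec. The names sdnv_encode and sdnv_decode
are made up for this example and come from neither the SDNV draft nor
any library. The sketch assumes the "1x different from 1000 0000" rule
exists to forbid non-shortest forms, and it caps sequences at three
octets.

```python
def sdnv_encode(cp):
    """Encode one code point (0..2^21-1) as 1 to 3 octets, 7 bits each."""
    if not 0 <= cp < 1 << 21:
        raise ValueError("value does not fit in 21 bits")
    # Split into 7-bit groups, least significant group first.
    groups = [cp & 0x7F]
    cp >>= 7
    while cp:
        groups.append(cp & 0x7F)
        cp >>= 7
    groups.reverse()
    # Every octet except the last gets the continuation bit (MSB = 1).
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def sdnv_decode(data):
    """Decode a byte string into a list of code points."""
    out = []
    value = 0
    length = 0
    for octet in data:
        if length == 0 and octet == 0x80:
            # A leading 1000 0000 only adds seven zero bits (assumed illegal).
            raise ValueError("non-shortest form")
        value = (value << 7) | (octet & 0x7F)
        length += 1
        if octet & 0x80 == 0:
            # MSB clear: this is the terminating octet.
            out.append(value)
            value = 0
            length = 0
        elif length == 3:
            raise ValueError("sequence longer than three octets")
    if length:
        raise ValueError("truncated sequence")
    return out
```

For example, sdnv_encode(0x41) gives the single octet 0x41, while
sdnv_encode(0x10FFFF) gives the three octets C3 FF 7F.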

Of course the 0y or 0z that ends a multibyte sequence looks exactly
like an ASCII octet, which could cause havoc, especially for
0000 0000, but in theory it's simpler than UTF-8.
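
To make the havoc concrete, here is a small Python sketch, again with
a made-up helper name (utf24_encode) and assuming the 3*7 layout
described in this message. It shows two non-ASCII characters whose
final octets come out as NUL and as "/", something UTF-8 never does.

```python
def utf24_encode(text):
    """Hypothetical 3*7 encoder: MSB set on every octet except the last."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 1 << 14:
            out.append(0x80 | (cp >> 14))          # top 7 bits
        if cp >= 1 << 7:
            out.append(0x80 | ((cp >> 7) & 0x7F))  # middle 7 bits
        out.append(cp & 0x7F)                      # final octet, MSB = 0
    return bytes(out)


sample = "\u0080\u00af"           # two non-ASCII characters
encoded = utf24_encode(sample)    # octets 81 00 81 2F

# The terminating octets collide with ASCII NUL and "/".
has_nul = b"\x00" in encoded
has_slash = b"/" in encoded

# UTF-8 keeps every octet of a multibyte sequence at 0x80 or above.
utf8_is_safe = all(b >= 0x80 for b in sample.encode("utf-8"))
```

Any byte-oriented code that treats 0x00 as a string terminator or
0x2F as a path separator would misread such text.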

Frank
