Re: Proposing UTF-21/24

From: Frank Ellermann (nobody@xyzzy.claranet.de)
Date: Sat Jan 20 2007 - 19:32:59 CST

    Ruszlan Gaszanov wrote:

    > Any comments?

    Some of your arguments, like "won't need a BOM anymore", don't make
    sense to me, but the UTF-24A idea is nice. Even if I'm lost in
    a sequence of UTF-24A octets I can always find the start or end
    of a UTF-24A code point: 1P0 can be 100 or 110, therefore bytes
    with MSB 1 are the start unless the previous byte also has MSB 1,
    and then the previous byte is the start. Similarly, MSB 0 could
    be used to determine the end.
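
    The resynchronisation rule above can be sketched in Python. This is
    only an illustration under assumptions: the layout 1xxxxxxx Pxxxxxxx
    0xxxxxxx (seven payload bits per octet) is inferred from the "1P0"
    notation, the parity definition (P = number of set bits in the code
    point, modulo 2) is a guess, and the names encode_utf24a and
    find_start are hypothetical. The handling of bytes with MSB 0 is an
    extension of the rule, not something stated in this message. The
    buffer is assumed to be well formed and to begin on a code point
    boundary.

```python
def msb(byte: int) -> int:
    """Return the most significant bit of an octet."""
    return byte >> 7


def encode_utf24a(cp: int) -> bytes:
    """Hypothetical encoder for the assumed layout 1xxxxxxx Pxxxxxxx 0xxxxxxx.

    The parity definition is an assumption: P is the number of set bits
    in the code point, modulo 2.
    """
    high, middle, low = (cp >> 14) & 0x7F, (cp >> 7) & 0x7F, cp & 0x7F
    parity = bin(cp).count("1") & 1
    return bytes([0x80 | high, (parity << 7) | middle, low])


def find_start(buf: bytes, i: int) -> int:
    """Return the index of the first octet of the code point containing buf[i]."""
    if msb(buf[i]) == 1:
        # Rule from the message: MSB 1 marks the start, unless the
        # previous byte also has MSB 1, in which case that one is the start.
        if i > 0 and msb(buf[i - 1]) == 1:
            return i - 1
        return i
    # Extension (assumption): a byte with MSB 0 is the second or third octet.
    if i == 0:
        raise ValueError("buffer does not begin on a code point boundary")
    if msb(buf[i - 1]) == 0 or (i >= 2 and msb(buf[i - 2]) == 1):
        start = i - 2  # MSB patterns 1,0,[0] or 1,1,[0]: this is the third octet
    else:
        start = i - 1  # MSB pattern 1,[0] right after a start: second octet
    if start < 0:
        raise ValueError("buffer does not begin on a code point boundary")
    return start


data = b"".join(encode_utf24a(cp) for cp in (0x41, 0x20AC, 0x10FFFF))
starts = [find_start(data, i) for i in range(len(data))]
```

    Looking at no more than two preceding bytes is enough here, because
    the last octet of every code point has MSB 0.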

    One disadvantage of your scheme: unlike UTF-8, it can't be directly
    expressed in CharmapML. The parity bit destroys simple patterns,
    and an enumeration of 2**21 (minus surrogates) code points won't
    fly. But BOCU-1 has the same issue, so that's no showstopper.

    Maybe you could use a trick: instead of 1P0, use 100 and 110 for
    UTF-24E (even) and UTF-24O (odd) CharmapML descriptions, plus a
    comment that one half of the real UTF-24 corresponds to UTF-24E,
    and the other half to UTF-24O.
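
    A small Python sketch of that split, again under assumptions: the
    parity bit is taken to be the number of set bits in the code point,
    modulo 2 (not specified in this message), and the function names are
    made up for illustration. Each code point then falls into exactly
    one of the two descriptions, and within a description the three
    marker bits are constant.

```python
def parity_bit(cp: int) -> int:
    """Assumed parity definition: number of set bits in cp, modulo 2."""
    return bin(cp).count("1") & 1


def charmap_variant(cp: int) -> str:
    """Name the hypothetical CharmapML description that would cover cp."""
    return "UTF-24O" if parity_bit(cp) else "UTF-24E"


def marker_bits(cp: int) -> str:
    """Fixed MSB pattern of the three octets within the chosen variant."""
    return "110" if parity_bit(cp) else "100"


for cp in (0x41, 0x20AC, 0x10FFFF):
    print(f"U+{cp:04X}: {charmap_variant(cp)}, MSBs {marker_bits(cp)}")
```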

    Compare <http://purl.net/xyzzy/home/test/utf-8.xml> for one of my
    two CharmapML experiments.

    Frank


