Re: Nicest UTF

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Dec 06 2004 - 10:22:13 CST


    Arcane Jill <arcanejill at ramonsky dot com> wrote:

    > Probably a dumb question, but how come nobody's invented "UTF-24" yet?
    > I just made that up, it's not an official standard, but one could
    > easily define UTF-24 as UTF-32 with the most-significant byte (which
    > is always zero) removed, hence all characters are stored in exactly
    > three bytes and all are treated equally. You could have UTF-24LE and
    > UTF-24BE variants, and even UTF-24 BOMs. Of course, I'm not suggesting
    > this is a particularly brilliant idea, but I just wonder why no-one's
    > suggested it before.

    It has been suggested before, by Pim Blokland on April 3, 2003, in a
    message titled "UTF-24." If you get the digest, it's in Digest V3 #79.

    > The "UTF-24" thing seems a reasonably sensible question though. Is it
    > just that we don't like it because some processors have alignment
    > restrictions or something?

    Almost all do. In addition, no programming language I know of has a
    3-byte-wide integer data type (maybe INTERCAL does), so the efficiency
    of UTF-24 would be wasted in software as well as in hardware.

    Besides that, there were the usual protests that supplementary
    characters would be vanishingly rare in the context of "normal" text,
    and that one should use compression (SCSU/BOCU or general-purpose
    tools) if size is an issue.

    None of this stopped me from experimentally implementing it, of course,
    but I haven't touched it since finishing the implementation.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


