Re: Proposing UTF-21/24

From: Philippe Verdy (
Date: Mon Jan 22 2007 - 02:46:39 CST


    Well, it is not stupid, and nothing forbids an implementation from using an alternate representation for its internal use, provided that it does not claim that this representation is meant for interchange...

    So if there are developers who prefer using so-called "UTF-21" or "UTF-24" (names not to be used in any interchange, because the "UTF-" prefix is reserved!) for their local processing or for their internal data storage, I don't see what Unicode or ISO/IEC 10646 is blocking there! But I think that these developers should tag their internal data with something other than "UTF-*", to avoid interoperability problems with possible future interchange formats. So I would call them "x-UTF-21" and "x-UTF-24", using the "x-" prefix for private local use! And in this case, such a private encoding could also represent non-codepoints (planes 17 and higher) if they wish for their internal processing, and they will have to find solutions to convert their documents or data to some interchange format (using rich-text formats, and/or PUA characters).
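
To make the idea concrete, here is a minimal sketch of what such a private fixed-width form could look like. Everything in it is an assumption for illustration only: the function names, the big-endian byte order, and the choice to reject out-of-range values at the interchange boundary are not part of any specification.

```python
def encode_x_utf24(values):
    """Pack each integer in 0..0xFFFFFF into three bytes (big-endian, an assumed order)."""
    out = bytearray()
    for v in values:
        if not 0 <= v <= 0xFFFFFF:
            raise ValueError(f"value out of 24-bit range: {v:#x}")
        out += v.to_bytes(3, "big")
    return bytes(out)


def decode_x_utf24(data):
    """Unpack a byte string produced by encode_x_utf24 into a list of integers."""
    if len(data) % 3:
        raise ValueError("length is not a multiple of 3")
    return [int.from_bytes(data[i:i + 3], "big") for i in range(0, len(data), 3)]


def to_interchange(values):
    """Convert internal values to a str; refuse anything that is not a Unicode scalar value."""
    for v in values:
        if v > 0x10FFFF or 0xD800 <= v <= 0xDFFF:
            raise ValueError(f"not representable in an interchange format: {v:#x}")
    return "".join(chr(v) for v in values)
```

Internally the form can carry a value such as 0x110000, which lies outside the Unicode code space, but `to_interchange` refuses it. That is the point at which an application would need its own mapping (for example to PUA characters) before the data leaves the program.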

    But it's true that if one wants compact data storage, SCSU has lots of benefits, is really simple to decode, and can be encoded simply (even if the most compact encoding is not so easy to implement). If one wants a predictable encoding (which does not depend on the encoder implementation), BOCU is already there. Both have the merit of being interchangeable, because they are supported by a free public specification.

      ----- Original Message -----
      From: Mark Davis
      To: Frank Ellermann
      Sent: Monday, January 22, 2007 1:47 AM
      Subject: Re: Proposing UTF-21/24

      This has the very significant problem of ASCII incompatibility: the key advantage of UTF-8 is that values of 0..127 are never part of a multibyte character. That is one of the reasons why the simple approach of just using 7 bits of content with a bit to say "has continuation", while considered, never got any traction. (That mechanism for compressing integers or arrays of them, on the other hand, is fairly common.)
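
The ASCII problem with a "7 bits plus continuation flag" scheme can be shown in a few lines. This is only a sketch: the function name and the low-group-first byte order are assumptions, since the message does not specify a layout. Whatever the order, the last byte of each character has its high bit clear and so falls in 0..127.

```python
def encode_varint(cp):
    """Encode one code point as 7-bit groups, low group first.

    The high bit of each byte means "another byte follows" (an assumed layout).
    """
    out = bytearray()
    while True:
        low = cp & 0x7F
        cp >>= 7
        if cp:
            out.append(low | 0x80)  # more groups follow
        else:
            out.append(low)         # final byte: high bit clear
            return bytes(out)


khmer_nine = 0x17E9  # U+17E9 KHMER DIGIT NINE

# In this scheme the character ends with the byte 0x2F, which is ASCII "/".
assert encode_varint(khmer_nine) == b"\xe9/"

# In UTF-8 every byte of a multibyte character is 0x80 or above.
assert chr(khmer_nine).encode("utf-8") == b"\xe1\x9f\xa9"
```

Code that scans the byte stream for "/" (a path separator, say) would find a false match inside the first encoding but not inside the UTF-8 one.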

      IMO, the whole discussion of "UTF-24" is of only academic interest; both UTF-8 and UTF-16 have better storage characteristics (remember that 4-byte characters have, and will have, extremely low frequency of usage), and for in-memory handling "UTF-24" doesn't buy much.
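
The storage claim is easy to check for a given text. The helper below is a hypothetical illustration (its name and the dictionary keys are made up here); it counts bytes for UTF-8, UTF-16 without a BOM, and a fixed three bytes per code point.

```python
def storage_sizes(text):
    """Return the byte count of text in three representations."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le avoids counting a BOM
        "fixed-24": 3 * len(text),                # len() counts code points in Python 3
    }


print(storage_sizes("hello"))        # ASCII text
print(storage_sizes("\u4e2d\u6587"))  # two BMP CJK characters
print(storage_sizes("\U0001F600"))   # one supplementary-plane character
```

Only for the supplementary-plane character is the 24-bit form smaller (3 bytes against 4), which is why the low frequency of such characters matters for the comparison.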

    This archive was generated by hypermail 2.1.5 : Mon Jan 22 2007 - 02:48:58 CST