Re: Unicode & space in programming & l10n

From: Doug Ewell
Date: Thu Sep 21 2006 - 08:34:13 CDT

    Hans Aberg <haberg at math dot su dot se> wrote:

    > Another method, which enables compressing both characters (code
    > points) and natural language words (sequences of code points), might
    > be to make modified UTF-8, where the leading byte admits indicating
    > two categories of numbers. (Continued below.)

    Whatever you do, do NOT call it "UTF-anything."

    I'm currently compressing names in the Unicode character list using a
    variable-length byte-based scheme that encodes common words like LETTER
    in one byte and rare words like SPATHI in two bytes. The range of trail
    bytes is allowed to overlap the range of lead bytes, since backward
    parsing doesn't matter for this specific application. It has some
    characteristics in common with UTFs, but it isn't a UTF and I pledge not
    to call it one.
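
    For illustration only, here is a minimal Python sketch of a scheme of
    this general kind. The byte layout is an assumption, not the format
    described above: lead bytes 0x00-0x7F are taken to be complete one-byte
    codes for the 128 most common words, and lead bytes 0x80-0xFF are taken
    to start a two-byte code whose trail byte may be any value 0x00-0xFF.
    The names build_table, encode and decode are hypothetical.

```python
# Hypothetical layout (an assumption, not the actual format from the post):
#   lead 0x00-0x7F : one-byte code for word index 0..127 (most common words)
#   lead 0x80-0xFF : two-byte code; the trail byte can be ANY value 0x00-0xFF,
#                    so trail bytes overlap the lead-byte range.
ONE_BYTE_CODES = 0x80   # number of words that fit in a single byte
TWO_BYTE_LEADS = 0x80   # number of distinct lead bytes for two-byte codes


def build_table(words_by_frequency):
    """Map each word to its index; the list must be sorted most common first."""
    capacity = ONE_BYTE_CODES + TWO_BYTE_LEADS * 256
    if len(words_by_frequency) > capacity:
        raise ValueError("too many words for this layout")
    return {word: i for i, word in enumerate(words_by_frequency)}


def encode(name, table):
    """Encode a space-separated name as a byte string."""
    out = bytearray()
    for word in name.split(" "):
        index = table[word]
        if index < ONE_BYTE_CODES:
            out.append(index)                    # common word: one byte
        else:
            rest = index - ONE_BYTE_CODES
            out.append(0x80 + (rest >> 8))       # lead byte, 0x80-0xFF
            out.append(rest & 0xFF)              # trail byte, 0x00-0xFF
    return bytes(out)


def decode(data, words_by_frequency):
    """Decode by scanning forward only; backward parsing is not supported."""
    words = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            index = lead
            i += 1
        else:
            index = ONE_BYTE_CODES + ((lead - 0x80) << 8) + data[i + 1]
            i += 2
        words.append(words_by_frequency[index])
    return " ".join(words)


# Made-up vocabulary: three common words, 125 filler words, then one rare word.
vocab = ["LETTER", "CAPITAL", "LATIN"] + [f"W{i}" for i in range(125)] + ["SPATHI"]
table = build_table(vocab)
encoded = encode("LETTER SPATHI", table)
print(encoded.hex())            # 008000
print(decode(encoded, vocab))   # LETTER SPATHI
```

    In this made-up vocabulary, LETTER encodes as the single byte 00 and
    SPATHI as 80 00. The trail byte 00 is identical to the one-byte code for
    LETTER, so a decoder that starts in the middle of the stream, or scans
    backward, cannot tell the two apart. A forward-only decoder has no such
    problem, and allowing the overlap gives every lead byte 256 trail values
    instead of a restricted subset.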

    Doug Ewell
    Fullerton, California, USA
    RFC 4645  *  UTN #14
