Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Thu Sep 21 2006 - 09:28:14 CDT


    On 21 Sep 2006, at 15:34, Doug Ewell wrote:

    >> Another method, which enables compressing both characters (code
    >> points) and natural language words (sequences of code points),
    >> might be to make modified UTF-8, where the leading byte admits
    >> indicating two categories of numbers. (Continued below.)
    >
    > Whatever you do, do NOT call it "UTF-anything."

    Don't worry. :-)

    > I'm currently compressing names in the Unicode character list using
    > a variable-length byte-based scheme that encodes common words like
    > LETTER in 1 byte and rare words like SPATHI in two bytes.

    So then, why not (if this is not what you are already doing) just take
    a large body of English text and compute the frequency of each word in
    it? Then sort the list, putting the more frequent words first, and
    give each word the number it has in this list. Then apply UTF-8 to
    that number (or some other variable-length encoding). Alternatively,
    if infrequent words are not given numbers at all but are just
    represented character by character, apply a modification that
    distinguishes characters from words, and you have your
    variable-length word encoding. (The modification of UTF-8 that comes
    to my mind, giving separate numbers to words and characters, is that
    the leading byte is given the form 1...10nx..., where n is a flag
    bit, say 0 = character, 1 = word. The points are that small
    non-negative integers are given shorter binary representations, and
    that the character and word numberings are kept separate. So it is
    easy to play around with other modifications.)
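
A minimal sketch of the idea above, in Python. The exact bit layout is an assumption: the message only gives the leading-byte shape 1...10nx..., so the table `_FORMS`, the 6-bit trail bytes borrowed from UTF-8, and the names `rank_words`, `encode` and `decode` are all hypothetical choices made for illustration.

```python
from collections import Counter

def rank_words(text):
    """Map each word to its frequency rank (0 = most frequent)."""
    counts = Counter(text.split())
    # Descending count first; alphabetical order makes ties deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered)}

# Assumed forms: (payload bits in the lead byte, number of trail bytes).
#   0nxxxxxx / 110nxxxx / 1110nxxx / 11110nxx / 111110nx
_FORMS = [(6, 0), (4, 1), (3, 2), (2, 3), (1, 4)]

def encode(value, is_word):
    """Encode one number; the flag bit n is 1 for a word, 0 for a character."""
    for lead_bits, trails in _FORMS:
        if value < 1 << (lead_bits + 6 * trails):
            break
    else:
        raise ValueError("value too large for this sketch")
    ones = trails + 1 if trails else 0            # length of the 1...1 prefix
    prefix = ((1 << ones) - 1) << (8 - ones) if ones else 0
    lead = prefix | (int(is_word) << lead_bits) | (value >> (6 * trails))
    out = [lead]
    for i in range(trails - 1, -1, -1):           # trail bytes: 10xxxxxx
        out.append(0x80 | ((value >> (6 * i)) & 0x3F))
    return bytes(out)

def decode(data):
    """Decode a byte string into a list of (is_word, value) pairs."""
    items, i = [], 0
    while i < len(data):
        lead = data[i]
        ones = 0
        while lead & (0x80 >> ones):              # count the leading 1 bits
            ones += 1
        if ones == 1 or ones > 5:
            raise ValueError("invalid lead byte")
        trails = max(ones - 1, 0)
        lead_bits = 8 - ones - 2                  # minus the 0 and the flag bit
        is_word = bool((lead >> lead_bits) & 1)
        value = lead & ((1 << lead_bits) - 1)
        for b in data[i + 1 : i + 1 + trails]:
            value = (value << 6) | (b & 0x3F)
        items.append((is_word, value))
        i += 1 + trails
    return items
```

With this layout, the 64 most frequent words and the 64 lowest character numbers each fit in one byte, and a stream can mix both, for example `encode(ranks["the"], True) + encode(ord("x"), False)`.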

    > The range of trail bytes is allowed to overlap the range of lead
    > bytes, since backward parsing doesn't matter for this specific
    > application.

    The UTF-8 design goal of avoiding overlap between the lead-byte and
    trail-byte ranges probably isn't important in these compression
    schemes, so more bit-efficient encodings might be developed. For
    example, one variable-byte encoding and one variable-bit encoding.
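
To illustrate what is gained when trail bytes may overlap lead bytes, here is a sketch of a one-or-two-byte code for word numbers. It is not the scheme from the quoted message, whose details are not given; the split point `ONE_BYTE` and the function names are hypothetical.

```python
ONE_BYTE = 192  # assumed split: lead bytes below this are complete 1-byte codes

def encode_rank(rank):
    """Encode a word number in 1 byte if small, otherwise in 2 bytes."""
    if rank < ONE_BYTE:
        return bytes([rank])
    lead, trail = divmod(rank - ONE_BYTE, 256)
    if lead >= 256 - ONE_BYTE:
        raise ValueError("rank too large for two bytes")
    # The trail byte may take any value 0..255, including lead-byte values.
    return bytes([ONE_BYTE + lead, trail])

def decode_ranks(data):
    """Decode a byte string front to back into a list of word numbers."""
    ranks, i = [], 0
    while i < len(data):
        lead = data[i]
        if lead < ONE_BYTE:
            ranks.append(lead)
            i += 1
        else:
            ranks.append(ONE_BYTE + (lead - ONE_BYTE) * 256 + data[i + 1])
            i += 2
    return ranks
```

Under these assumptions, one or two bytes cover 192 + 64 * 256 = 16576 numbers, whereas UTF-8 reaches only up to U+07FF in two bytes. The price is that the stream can only be parsed forward from a known starting point.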

    > It has some characteristics in common with UTFs, but it isn't a UTF
    > and I pledge not to call it one.

    I'm not in the naming business. :-)

       Hans Aberg



    This archive was generated by hypermail 2.1.5 : Thu Sep 21 2006 - 09:30:50 CDT