Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Thu Sep 21 2006 - 06:36:16 CDT


    On 21 Sep 2006, at 07:01, Doug Ewell wrote:

    > Different compression methods work in different ways. Certainly, a
    > compression method that is specifically designed for Unicode text
    > can take advantage of the unique properties of Unicode text, as
    > compared to, say, photographic images.

    I guess that is the simple point: the more structure one can
    recognize, the better a compression method can do.

    > I've often suspected that a Huffman or arithmetic encoder that
    > encoded Unicode code points directly would perform better than a
    > byte-based one working with UTF-8 code units. I haven't done the
    > math to prove it, though.

    And specifically, recognizing common words in natural languages is
    something that can be done when working with Unicode code points, and
    this is perhaps harder to do with a byte-oriented compression
    method.
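
    A rough way to see the effect of the choice of symbol is to compute
    the ideal order-0 (context-free) coding cost of the same text with
    three different alphabets: UTF-8 bytes, code points, and word-like
    tokens. This is only a sketch under stated assumptions: the helper
    names are made up for illustration, the tokenizer is a simple regex
    split, and the cost of transmitting the frequency table (which grows
    with the alphabet) is ignored.

```python
import math
import re
from collections import Counter


def order0_bits(symbols):
    """Ideal order-0 entropy-coded size, in bits, of a symbol sequence.

    Assumption: the frequency table itself is free, which favours
    larger alphabets (code points, words) over bytes.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())


def symbol_costs(text):
    """Compare three choices of symbol for the same text."""
    # Lossless split into runs of word characters and runs of everything else.
    tokens = re.findall(r"\w+|\W+", text)
    return {
        "utf8_bytes": order0_bits(text.encode("utf-8")),
        "code_points": order0_bits(text),
        "words": order0_bits(tokens),
    }
```

    For an order-0 model the code point cost can never exceed the UTF-8
    byte cost, and the two coincide for pure ASCII text. The word-level
    figure is not guaranteed to be smaller, but it tends to be for text
    with many repeated words.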

    But it also hinges on how advanced the pattern recognition of a
    byte-oriented compression method is: a character code point pattern
    is translated into a byte pattern in a given character encoding, so
    it might in principle be possible for the byte-oriented compression
    method to recognize it. But it then needs to be able to recognize
    multibyte patterns, not only single bytes.
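
    As an illustration, a repeated non-ASCII word becomes a repeated
    multibyte sequence in UTF-8, which a byte-oriented compressor with
    string matching can pick up. The sketch below uses zlib (DEFLATE)
    from the Python standard library as one example of such a method;
    the function name, the sample word, and the repeat count are
    arbitrary choices for the demonstration.

```python
import zlib


def multibyte_pattern_demo(word="данные", repeats=200):
    """Show that a code point pattern maps to a fixed UTF-8 byte pattern."""
    text = (word + " ") * repeats
    data = text.encode("utf-8")
    pattern = word.encode("utf-8")  # the same word, seen as bytes
    return {
        "pattern_code_points": len(word),
        "pattern_bytes": len(pattern),
        # Each occurrence of the word is one occurrence of the byte pattern.
        "occurrences_in_bytes": data.count(pattern),
        "raw_bytes": len(data),
        "deflate_bytes": len(zlib.compress(data, 9)),
    }
```

    Here the six code points of the word turn into twelve bytes, and the
    compressor has to match that whole twelve-byte run to exploit the
    repetition. A coder that models single bytes only would not see it.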

       Hans Aberg


