Re: Unicode & space in programming & l10n

From: Doug Ewell (
Date: Mon Sep 25 2006 - 17:39:43 CST

    Hans Aberg <haberg at math dot su dot se> wrote:

    > I originally misinterpreted what you said, as in math, what you say
    > would be phrased something like: high frequency words are likely to
    > have occurrences in the first few sentences of the text.

    That is what he said. High-frequency words are more likely to occur
    everywhere within the text. That's what makes them high-frequency.

    >> ...and this is clearly off-topic for the list.
    > Perhaps, perhaps not: it might be good to clarify the wished-for
    > properties of a compressed natural language text body. From time to
    > time, people discussing on this list, want to use this or other
    > Unicode character encoding for compressing purposes.

    *red flag*

    It's always dangerous to think in terms of using Unicode character
    encoding schemes for compression, because:

    1. if the data being compressed is not "Unicode code points," but looks
    like it, there is a chance of misinterpreting the data and confusing the
    two issues, and

    2. most Unicode character encoding schemes are not intended, or
    optimal, for compression.

    You can certainly build a compression model that encodes frequent items
    in fewer bits and rare items in more bits -- that's pretty much what all
    compression methods do -- and you can apply some of the concepts
    employed in UTF-8, or double-byte character sets like JIS, to help build
    this model.
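
    As a rough sketch of "frequent items in fewer bits," here is a textbook
    Huffman code-length calculation. This is a stand-in technique chosen for
    illustration, not anything defined by Unicode; the function name and the
    sample string are made up for the example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]

sample = "high frequency words occur everywhere within the text"
lengths = huffman_code_lengths(sample)
total_bits = sum(lengths[ch] for ch in sample)
print(f"{total_bits} bits with Huffman lengths vs {8 * len(sample)} bits at one byte each")
```

    The code lengths here come from the data's own statistics, which is
    exactly what a fixed scheme like UTF-8 cannot offer.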

    But as soon as you start "using UTF-8" or some other Unicode CES to
    compress non-Unicode data, you are not only missing the point of
    UTF-8 -- it's intended for ASCII transparency combined with complete
    Unicode coverage, NOT for compression -- but you are getting a rather
    poor general-purpose compression model to boot.
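
    To make that concrete, here is a small sketch of what happens when
    arbitrary integers are pushed through UTF-8 as if they were code points.
    The helper name utf8_cost and the sample values are hypothetical, picked
    only to show the boundaries.

```python
def utf8_cost(value):
    """Bytes needed to store an integer by pretending it is a code point.

    Returns None when the value has no UTF-8 form at all.
    """
    try:
        return len(chr(value).encode("utf-8"))
    except ValueError:
        # chr() rejects values above 0x10FFFF with ValueError, and encoding
        # a surrogate raises UnicodeEncodeError, a ValueError subclass.
        return None

for value in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0xD800, 0x110000):
    print(f"{value:#08x} -> {utf8_cost(value)}")
```

    The size bands are fixed by numeric range, not by how often a value
    occurs, and values in the surrogate range or above 0x10FFFF cannot be
    represented at all, because UTF-8 only encodes Unicode scalar values.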

    For example, one of the most commonly mentioned beneficial features of
    UTF-8 (maybe too commonly) is that the byte patterns allow forward and
    backward scanning. That feature is great for text processing, but not
    very important for compression, and it reduces the number of usable
    N-byte combinations, which lowers the compression ratio you can achieve.
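
    The loss of combinations is easy to measure by brute force. This sketch
    counts how many of all possible N-byte strings decode as well-formed
    UTF-8, using Python's strict decoder as the judge; the function name is
    made up for the example, and only N = 1 and 2 are tried because the
    search grows as 256^N.

```python
from itertools import product

def count_valid_utf8(n):
    """Count the n-byte strings that are well-formed UTF-8."""
    valid = 0
    for combo in product(range(256), repeat=n):
        try:
            bytes(combo).decode("utf-8")  # strict decoding by default
        except UnicodeDecodeError:
            continue
        valid += 1
    return valid

for n in (1, 2):
    total = 256 ** n
    valid = count_valid_utf8(n)
    print(f"{n}-byte strings: {valid} of {total} are valid UTF-8 ({valid / total:.1%})")
```

    This reports 128 of 256 (50.0%) for one byte and 18304 of 65536 (27.9%)
    for two bytes: 128 x 128 ASCII pairs plus the 1920 two-byte sequences
    for U+0080 through U+07FF. Every remaining bit pattern is spent on
    self-synchronization rather than on carrying data.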

    Be sure you understand the job at hand, and be careful to use the right
    tools for the job. A hammer makes a great hammer, but a lousy

    Doug Ewell
    Fullerton, California, USA
    RFC 4645  *  UTN #14
