Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Tue Sep 26 2006 - 02:11:58 CST

    On 26 Sep 2006, at 01:39, Doug Ewell wrote:

    >> I originally misinterpreted what you said, since in math, what
    >> you say would be phrased as something like: high-frequency words
    >> are likely to have occurrences in the first few sentences of the
    >> text.
    >
    > That is what he said.

    This is not a point of great significance, but I would construe that
    as though it excludes later occurrences. It is probably just the
    usage in metamathematics, where it is important to keep track of
    those silly little details, that plays a trick on my reading. :-)

    > High-frequency words are more likely to occur everywhere within the
    > text. That's what makes them high-frequency.

    Only if the text is unstructured. String together the human rights
    declaration in 200 different languages to get a counterexample.
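
    (A minimal Python sketch of this dispersion point, purely my own
    illustration, not anything from the thread: measure in what fraction
    of equal-sized chunks of a text a given word occurs. In an
    unstructured monolingual text a high-frequency word scores near 1.0;
    in a concatenation of 200 languages it does not.)

        def dispersion(text: str, word: str, chunks: int = 10) -> float:
            """Fraction of equal-sized chunks of `text` containing `word`."""
            words = text.split()
            size = max(1, len(words) // chunks)
            parts = [words[i:i + size] for i in range(0, len(words), size)]
            return sum(word in p for p in parts) / len(parts)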

    >> Perhaps, perhaps not: it might be good to clarify the wished-for
    >> properties of a compressed natural language text body. From time
    >> to time, people on this list want to use one or another Unicode
    >> character encoding for compression purposes.
    >
    > *red flag*
    >
    > It's always dangerous to think in terms of using Unicode character
    > encoding schemes for compression, because:
    >
    > 1. if the data being compressed is not "Unicode code points," but
    > looks like it, there is a chance of misinterpreting the data and
    > confusing the two issues, and
    >
    > 2. most Unicode character encoding schemes are not intended, or
    > optimal, for compression.
    >
    > You can certainly build a compression model that encodes frequent
    > items in fewer bits and rare items in more bits -- that's pretty
    > much what all compression methods do -- and you can apply some of
    > the concepts employed in UTF-8, or double-byte character sets like
    > JIS, to help build this model.
    >
    > But as soon as you start "using UTF-8" or some other Unicode CES to
    > compress non-Unicode data, you are not only missing the point of
    > UTF-8 -- it's intended for ASCII transparency combined with
    > complete Unicode coverage, NOT for compression -- but you are
    > getting a rather poor general-purpose compression model to boot.

    This is essentially what I am saying: these Unicode character
    encodings should not be used for their compression properties. But
    the other side of the coin is that some folks seem to attempt
    exactly that. In that vein, one should be able to provide something
    better.
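
    (As a Python sketch of that "something better", for illustration
    only and not anything proposed in the thread: a frequency-based
    variable-length code, built here over words with the classic Huffman
    construction, so that frequent items get fewer bits and rare items
    more.)

        import heapq
        from collections import Counter

        def huffman_codes(words):
            """Map each word to a bit string; frequent words get short ones."""
            freq = Counter(words)
            if len(freq) == 1:                     # degenerate: one distinct word
                return {w: "0" for w in freq}
            # Heap entries: (frequency, tiebreaker, {word: code-so-far}).
            heap = [(n, i, {w: ""}) for i, (w, n) in enumerate(freq.items())]
            heapq.heapify(heap)
            tie = len(heap)
            while len(heap) > 1:
                n1, _, c1 = heapq.heappop(heap)    # two least frequent groups
                n2, _, c2 = heapq.heappop(heap)
                merged = {w: "0" + c for w, c in c1.items()}
                merged.update({w: "1" + c for w, c in c2.items()})
                heapq.heappush(heap, (n1 + n2, tie, merged))
                tie += 1
            return heap[0][2]

        words = "the cat sat on the mat and the dog sat too".split()
        codes = huffman_codes(words)
        bits = "".join(codes[w] for w in words)    # 'the' gets the shortest code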

    > For example, one of the most commonly mentioned beneficial features
    > of UTF-8 (maybe too commonly) is that the byte patterns allow
    > forward and backward scanning. That feature is great for text
    > processing, but not very important for compression, and it reduces
    > the number of possible N-byte combinations, which decreases
    > performance.
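
    (That scanning property is easy to demonstrate; a Python sketch,
    for illustration only: continuation bytes are exactly those matching
    the bit pattern 10xxxxxx, so one can step backwards over them
    blindly to find a character boundary.)

        def prev_char_start(buf: bytes, pos: int) -> int:
            """Step back from byte offset `pos` to the start of the
            preceding UTF-8 character, by skipping continuation bytes."""
            pos -= 1
            while pos > 0 and (buf[pos] & 0xC0) == 0x80:
                pos -= 1
            return pos

        s = "aé漢".encode("utf-8")               # 1-, 2- and 3-byte characters
        assert prev_char_start(s, len(s)) == 3   # start of '漢'
        assert prev_char_start(s, 3) == 1        # start of 'é'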

    But I think one must distinguish between classical data compression
    methods, whose only objective is to reduce data size, and
    compression formats that keep the data in them usable in various
    ways.

    > Be sure you understand the job at hand, and be careful to use the
    > right tools for the job. A hammer makes a great hammer, but a
    > lousy screwdriver.

    So this is pretty much the point. Lempel-Ziv and friends are
    probably not worth using for compressing natural language text,
    because the compression is rather poor and it makes the contents
    unusable unless unpacked first.

    The question then really is, and here the relevance to this list may
    come in, whether there are compressed formats that take care of the
    Unicode character structure, enabling efficient typical natural
    language usage.
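
    (One direction, sketched in Python purely as an illustration and
    reminiscent of what the literature calls end-tagged dense codes:
    assign each word, by descending frequency, a byte code whose final
    byte has the high bit set, UTF-8-style. The codes are then
    self-delimiting, so a word can be searched for directly in the
    compressed stream, without unpacking it first.)

        from collections import Counter

        def build_codebook(words):
            """Rank words by frequency; code bytes are 0x00-0x7F except
            the final byte, which is 0x80-0xFF (the "end tag")."""
            codes = {}
            for rank, (w, _) in enumerate(Counter(words).most_common()):
                code = bytes([0x80 | (rank & 0x7F)])   # end-tagged last byte
                rank >>= 7
                while rank:                            # base-128 "digits"
                    code = bytes([rank & 0x7F]) + code
                    rank >>= 7
                codes[w] = code
            return codes

        def compress(words, codes):
            return b"".join(codes[w] for w in words)

        def find_word(stream, code):
            """Offsets of `code` in the compressed stream, found without
            decompressing: a hit is genuine iff it starts at a code
            boundary, i.e. at offset 0 or right after an end-tagged byte."""
            hits, i = [], stream.find(code)
            while i != -1:
                if i == 0 or stream[i - 1] & 0x80:
                    hits.append(i)
                i = stream.find(code, i + 1)
            return hits

        words = "the cat sat on the mat and the dog sat too".split()
        codes = build_codebook(words)
        blob = compress(words, codes)
        print(find_word(blob, codes["sat"]))    # byte offsets of 'sat': [2, 9]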

       Hans Aberg


