Re: Unicode & space in programming & l10n

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Sep 25 2006 - 17:39:43 CST

Next message: Hans Aberg: "Re: Unicode & space in programming & l10n"

Previous message: Hans Aberg: "Re: Unicode & space in programming & l10n"
In reply to: Hans Aberg: "Re: Unicode & space in programming & l10n"
Next in thread: Hans Aberg: "Re: Unicode & space in programming & l10n"
Reply: Hans Aberg: "Re: Unicode & space in programming & l10n"

Hans Aberg <haberg at math dot su dot se> wrote:

> I originally misinterpreted what you said, as in math, what you say
> would be phrased something like: high frequency words are likely to
> have occurrences in the first few sentences of the text.

That is what he said. High-frequency words are more likely to occur
everywhere within the text. That's what makes them high-frequency.

>> ...and this is clearly off-topic for the list.
>
> Perhaps, perhaps not: it might be good to clarify the wished-for
> properties of a compressed natural language text body. From time to
> time, people discussing on this list, want to use this or other
> Unicode character encoding for compressing purposes.

*red flag*

It's always dangerous to think in terms of using Unicode character
encoding schemes for compression, because:

1. if the data being compressed is not actually Unicode code points,
but merely looks like them, it is easy to misinterpret the data and to
confuse the two issues, and

2. most Unicode character encoding schemes are not intended, or
optimal, for compression.

You can certainly build a compression model that encodes frequent items
in fewer bits and rare items in more bits -- that's pretty much what all
compression methods do -- and you can apply some of the concepts
employed in UTF-8, or double-byte character sets like JIS, to help build
this model.
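
As an illustration of "frequent items in fewer bits," here is a minimal
sketch of a Huffman code built over the words of a short sample
sentence. The function name and the sample text are made up for this
example; it is one textbook way to build such a model, not a claim
about what any particular compressor does.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freq}
    # Heap entries: (weight, tiebreak, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Hypothetical sample text: "the" occurs three times, "cat" only once.
sample = "the cat and the dog and the bird".split()
lengths = huffman_code_lengths(sample)
for word, bits in sorted(lengths.items(), key=lambda kv: kv[1]):
    print(word, bits)
```

The point is only that code lengths follow the frequencies in the data
being compressed, which is something a fixed scheme like UTF-8 cannot
do.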

But as soon as you start "using UTF-8" or some other Unicode CES to
compress non-Unicode data, you are not only missing the point of
UTF-8 -- it's intended for ASCII transparency combined with complete
Unicode coverage, NOT for compression -- but you are getting a rather
poor general-purpose compression model to boot.
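
To make that concrete, here is a small sketch that stores arbitrary
integers (say, indexes into a word list, which is an assumption for
this example) first as if they were code points in UTF-8, and then in a
plain 7-bits-per-byte variable-length integer of the LEB128 kind. The
function names are invented here, and the varint is not a compression
method either; it just shows how much payload UTF-8 gives up.

```python
def utf8_len(n):
    """Bytes needed if the integer n is treated as a code point in UTF-8."""
    # chr() rejects values above 0x10FFFF, and encode() rejects
    # surrogates (0xD800..0xDFFF), so some integers cannot be stored.
    return len(chr(n).encode("utf-8"))

def varint_len(n):
    """Bytes needed for n in a 7-bits-per-byte variable-length integer."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

for n in (100, 5000, 100000):
    print(n, utf8_len(n), varint_len(n))

# Values that are perfectly good integers but not encodable as UTF-8:
for n in (0xD800, 0x110000):
    try:
        utf8_len(n)
    except (UnicodeEncodeError, ValueError) as exc:
        print(hex(n), "not encodable:", type(exc).__name__)
```

For 5000 the UTF-8 form takes 3 bytes against 2 for the varint, and for
100000 it takes 4 against 3, before any real modeling has been done.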

For example, one of the most commonly mentioned beneficial features of
UTF-8 (maybe too commonly) is that the byte patterns allow forward and
backward scanning. That feature is great for text processing, but not
very important for compression, and it reduces the number of valid
N-byte combinations, so each byte carries less information and
compression performance suffers.
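
A rough way to see the size of that reduction is to count how many
N-byte strings are well-formed UTF-8 and compare with the 256**N
strings that are possible. This is a sketch under the current
definition of UTF-8 (1 to 4 bytes, no overlong forms, no surrogates);
the function name is made up for the example.

```python
def count_valid_utf8(n_bytes):
    """Count byte strings of exactly n_bytes that are well-formed UTF-8."""
    # Number of code points that take exactly k bytes in UTF-8.
    per_length = {
        1: 0x80,                      # U+0000..U+007F
        2: 0x800 - 0x80,              # U+0080..U+07FF
        3: 0x10000 - 0x800 - 0x800,   # U+0800..U+FFFF minus surrogates
        4: 0x110000 - 0x10000,        # U+10000..U+10FFFF
    }
    counts = [1] + [0] * n_bytes      # counts[0] = 1: the empty string
    for n in range(1, n_bytes + 1):
        # A valid string of n bytes ends in one k-byte character
        # preceded by a valid string of n - k bytes.
        counts[n] = sum(c * counts[n - k]
                        for k, c in per_length.items() if k <= n)
    return counts[n_bytes]

for n in range(1, 5):
    valid = count_valid_utf8(n)
    print(n, valid, 256 ** n, round(valid / 256 ** n, 4))
```

For N = 2, only 18304 of the 65536 possible byte pairs are valid, and
the fraction keeps shrinking as N grows.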

Be sure you understand the job at hand, and be careful to use the right
tools for the job. A hammer makes a great hammer, but a lousy
screwdriver.

--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645  *  UTN #14

This archive was generated by hypermail 2.1.5 : Mon Sep 25 2006 - 17:43:25 CST