From: Doug Ewell (dewell@adelphia.net)
Date: Mon Sep 25 2006 - 17:39:43 CST
Hans Aberg <haberg at math dot su dot se> wrote:
> I originally misinterpreted what you said, as in math, what you say 
> would be phrased something like: high frequency words are likely to 
> have occurrences in the first few sentences of the text.
That is what he said.  High-frequency words are more likely to occur 
everywhere within the text.  That's what makes them high-frequency.
>> ...and this is clearly off-topic for the list.
>
> Perhaps, perhaps not: it might be good to clarify the wished-for 
> properties of a compressed natural language text body. From time to 
> time, people discussing on this list, want to use this or other 
> Unicode character encoding for compressing purposes.
*red flag*
It's always dangerous to think in terms of using Unicode character 
encoding schemes for compression, because:
1.  if the data being compressed is not a sequence of Unicode code 
points, but looks like one, there is a chance of misinterpreting the 
data and confusing the two issues, and
2.  most Unicode character encoding schemes are not intended, or 
optimal, for compression.
You can certainly build a compression model that encodes frequent items 
in fewer bits and rare items in more bits -- that's pretty much what all 
compression methods do -- and you can apply some of the concepts 
employed in UTF-8, or double-byte character sets like JIS, to help build 
this model.
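
To make "frequent items in fewer bits" concrete, here is a minimal sketch 
of one classic way to do it, a Huffman code over whole words.  The 
function name and the sample sentence are made up for illustration; 
nothing here is specific to Unicode or to any method discussed in this 
thread.

```python
import heapq
from collections import Counter

def huffman_code_lengths(items):
    """Return {item: code length in bits} for a Huffman code over items."""
    freq = Counter(items)
    if len(freq) == 1:
        # A single distinct item still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (total weight, unique tie-breaker, {item: depth so far}).
    heap = [(w, i, {item: 0}) for i, (item, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every item in them moves one level deeper.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {item: depth + 1 for item, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical sample text, split into words.
words = "the cat and the dog and the bird saw the cat".split()
lengths = huffman_code_lengths(words)
for word, bits in sorted(lengths.items(), key=lambda kv: kv[1]):
    print(word, bits)
```

The code lengths fall out of the observed frequencies alone, which is 
the sense in which the model is built for the data rather than borrowed 
from a character encoding.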
But as soon as you start "using UTF-8" or some other Unicode CES to 
compress non-Unicode data, you are not only missing the point of 
UTF-8 -- it's intended for ASCII transparency combined with complete 
Unicode coverage, NOT for compression -- but you are getting a rather 
poor general-purpose compression model to boot.
For example, one of the most commonly mentioned beneficial features of 
UTF-8 (maybe too commonly) is that the byte patterns allow forward and 
backward scanning.  That feature is great for text processing, but not 
very important for compression, and it reduces the number of valid 
N-byte combinations, which wastes coding space and hurts compression.
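
A rough way to see the cost is to count how many of the 256^N possible 
N-byte strings are well-formed UTF-8.  The sketch below does that with 
Python's built-in encoder; the function names are my own, and the 
counting rule (any concatenation of well-formed sequences, surrogates 
excluded) is an assumption about what "possible combinations" should 
mean here.

```python
def utf8_length_counts():
    """Count Unicode scalar values by the number of bytes UTF-8 uses for them."""
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values and cannot be encoded
        counts[len(chr(cp).encode("utf-8"))] += 1
    return counts

def wellformed_strings(n, counts):
    """Number of distinct well-formed UTF-8 byte strings of exactly n bytes."""
    # f[i] = number of well-formed strings of i bytes; the last character
    # occupies k bytes, and the i - k bytes before it are again well-formed.
    f = [1] + [0] * n
    for i in range(1, n + 1):
        f[i] = sum(counts[k] * f[i - k] for k in counts if k <= i)
    return f[n]

counts = utf8_length_counts()
for n in range(1, 5):
    usable = wellformed_strings(n, counts)
    total = 256 ** n
    print(n, usable, total, round(usable / total, 4))
```

Already at one byte only half of the 256 values are usable, and the 
usable fraction keeps shrinking as N grows.  That is the price of the 
self-synchronizing byte patterns, and a format designed for compression 
would not pay it.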
Be sure you understand the job at hand, and be careful to use the right 
tools for the job.  A hammer makes a great hammer, but a lousy 
screwdriver.
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645  *  UTN #14