From: Hans Aberg (email@example.com)
Date: Mon Sep 25 2006 - 16:40:36 CST
On 25 Sep 2006, at 23:05, John D. Burger wrote:
>>> High-frequency words are so because they occur in many sentences,
>>> and thus they are likely to occur in the first few sentences of a
>>> typical text.
>> ??? But they appear later in the sentences as well, I would gather.
> Um ... but then they have already been assigned the first (short)
> code-words for compression. I am clearly not getting through, ...
I originally misinterpreted what you said. In math, what you say
would be phrased something like: high-frequency words are likely to
have occurrences in the first few sentences of the text.
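To make that phrasing concrete, here is a minimal Python sketch,
assuming a scheme where code-words are simply handed out in order of
first appearance. The function name, the tokenizer and the sample text
are my own illustrations, not anything proposed in this thread.

```python
import re
from collections import Counter


def first_occurrence_codes(text):
    """Assign each distinct word an integer code in order of first appearance."""
    codes = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in codes:
            codes[word] = len(codes)  # earlier words get smaller codes
    return codes


# Hypothetical sample text, chosen only for illustration.
sample = ("The cat sat on the mat. The dog saw the cat. "
          "A bird flew over the dog and the mat.")

codes = first_occurrence_codes(sample)
counts = Counter(re.findall(r"[a-z']+", sample.lower()))
most_common_word, _ = counts.most_common(1)[0]

# The most frequent word tends to show up early, so it gets a small code.
print(most_common_word, codes[most_common_word])
```

Whether small codes also mean short code-words depends on the actual
encoding of the integers, which this sketch leaves open.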
> ...and this is clearly off-topic for the list.
Perhaps, perhaps not: it might be good to clarify the wished-for
properties of a compressed natural language text body. From time to
time, people on this list want to use one Unicode character encoding
or another for compression purposes.
Requirements I can think of: incremental compression, fast
readability, and the ability to scan and search without decompressing,
for different types of searches. For example, the available net
searches are often too crude to be linguistically useful.
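As a rough illustration of two of these requirements (incremental
encoding, and searching without decompressing), here is a minimal
Python sketch assuming a simple word-level code. The helper names
tokenize, encode and find_phrase are hypothetical, and a list of
integers is not yet a compact byte format.

```python
import re


def tokenize(text):
    """Split text into lowercase words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def encode(text, codes):
    """Encode text as word codes, adding unseen words to `codes` as they appear."""
    return [codes.setdefault(word, len(codes)) for word in tokenize(text)]


def find_phrase(encoded, phrase, codes):
    """Return start positions of `phrase` in the encoded text, without decoding it."""
    words = tokenize(phrase)
    if not words or any(word not in codes for word in words):
        return []  # a word never seen cannot occur in the encoded text
    needle = [codes[word] for word in words]
    n = len(needle)
    return [i for i in range(len(encoded) - n + 1) if encoded[i:i + n] == needle]


codes = {}
encoded = encode("the cat sat on the mat and the cat slept", codes)
hits = find_phrase(encoded, "the cat", codes)

# Incremental: more text can be appended later with the same code table.
encoded += encode("then the dog slept", codes)
```

The query is translated into codes once, and the scan then compares
integers only. Linguistically richer searches (inflections, synonyms)
would need more structure than this.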
A book says that, say, 'compress', which uses LZW
<http://en.wikipedia.org/wiki/LZW>, only gives 50-60% compression on
1 MB of English text, and today's disk space is so cheap that it may
not be worth the effort. The given link says: "The algorithm is
designed to be fast to implement but not necessarily optimal since it
does not perform any analysis on the data."
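For reference, the core of LZW fits in a few lines. The following is
a minimal Python sketch of the textbook algorithm, assuming an
unbounded dictionary and a plain list of integer codes as output. It
does not reproduce the file format or the code-width handling of
'compress', and the function names are my own.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Textbook LZW: emit one code per longest already-seen byte string."""
    table = {bytes([i]): i for i in range(256)}  # start with all single bytes
    current = b""
    out = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate              # keep extending the match
        else:
            out.append(table[current])       # emit code for the longest match
            table[candidate] = len(table)    # learn the new string
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out


def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the same table on the fly and invert lzw_compress."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code refers to the string being defined in this very step.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


text = b"the cat sat on the mat and the cat sat on the hat " * 20
packed = lzw_compress(text)
restored = lzw_decompress(packed)
```

Note that the table is built purely from what has been seen so far,
with no prior analysis of the data, which matches the quoted remark.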
So it is not clear to me that compressing natural language text
bodies using standard computer data compression tools will fulfill
the needs of typical usage.