From: John D. Burger (john@mitre.org)
Date: Mon Sep 25 2006 - 12:58:16 CST
Hans Aberg wrote:
>> On the notion of analyzing the words in text, sorting by  
>> frequency, and assigning shorter code units to higher frequency  
>> words for compression:
>>
>> This is typically not worth the effort - high-frequency words  
>> perforce are more likely to occur earlier in the text, ...
>
> This seems to be a description of how those on-the-fly compression
> algorithms work, rather than a description of, say, typical English
> texts (see link below). Why would high-frequency English words
> appear more frequently in a typical English text?
??? I'm assuming this tautological query was mistyped.  If you meant
to ask why high-frequency English words are likely to appear
=earlier= in a typical text, well, for me this is almost tautological
as well, but ...
High-frequency words are frequent precisely because they occur in many
sentences, and thus they are likely to occur in the first few sentences
of a typical text.  These words include prepositions, pronouns, and
other "stop words", and it's rather difficult to produce English text
without using them.  The five most frequent words in a large corpus I
am currently using are:
   the
   of
   and
   to
   in
I used all five in my first sentence above.
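For anyone who wants to check this on their own text, here is a minimal Python sketch of the idea: count word frequencies, then record the token position at which each word of interest first appears. The function names, the simple regex tokenizer, and the sample sentence are my own assumptions for illustration, not anything taken from the corpus or tools mentioned above.

```python
import re
from collections import Counter

# The five words listed above.
TOP_FIVE = ["the", "of", "and", "to", "in"]

# Sample text (an assumption for illustration): one sentence about word frequency.
SAMPLE = (
    "High-frequency words are so because they occur in many sentences, and "
    "thus they are likely to occur in the first few sentences of a typical text."
)


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(tokenize(text)).most_common(n)


def first_occurrences(text, targets):
    """Map each target word to the 0-based token index where it first appears.

    Target words that never occur are simply absent from the result.
    """
    wanted = set(targets)
    first = {}
    for index, word in enumerate(tokenize(text)):
        if word in wanted and word not in first:
            first[word] = index
    return first


if __name__ == "__main__":
    positions = first_occurrences(SAMPLE, TOP_FIVE)
    for word in TOP_FIVE:
        print(word, positions.get(word, "not found"))
```

On a real corpus one would presumably run top_words over the whole collection and first_occurrences over each document, then look at how early the frequent words tend to show up.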
- John D. Burger
   MITRE