Re: Unicode & space in programming & l10n

From: John D. Burger (john@mitre.org)
Date: Mon Sep 25 2006 - 12:58:16 CST

Next message: Hans Aberg: "Re: Unicode & space in programming & l10n"

Previous message: Doug Ewell: "Re: what is the Unicode correspondent of character HORIZONTAL BAR from ISO/IEC 6397 ?"
In reply to: Hans Aberg: "Re: Unicode & space in programming & l10n"
Next in thread: Hans Aberg: "Re: Unicode & space in programming & l10n"
Reply: Hans Aberg: "Re: Unicode & space in programming & l10n"

Hans Aberg wrote:

>> On the notion of analyzing the words in text, sorting by
>> frequency, and assigning shorter code units to higher frequency
>> words for compression:
>>
>> This is typically not worth the effort - high-frequency words
>> perforce are more likely to occur earlier in the text, ...
>
> This seems to be a description how those on the fly compression
> algorithms works, rather than a description of say typical English
> texts (see link below). Why would high-frequency English words
> appear more frequently in a typical English text?

??? I'm assuming this tautological query was mis-typed. If you meant
to ask why high-frequency English words are likely to appear
=earlier= in a typical text, well, for me this is almost tautological
as well, but ...

High-frequency words are so because they occur in many sentences, and
thus they are likely to occur in the first few sentences of a typical
text. These words include prepositions, pronouns, and other "stop
words", and it's rather difficult to produce English text without
using them. The top five most frequent words from a large corpus I
am currently using are:

   the
   of
   and
   to
   in

I used all five in the first sentence of the paragraph above.
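
For anyone who wants to check this on their own text, here is a
minimal sketch in Python. The function names, the sample text, and the
crude regex-based sentence splitting and tokenization are all
assumptions made for illustration; this is not the tooling or corpus
behind the list above. It reports the index of the first sentence in
which each word appears.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters or apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def top_words(text, k):
    """Return the k most frequent words in text, most frequent first."""
    counts = Counter(WORD_RE.findall(text.lower()))
    return [word for word, _ in counts.most_common(k)]


def first_sentence_index(text, words):
    """Map each word to the 0-based index of the first sentence that
    contains it, or None if the word never occurs."""
    # Assumption: sentences end at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokenized = [set(WORD_RE.findall(s.lower())) for s in sentences]
    return {
        w: next((i for i, toks in enumerate(tokenized) if w in toks), None)
        for w in words
    }


# Sample text taken from the paragraph above (hypothetical test input).
SAMPLE = (
    "High-frequency words are so because they occur in many sentences, "
    "and thus they are likely to occur in the first few sentences of a "
    "typical text. These words include prepositions, pronouns, and other "
    "stop words."
)
STOP_WORDS = ["the", "of", "and", "to", "in"]

positions = first_sentence_index(SAMPLE, STOP_WORDS)
for word in STOP_WORDS:
    print(word, positions[word])
```

On a larger text, top_words(text, 5) could supply the word list
instead of the hard-coded one, which gives a rough way to see how
early the most frequent words actually turn up.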

- John D. Burger
MITRE

