Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Mon Sep 25 2006 - 14:16:13 CST

Next message: John D. Burger: "Re: Unicode & space in programming & l10n"

Previous message: John D. Burger: "Re: Unicode & space in programming & l10n"
In reply to: John D. Burger: "Re: Unicode & space in programming & l10n"
Next in thread: John D. Burger: "Re: Unicode & space in programming & l10n"
Reply: John D. Burger: "Re: Unicode & space in programming & l10n"

On 25 Sep 2006, at 20:58, John D. Burger wrote:

> Hans Aberg wrote:
>
>>> On the notion of analyzing the words in text, sorting by
>>> frequency, and assigning shorter code units to higher frequency
>>> words for compression:
>>>
>>> This is typically not worth the effort - high-frequency words
>>> perforce are more likely to occur earlier in the text, ...
>>
>> This seems to be a description of how those on-the-fly compression
>> algorithms work, rather than a description of, say, typical English
>> texts (see link below). Why would high-frequency English words
>> appear more frequently in a typical English text?
>
> ??? I'm assuming this tautological query was mis-typed. If you
> meant to ask why high-frequency English words are likely to appear
> =earlier= in a typical text, well, for me this is almost
> tautological as well, but ...
>
> High-frequency words are so because they occur in many sentences,
> and thus they are likely to occur in the first few sentences of a
> typical text.

??? But they appear later in the sentences as well, I would gather.

> These words include prepositions, pronouns, and other "stop words",
> and it's rather difficult to produce English text without using
> them. The top five most frequent words from a large corpus I am
> currently using are:
>
> the
> of
> and
> to
> in
>
> I used all five in my first sentence above.

And how do you know which words are the more frequent ones merely by
looking at the first few sentences? And if one collects a
considerable number of them, the most frequent words would not even
fit into the first few sentences.
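
One way to check both claims on an actual text is to count word
frequencies and record where each word first appears. The following is
only a rough sketch under simple assumptions: the tokenizer is a naive
regular expression, and the names word_stats, top_words_in_prefix,
top_n and prefix_len are made up for illustration, not taken from any
compression tool.

```python
import re
from collections import Counter


def word_stats(text):
    """Return (frequency counter, first-occurrence index per word)."""
    # Naive tokenization (an assumption): lowercase runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    first_seen = {}
    for position, word in enumerate(words):
        # Keep only the earliest position at which each word occurs.
        first_seen.setdefault(word, position)
    return counts, first_seen


def top_words_in_prefix(text, top_n=5, prefix_len=50):
    """Report, for each of the top_n most frequent words, whether it
    first occurs within the first prefix_len words of the text."""
    counts, first_seen = word_stats(text)
    top = [word for word, _ in counts.most_common(top_n)]
    return {word: first_seen[word] < prefix_len for word in top}
```

Running top_words_in_prefix over a corpus with a growing top_n shows how
many of the frequent words actually fit into a short opening passage.
Note that the frequency ranking itself still comes from the whole text,
not from the prefix.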

Hans Aberg

This archive was generated by hypermail 2.1.5 : Mon Sep 25 2006 - 14:20:10 CST