Re: Unicode & space in programming & l10n

From: John D. Burger (john@mitre.org)
Date: Mon Sep 25 2006 - 12:58:16 CST

    Hans Aberg wrote:

    >> On the notion of analyzing the words in text, sorting by
    >> frequency, and assigning shorter code units to higher frequency
    >> words for compression:
    >>
    >> This is typically not worth the effort - high-frequency words
    >> perforce are more likely to occur earlier in the text, ...
    >
    > This seems to be a description of how those on-the-fly compression
    > algorithms work, rather than a description of, say, typical English
    > texts (see link below). Why would high-frequency English words
    > appear more frequently in a typical English text?

    ??? I'm assuming this tautological query was mistyped. If you meant
    to ask why high-frequency English words are likely to appear
    =earlier= in a typical text, well, for me this is almost tautological
    as well, but ...

    High-frequency words are so because they occur in many sentences, and
    thus they are likely to occur in the first few sentences of a typical
    text. These words include prepositions, pronouns, and other "stop
    words", and it's rather difficult to produce English text without
    using them. The top five most frequent words from a large corpus I
    am currently using are:

       the
       of
       and
       to
       in

    I used all five in my first sentence above.
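
    For anyone who wants to check this claim mechanically, here is a
    minimal sketch in Python. It is not part of the corpus work mentioned
    above: the helper name first_positions and the simple regex tokenizer
    are assumptions made for illustration. It reports the token index at
    which each of the five words first appears in that sentence.

```python
import re

def first_positions(text, words):
    """Map each word to the 0-based token index of its first occurrence.

    Words that never occur map to None. The tokenizer is a deliberately
    crude assumption: lowercase runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    positions = {}
    for index, token in enumerate(tokens):
        if token in words and token not in positions:
            positions[token] = index
    return {word: positions.get(word) for word in words}

# The five words listed in the message, in the order given there.
TOP_FIVE = ["the", "of", "and", "to", "in"]

# The first sentence of the explanation in the message.
SENTENCE = (
    "High-frequency words are so because they occur in many sentences, "
    "and thus they are likely to occur in the first few sentences of a "
    "typical text."
)

result = first_positions(SENTENCE, TOP_FIVE)
for word in TOP_FIVE:
    print(word, result[word])
```

    Running the same function over the opening of a longer text, with
    a word list taken from whatever corpus is at hand, gives a rough
    picture of how early the most frequent words tend to show up.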

    - John D. Burger
       MITRE


