Re: Unicode & space in programming & l10n

From: Hans Aberg
Date: Sat Sep 23 2006 - 07:28:21 CDT

    On 23 Sep 2006, at 04:28, John D. Burger wrote:

    > On the notion of analyzing the words in text, sorting by frequency,
    > and assigning shorter code units to higher frequency words for
    > compression:
    > This is typically not worth the effort - high-frequency words
    > perforce are more likely to occur earlier in the text, ...

    This seems to describe how those on-the-fly compression algorithms
    work, rather than, say, typical English texts (see link below). Why
    would high-frequency English words appear more frequently in a
    typical English text?
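
    To make the two-pass idea under discussion concrete, here is a minimal
    sketch, not anyone's actual implementation: count the words, rank them
    by frequency, and give the most frequent words the smallest numbers,
    which a variable-length integer encoding then stores in the fewest
    bytes. The names build_word_codes, varint and encode are hypothetical,
    and the tokenisation (lowercased \w+ runs) is an assumption that
    discards case, spacing and punctuation, so this is lossy as written.

```python
import re
from collections import Counter


def build_word_codes(text):
    """First pass: rank words so that frequent words get smaller numbers."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency; ties are broken alphabetically
    # so that the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked)}


def varint(n):
    """Encode a non-negative integer in 7-bit groups (small n, fewer bytes)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode(text, codes):
    """Second pass: replace each word by the varint of its rank."""
    words = re.findall(r"\w+", text.lower())
    return b"".join(varint(codes[w]) for w in words)


sample = "the cat and the dog and the bird"
codes = build_word_codes(sample)
encoded = encode(sample, codes)
```

    The table in codes has to be stored or transmitted alongside the
    encoded data, which is part of the cost of doing the analysis up front.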

    > ...and thus are given short code words with no such analysis
    > needed. Moreover, not defining what a "word" is lets Ziv-Lempel
    > and friends discover subwords and multi-word sequences
    > automagically. They essentially do stemming without knowing
    > anything about language at all.

    And <> says:
       The algorithm is designed to be fast to implement but not
    necessarily optimal since it does not perform any analysis on the data.

    So they work that way for reasons other than obtaining efficient
    compression.
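
    For reference, the quoted remark about Ziv-Lempel finding subwords and
    multi-word sequences on its own can be seen in a small LZ78-style
    sketch (LZ78 rather than LZW, chosen here only because it is short).
    The function names are hypothetical. Nothing in it knows what a word
    is: the dictionary just grows by one character each time a known
    phrase is seen again.

```python
def lz78_compress(text):
    """Emit (phrase_index, next_char) pairs; index 0 is the empty phrase."""
    dictionary = {"": 0}
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            # Keep extending the match while it is already known.
            phrase += ch
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Flush a trailing phrase that is already in the dictionary.
        # Its prefix is known too, because every entry was built by
        # extending an earlier entry by one character.
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output, dictionary


def lz78_decompress(pairs):
    """Rebuild the text by replaying the same dictionary growth."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)


sample = "the cat and the hat and the bat " * 20
pairs, dictionary = lz78_compress(sample)
```

    Printing sorted(dictionary, key=len)[-5:] after a run shows the longest
    phrases learned so far, which on repetitive text span several words.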

    > Also remember that compression ratio is not the only figure of
    > merit - compression speed is also important.

    Well, one type of application I have in mind is very large linguistic
    databases - compressing the whole of Wikipedia was one example.

    So at least in some circumstances, the main interest will be in having
    a database that is fairly compact and fast to read and search.
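
    One common way to trade a little compactness for fast reads, given
    here only as a rough sketch and not as a proposal for any particular
    database, is to compress records in small independent blocks so that
    fetching one record means decompressing one block rather than the
    whole file. The names build_blocks and read_record and the block size
    of 4 are hypothetical, and records are assumed to contain no newline.

```python
import zlib


def build_blocks(records, block_size=4):
    """Compress records in independent blocks of block_size records each."""
    blocks = []
    for start in range(0, len(records), block_size):
        chunk = "\n".join(records[start:start + block_size])
        blocks.append(zlib.compress(chunk.encode("utf-8")))
    return blocks


def read_record(blocks, index, block_size=4):
    """Fetch one record by decompressing only the block that holds it."""
    block = zlib.decompress(blocks[index // block_size]).decode("utf-8")
    return block.split("\n")[index % block_size]


records = [f"article {i}: some text about topic {i % 3}" for i in range(10)]
blocks = build_blocks(records)
```

    Smaller blocks make each lookup cheaper but give the compressor less
    context, so the overall ratio gets worse.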

    And there isn't one compression algorithm that will fit all needs.

       Hans Aberg
