Re: Unicode & space in programming & l10n

From: Hans Aberg (
Date: Mon Sep 25 2006 - 16:40:36 CST

    On 25 Sep 2006, at 23:05, John D. Burger wrote:

    >>> High-frequency words are so because they occur in many sentences,
    >>> and thus they are likely to occur in the first few sentences of a
    >>> typical text.
    >> ??? But they appear later in the sentences as well, I would gather.
    > Um ... but then they have already been assigned the first (short)
    > code-words for compression. I am clearly not getting through, ...

    I originally misinterpreted what you said; in math, the statement
    would be phrased something like: high-frequency words are likely to
    have occurrences in the first few sentences of the text.

    > ...and this is clearly off-topic for the list.

    Perhaps, perhaps not: it might be good to clarify the desired
    properties of a compressed natural-language text body. From time to
    time, people on this list want to use this or that Unicode character
    encoding for compression purposes.

    Requirements I can think of: incremental compression, fast
    readability, and the ability to scan and search without decompressing,
    for different types of searches. For example, the net searches
    currently available are often too crude to be linguistically useful.
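
    To make these requirements concrete, here is a minimal sketch of a
    word-level coder. It is only an illustration under stated
    assumptions, not a proposal from this thread: the class name
    WordCoder and its methods are hypothetical, words are lowercased,
    whitespace is discarded, and the output is a list of integers rather
    than a packed bit stream. Codes are assigned in order of first
    occurrence, so text can be appended incrementally, and a word search
    runs on the code stream itself.

```python
import re


class WordCoder:
    """Hypothetical sketch: integer codes assigned by first occurrence."""

    def __init__(self):
        self.code_of = {}   # word -> code
        self.word_of = []   # code -> word

    def encode(self, text):
        """Encode more text; codes assigned earlier never change."""
        codes = []
        # Assumption: a token is a run of word characters or a single
        # punctuation mark; case and whitespace are not preserved.
        for word in re.findall(r"\w+|[^\w\s]", text.lower()):
            if word not in self.code_of:
                self.code_of[word] = len(self.word_of)
                self.word_of.append(word)
            codes.append(self.code_of[word])
        return codes

    def find(self, codes, word):
        """Return positions of a word in the code stream, without decoding."""
        code = self.code_of.get(word.lower())
        if code is None:
            return []
        return [i for i, c in enumerate(codes) if c == code]

    def decode(self, codes):
        """Rebuild the (lowercased, space-joined) token sequence."""
        return " ".join(self.word_of[c] for c in codes)


coder = WordCoder()
stream = coder.encode("The cat saw the dog.")
stream += coder.encode("The dog ran.")      # incremental: appended later
print(stream)
print(coder.find(stream, "dog"))            # search on the codes directly
```

    Words that appear early receive small codes, so a real scheme could
    map small codes to short bit strings. This sketch stops short of
    that step and does not by itself reduce the size of the text.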

    A book says that, for example, 'compress', which uses LZW <http://>,
    only gives 50-60% compression on 1 MB of English text, and today's
    disk space is so cheap that it may not be worth the effort. The given
    link says: "The algorithm is designed to
    be fast to implement but not necessarily optimal since it does not
    perform any analysis on the data."
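
    For reference, a textbook LZW coder can be sketched as below. This
    is not the 'compress' file format: the function names are
    hypothetical, the output is a plain list of integer codes rather
    than packed variable-width codes, and the dictionary grows without
    limit. It only shows that the dictionary is built from the data as
    it is read, with no prior analysis.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Textbook LZW: emit the code of the longest known prefix."""
    table = {bytes([i]): i for i in range(256)}  # single-byte entries
    out = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            out.append(table[current])
            table[candidate] = len(table)        # learn the new string
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out


def lzw_decode(codes: list[int]) -> bytes:
    """Inverse of lzw_encode; rebuilds the same table while decoding."""
    if not codes:
        return b""
    table = [bytes([i]) for i in range(256)]
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        else:
            # The code refers to the entry that is being defined right now.
            entry = previous + previous[:1]
        out.append(entry)
        table.append(previous + entry[:1])
        previous = entry
    return b"".join(out)


sample = b"the cat and the dog and the cat " * 50
codes = lzw_encode(sample)
assert lzw_decode(codes) == sample
print(len(sample), "bytes ->", len(codes), "codes")
```

    The code count alone overstates the saving, since each code needs
    more than 8 bits once the table has grown past 256 entries.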

    So it is not clear to me that compressing natural language text
    bodies using standard computer data compression tools will fulfill
    the needs of typical usage.

       Hans Aberg
