Re: Unicode & space in programming & l10n

From: Doug Ewell (
Date: Sun Sep 24 2006 - 17:46:20 CST


    John D. Burger <john at mitre dot org> wrote:

    > On the notion of analyzing the words in text, sorting by frequency,
    > and assigning shorter code units to higher frequency words for
    > compression:
    > This is typically not worth the effort - high-frequency words perforce
    > are more likely to occur earlier in the text, and thus are given short
    > code words with no such analysis needed. Moreover, not defining what
    > a "word" is lets Ziv-Lempel and friends discover subwords and
    > multi-word sequences automagically. They essentially do stemming
    > without knowing anything about language at all.
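
    To illustrate the point quoted above, here is a minimal sketch of LZ78,
    one member of the Ziv-Lempel family, in Python. It is not taken from
    either poster; the function names are made up for illustration. The
    dictionary is built on the fly from phrases already seen, so repeated
    words and word fragments get their own entries with no notion of what
    a "word" is.

```python
def lz78_compress(text):
    """Encode text as (dictionary index, next character) pairs (LZ78 sketch)."""
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            # Keep extending the current match while it is already known.
            phrase += ch
        else:
            # Emit the longest known phrase plus the character that broke it,
            # and remember the extended phrase for later reuse.
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended in the middle of a known phrase.
        output.append((dictionary[phrase], ""))
    return output


def lz78_decompress(pairs):
    """Rebuild the text by replaying the same dictionary growth."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

    On repetitive input such as "the cat the cat the cat", entries like
    "th" and "ca" appear in the dictionary after only a few repetitions,
    which is the "subword discovery" effect described in the quote.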

    This was a special-purpose project that I rolled myself, where
    compression happens only once and decompression happens repeatedly, and
    where I elected to use a simpler and lighter-weight mechanism than LZ.
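
    For readers unfamiliar with the idea, a word-frequency scheme of this
    general kind can be sketched as follows. This is an assumption-laden
    illustration in Python, not the actual project described above: the
    tokenizer, the variable-length index encoding, and all function names
    are hypothetical. Tokens are ranked by frequency once, at compression
    time, and the most frequent ones get the smallest (shortest) indexes;
    decompression is then a plain table lookup.

```python
import re
from collections import Counter

# Hypothetical tokenizer: alternating runs of word and non-word characters,
# so joining the tokens reproduces the input exactly.
TOKEN_RE = re.compile(r"\w+|\W+")


def build_codebook(text):
    """Return distinct tokens ordered from most to least frequent."""
    counts = Counter(TOKEN_RE.findall(text))
    # Ties are broken by the token itself so the order is deterministic.
    return sorted(counts, key=lambda tok: (-counts[tok], tok))


def encode_varint(n):
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress(text):
    """One-time, analysis-heavy step: build the codebook and the payload."""
    codebook = build_codebook(text)
    index = {tok: i for i, tok in enumerate(codebook)}
    payload = b"".join(encode_varint(index[t]) for t in TOKEN_RE.findall(text))
    return codebook, payload


def decompress(codebook, payload):
    """Cheap, repeatable step: decode indexes and look tokens up."""
    out = []
    n = shift = 0
    for byte in payload:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            out.append(codebook[n])
            n = shift = 0
    return "".join(out)
```

    With this layout the 128 most frequent tokens cost one byte each, and
    all of the frequency analysis is paid for once by the compressor,
    which suits a compress-once, decompress-often workload.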

    > Also remember that compression ratio is not the only figure of merit -
    > compression speed is also important.

    Point well taken. My impression is that the approach I took, for its
    limited purpose, is comparable to LZ in speed, but that's just a guess
    since I haven't profiled either one.

    Doug Ewell
    Fullerton, California, USA
    RFC 4645  *  UTN #14
