Re: Unicode & space in programming & l10n

From: John D. Burger (john@mitre.org)
Date: Fri Sep 22 2006 - 21:28:06 CDT

    On the notion of analyzing the words in text, sorting by frequency,
    and assigning shorter code units to higher frequency words for
    compression:

    This is typically not worth the effort. In an adaptive scheme,
    high-frequency words are perforce more likely to occur early in the
    text, so they pick up short code words without any separate
    frequency analysis. Moreover, not fixing in advance what a "word" is
    lets Ziv-Lempel and friends discover subwords and multi-word
    sequences automagically. In effect they do something like stemming
    without knowing anything about language at all.
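
    As a rough illustration of that last point, here is a minimal
    sketch of an LZ78-style parse (one member of the Ziv-Lempel
    family). The function names are made up for this example, and it
    leaves out the bit-level encoding a real compressor would do. It
    only shows how the phrase dictionary grows from single characters
    to longer repeated substrings with no notion of a "word" built in.

```python
def lz78_phrases(text):
    """Parse text LZ78-style.

    Returns (pairs, dictionary), where each pair is
    (index of longest known prefix, next character) and dictionary maps
    every phrase seen so far to its index. Index 0 is the empty phrase.
    """
    dictionary = {"": 0}
    pairs = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the match while the phrase is already known.
            current = candidate
        else:
            # Emit (known prefix, new character) and learn the new phrase.
            pairs.append((dictionary[current], ch))
            dictionary[candidate] = len(dictionary)
            current = ""
    if current:
        # Input ended in the middle of a known phrase; flush it.
        pairs.append((dictionary[current], ""))
    return pairs, dictionary


def lz78_decode(pairs):
    """Rebuild the original text from the (index, character) pairs."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


if __name__ == "__main__":
    sample = "the cat sat on the mat, the cat sat on the hat. " * 20
    pairs, dictionary = lz78_phrases(sample)
    assert lz78_decode(pairs) == sample
    # Show the longest phrases the parse discovered on its own.
    print(sorted(dictionary, key=len)[-5:])
```

    On repetitive input the longest dictionary entries tend to be
    fragments that span word boundaries or stop inside a word, which
    is the "subwords and multi-word sequences" effect described above.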

    Also remember that compression ratio is not the only figure of merit
    - compression speed is also important.
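
    One way to see the trade-off is to time a standard compressor at
    several effort levels. The sketch below uses Python's zlib module
    (DEFLATE, a Ziv-Lempel relative) purely as a stand-in. The helper
    name and the sample text are made up for this example, and the
    timings will vary from machine to machine.

```python
import time
import zlib


def measure(data, levels=(1, 6, 9)):
    """Return a list of (level, compressed/original size ratio, seconds)."""
    results = []
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results.append((level, len(compressed) / len(data), elapsed))
    return results


if __name__ == "__main__":
    # Hypothetical sample input; substitute any real text file's bytes.
    sample = b"the quick brown fox jumps over the lazy dog. " * 20000
    for level, ratio, seconds in measure(sample):
        print(f"level={level} ratio={ratio:.4f} time={seconds * 1000:.2f} ms")
```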

    - John Burger
       MITRE


