Re: Unicode & space in programming & l10n

From: John D. Burger (john@mitre.org)
Date: Fri Sep 22 2006 - 21:28:06 CDT

Next message: Philippe Verdy: "Re: Problem with SSI and BOM"

Previous message: Steve Summit: "Re: Unicode & space in programming & l10n"
In reply to: Doug Ewell: "Re: Unicode & space in programming & l10n"
Next in thread: Hans Aberg: "Re: Unicode & space in programming & l10n"
Reply: Hans Aberg: "Re: Unicode & space in programming & l10n"
Reply: Doug Ewell: "Re: Unicode & space in programming & l10n"

On the notion of analyzing the words in text, sorting by frequency,
and assigning shorter code units to higher frequency words for
compression:

This is typically not worth the effort - high-frequency words
perforce are more likely to occur earlier in the text, so an adaptive
dictionary coder gives them short code words with no such analysis
needed. Moreover, not defining what a "word" is lets Ziv-Lempel and
friends discover subwords and multi-word sequences automagically. They
essentially do stemming without knowing anything about language at
all.
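
To make the point concrete, here is a minimal sketch of LZW, one member of the Ziv-Lempel family, chosen only because it is short. The function names and the sample text are illustrative assumptions, not anything from the thread. The coder is never told what a "word" is, yet its dictionary ends up holding fragments such as "the" and longer runs that span spaces.

```python
def lzw_compress(text):
    """Return (codes, dictionary) for text using a basic LZW scheme."""
    # Seed the dictionary with every distinct character in the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    codes = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the match while the phrase is already known.
            current = candidate
        else:
            # Emit the longest known phrase, then learn the new, longer one.
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes, dictionary


def lzw_decompress(codes, alphabet):
    """Rebuild the text from LZW codes, given the same starting alphabet."""
    table = {i: ch for i, ch in enumerate(sorted(alphabet))}
    if not codes:
        return ""
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # The code refers to the phrase that is being defined right now.
            entry = previous + previous[0]
        out.append(entry)
        table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(out)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird and the fish " * 20
    codes, dictionary = lzw_compress(sample)
    learned = sorted((p for p in dictionary if len(p) > 1), key=len)
    print(len(sample), "characters ->", len(codes), "codes")
    print("longest learned phrases:", learned[-5:])
```

Phrases that recur early enter the dictionary early, so they get low-numbered codes without any frequency table being built up front.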

Also remember that compression ratio is not the only figure of merit
- compression speed is also important.
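
As a rough illustration of that trade-off, the sketch below times Python's standard zlib module (DEFLATE, which pairs LZ77 with Huffman coding) at a few compression levels. The helper name `measure` and the sample data are assumptions made for the example, and the timings will vary by machine, so treat the output as indicative only.

```python
import time
import zlib


def measure(data, level):
    """Return (compressed_size / original_size, seconds) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(compressed) / len(data), elapsed


if __name__ == "__main__":
    # Hypothetical sample input: repetitive English-like text.
    data = ("the quick brown fox jumps over the lazy dog and then naps "
            * 5000).encode("utf-8")
    for level in (1, 6, 9):
        ratio, seconds = measure(data, level)
        print(f"level {level}: ratio {ratio:.4f}, {seconds * 1000:.2f} ms")
```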

- John Burger
MITRE

This archive was generated by hypermail 2.1.5 : Fri Sep 22 2006 - 21:33:24 CDT