From: Hohberger, Clive (CHohberger@zebra.com)
Date: Sun Dec 05 2004 - 14:38:13 CST
Several years ago I wrote an English language compression routine for bar codes which encoded "the", "an", "and", etc. as single byte values. What I then discovered is that the biggest remaining single waste of codewords in English text is the spaces between words!!
When I went back and recoded those same words with leading or trailing spaces (denoted here by "_") as "_the", "the_", "_and", "and_", etc., each as a single byte, I found a huge gain in efficiency in terms of the number of bytes needed to encode the same English text. Next, when you look at the most common word-starting letters and encode them as "_s", "_t", etc., and the most common word-terminating letters and encode them as "r_", "d_", etc., you gain additional efficiency in a 256-codeword alphabet/word encoding for English.
What it said to me is that, from a coding efficiency viewpoint, we need to think of words in an alphabetic language as a sequence of letters with the space as either a prefix or a terminator character, rather than treating the space as a separator character between words represented as alphabetic strings.
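The idea above can be illustrated with a small sketch. This is not the original bar code routine: the token lists, the code assignment (values from 128 upward), the greedy longest-match strategy, and the function names are all assumptions made for illustration, and the input is assumed to be plain ASCII. It only shows why tokens that carry their own leading or trailing space can shorten the output.

```python
# Hypothetical token tables; the real routine's tables are not given in the message.
PLAIN_TOKENS = ["the", "and", "an"]
SPACED_TOKENS = [" the", "the ", " and", "and ", " s", " t", "r ", "d "]


def build_table(tokens):
    """Assign each multi-character token one code above the ASCII range."""
    return {tok: 128 + i for i, tok in enumerate(tokens)}


def encode(text, table):
    """Greedy longest-match: emit one code per token, else one per character."""
    max_len = max((len(t) for t in table), default=1)
    out = []
    i = 0
    while i < len(text):
        for n in range(max_len, 1, -1):
            piece = text[i:i + n]
            if piece in table:
                out.append(table[piece])
                i += len(piece)
                break
        else:
            # No token starts here, so fall back to the literal ASCII character.
            out.append(ord(text[i]))
            i += 1
    return out


def decode(codes, table):
    """Invert encode(): codes found in the table expand to their tokens."""
    reverse = {code: tok for tok, code in table.items()}
    return "".join(reverse[c] if c in reverse else chr(c) for c in codes)


if __name__ == "__main__":
    sample = "the cat and the dog sat under the tree"
    for name, tokens in (("plain", PLAIN_TOKENS), ("spaced", SPACED_TOKENS)):
        table = build_table(tokens)
        codes = encode(sample, table)
        assert decode(codes, table) == sample
        print(name, len(sample), "chars ->", len(codes), "codes")
```

With a table like this, every space absorbed into a token is one codeword that no longer has to be spent on a separator, which is the effect described above.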
From: email@example.com [mailto:firstname.lastname@example.org] On Behalf Of D. Starner
Sent: Sunday, December 05, 2004 11:49 AM
Subject: Re: Unicode for words?
"Philippe Verdy" writes:
> Suppose that Unicode encodes the common English words "the", "an", "is", etc... then a protocol
> could decide that these words are not important and will filter them.
Drop the part of the sentence before "then". A protocol could delete "the", "an", etc. right
now. In fact, I suspect several library systems do drop "the", etc. right now. Not that this
makes it a good idea, but that's a lousy argument.
This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 14:44:23 CST