RE: Unicode for words?

From: Hohberger, Clive (
Date: Sun Dec 05 2004 - 14:38:13 CST

    Several years ago I wrote an English language compression routine for bar codes which encoded "the", "an", "and", etc. as single-byte values. What I then discovered is that the single biggest remaining waste of codewords in English text is the spaces between words!!

    When I went back and recoded those same words with leading or trailing spaces (denoted here by "_") as "_the", "the_", "_and", "and_", etc., each as a single byte, I found a huge gain in efficiency in terms of the number of bytes needed to encode the same English text. Next, when you look at the most common word-starting letters and encode them as "_s", "_t", etc., and the most common word-terminating letters and encode them as "r_", "d_", etc., you gain additional efficiency in a 256-codeword alphabet/word encoding for English.
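To make the comparison above concrete, here is a minimal sketch in Python. It is not the original bar code routine: the function name `encode_length`, the greedy longest-match strategy, the tiny token tables, and the sample sentence are all assumptions chosen for illustration. It only counts how many one-byte codewords a text would need when every table entry costs one byte and every other character costs one byte.

```python
def encode_length(text, tokens):
    """Count the codewords used by a greedy longest-match encoder.

    Illustrative assumption: each string in `tokens` is assigned one
    single-byte codeword, and any character not covered by a token is
    emitted as one literal byte.
    """
    ordered = sorted(tokens, key=len, reverse=True)  # try longest tokens first
    i = 0
    count = 0
    while i < len(text):
        for tok in ordered:
            if text.startswith(tok, i):
                i += len(tok)  # whole token consumed by one codeword
                break
        else:
            i += 1  # no token matched here: one literal character
        count += 1
    return count


# Hypothetical sample text and token tables (not from the original post).
TEXT = "the cat and the dog sat on the mat"
PLAIN = ["the", "and"]                      # bare words only
SPACED = ["the ", " the", "and ", " and"]   # same words with the space attached

if __name__ == "__main__":
    print("raw characters:      ", encode_length(TEXT, []))
    print("bare-word codes:     ", encode_length(TEXT, PLAIN))
    print("space-attached codes:", encode_length(TEXT, SPACED))
```

On this toy sentence the space-attached table needs fewer codewords than the bare-word table because each matched token also absorbs one of the neighboring spaces. A real table would be built from word and letter frequencies over a large corpus, and would extend the same idea to entries such as "_s" or "r_".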

    What it said to me is that, from a coding-efficiency viewpoint, we need to think of a word in an alphabetic language as a sequence of letters with the space as either a prefix or a terminator character, rather than treating the space as a separator character between words represented as alphabetic strings.
    Clive Hohberger

    -----Original Message-----
    From: []On
    Behalf Of D. Starner
    Sent: Sunday, December 05, 2004 11:49 AM
    Subject: Re: Unicode for words?

    "Philippe Verdy" writes:

    > Suppose that Unicode encodes the common English words "the", "an", "is", etc... then a protocol
    > could decide that these words are not important and will filter them.

    Drop the part of the sentence before "then". A protocol could delete "the", "an", etc. right
    now. In fact, I suspect several library systems do drop "the", etc. right now. Not that this
    makes it a good idea, but that's a lousy argument.

    This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 14:44:23 CST