RE: Unicode for words?

From: Hohberger, Clive ([email protected])
Date: Sun Dec 05 2004 - 14:38:13 CST

Next message: Philippe Verdy: "Re: Unicode for words?"

Previous message: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"
Maybe in reply to: Tim Finney: "Unicode for words?"
Next in thread: Doug Ewell: "Re: Unicode for words?"
Reply: Doug Ewell: "Re: Unicode for words?"

Several years ago I wrote an English-language compression routine for bar codes which encoded "the", "an", "and", etc. as single-byte values. What I then discovered is that the biggest single remaining waste of codewords in English text is the spaces between words!

When I went back and recoded those same words with leading or trailing spaces (denoted here by "_") as single bytes, i.e. "_the", "the_", "_and", "and_", etc., I found a huge gain in efficiency, measured as the number of bytes needed to encode the same English text. Next, if you take the most common word-initial letters and encode them as "_s", "_t", etc., and the most common word-final letters and encode them as "r_", "d_", etc., you gain additional efficiency in a 256-codeword alphabet/word encoding for English.
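
A minimal sketch of the idea above, in Python. It is not the original bar code routine: the token lists, the sample sentence, and the choice of byte values 128 and up for word codewords are all assumptions made for illustration, and the input is assumed to be plain ASCII. It encodes the same text twice, once with bare word tokens and once with the space folded into the token, and compares the byte counts.

```python
def build_codebook(tokens):
    """Assign each token one byte value from 128 upward (hypothetical layout)."""
    if len(tokens) > 128:
        raise ValueError("only 128 spare codewords above ASCII")
    return {tok: 128 + i for i, tok in enumerate(tokens)}


def encode(text, codebook):
    """Greedy longest-match encoding; uncovered ASCII characters pass through."""
    by_length = sorted(codebook, key=len, reverse=True)
    out = bytearray()
    i = 0
    while i < len(text):
        for tok in by_length:
            if text.startswith(tok, i):
                out.append(codebook[tok])
                i += len(tok)
                break
        else:
            out.append(ord(text[i]))  # assumes plain ASCII input (< 128)
            i += 1
    return bytes(out)


def decode(data, codebook):
    """Invert encode(): codeword bytes expand to tokens, the rest are literals."""
    reverse = {code: tok for tok, code in codebook.items()}
    return "".join(reverse.get(b, chr(b)) for b in data)


# Illustrative sample text and token lists, not taken from the original routine.
TEXT = "the cat and the dog sat on the mat and the hat"
PLAIN_BOOK = build_codebook(["the", "and"])
SPACED_BOOK = build_codebook(["the ", " the", "and ", " and"])

plain = encode(TEXT, PLAIN_BOOK)
spaced = encode(TEXT, SPACED_BOOK)

print(len(TEXT), len(plain), len(spaced))
```

On this sample the space-bearing tokens produce the shorter encoding, because each matched word also absorbs one adjacent space. Letter-plus-space pairs such as "_s" or "r_" could be added to the same table in the same way.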

What this told me is that, from a coding-efficiency viewpoint, we need to think of a word in an alphabetic language as a sequence of letters with the space as either a prefix or a terminator character, rather than treating the space as a separator between words represented as alphabetic strings.
Clive Hohberger

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of D. Starner
Sent: Sunday, December 05, 2004 11:49 AM
To: [email protected]
Subject: Re: Unicode for words?

"Philippe Verdy" writes:

> Suppose that Unicode encodes the common English words "the", "an", "is", etc... then a protocol
> could decide that these words are not important and will filter them.

Drop the part of the sentence before "then". A protocol could delete "the", "an", etc. right
now. In fact, I suspect several library systems do drop "the", etc. right now. Not that this
makes it a good idea, but that's a lousy argument.



This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 14:44:23 CST