Re: Unicode for words?

From: Doug Ewell
Date: Sun Dec 05 2004 - 21:02:15 CST

    Hohberger, Clive <CHohberger at zebra dot com> wrote:

    > When I went back and recoded those same words with leading or trailing
    > spaces (denoted here by "_") as: "_the", "the_", "_and", "and_", etc.
    > as single bytes, I found a huge gain in efficiency in terms of the
    > number of bytes to encode the same English text. Next, when you look
    > at the most common word starting letters and encode them as "_s" and
    > "_t", etc., and the most common word terminator letters and encode
    > them as "r_", "d_", etc., you gain additional efficiency in a 256-
    > codeword alphabet/word encoding for English.
    > What it said to me is that, from a coding efficiency viewpoint,
    > we need to think of words in an alphabetic language as a sequence of
    > letters with the space as either a prefix or terminator character,
    > rather than the space as a separator character between words
    > represented as alphabetic strings.

    A word-based encoding for English could automatically assume spaces
    where they are appropriate. The sentence:

    "What means this, my lord?"

    would have seven encodable elements: the five words, the comma, and the
    question mark. Spaces would be automatically filled in as needed, not
    explicitly encoded. This presumes "standard" English punctuation and
    spacing conventions, however those are defined. For French conventions,
    there would probably be a space before the question mark as well.
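
    As a rough illustration of the implied-spacing idea (a minimal sketch
    only; the names NO_SPACE_BEFORE and join_elements are made up here, and
    the punctuation set is an assumption about "standard" English
    conventions), a decoder could insert a space before every element
    except punctuation that attaches to the preceding word:

```python
# Hypothetical rule set: punctuation that attaches to the preceding
# element, so no space is filled in before it (English conventions assumed).
NO_SPACE_BEFORE = {",", ".", "?", "!", ";", ":"}


def join_elements(elements, no_space_before=NO_SPACE_BEFORE):
    """Rebuild text from encodable elements, filling in implied spaces."""
    out = []
    for i, element in enumerate(elements):
        # A space is implied before every element except the first one
        # and except punctuation listed in no_space_before.
        if i > 0 and element not in no_space_before:
            out.append(" ")
        out.append(element)
    return "".join(out)


# The seven encodable elements of the example sentence.
elements = ["What", "means", "this", ",", "my", "lord", "?"]

english = join_elements(elements)

# Assumed French-style variant: the question mark is no longer in the
# no-space set, so it gets a space in front of it.
french_style = join_elements(elements, no_space_before={",", "."})

print(english)
print(french_style)
```

    Passing a different set for another language's conventions is just one
    way to parameterize this; a real design would need far more rules
    (quotation marks, parentheses, hyphens, and so on).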

    Such an encoding would probably also include logic to capitalize the
    first word of each sentence, plus the ability to override this logic for
    proper names and non-capitalized sentences. There might also be
    unification of conjugations and declensions (and their equivalents in
    other languages) to conserve space. "Boy" and "boys" might be encoded with
    the same code point, with contextual clues elsewhere in the sentence to
    disambiguate the two.
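
    The capitalization logic and its override might look something like
    the following. This is a sketch under assumptions, not a design: the
    CAP and NOCAP markers, the SENTENCE_END set, and the apply_case
    function are all hypothetical, and words are assumed to be stored in
    lowercase. It does not attempt the "boy"/"boys" unification, which
    would need real linguistic context.

```python
CAP = object()    # hypothetical marker: force a capital on the next word
NOCAP = object()  # hypothetical marker: suppress the automatic capital

# Assumed set of elements that end a sentence.
SENTENCE_END = {".", "?", "!"}


def apply_case(elements):
    """Capitalize the first word of each sentence, honoring overrides."""
    out = []
    start_of_sentence = True
    override = None
    for element in elements:
        if element is CAP or element is NOCAP:
            override = element  # applies to the next real element only
            continue
        if element[0].isalpha():  # treat it as a word
            capitalize = start_of_sentence
            if override is CAP:
                capitalize = True
            elif override is NOCAP:
                capitalize = False
            if capitalize:
                element = element[0].upper() + element[1:]
            start_of_sentence = False
        elif element in SENTENCE_END:
            start_of_sentence = True
        override = None
        out.append(element)
    return out


plain = apply_case(["what", "means", "this", ",", "my", "lord", "?"])
overridden = apply_case(["see", CAP, "fullerton", ".", NOCAP, "ok", "."])

print(plain)
print(overridden)
```

    Here CAP covers the proper-name case and NOCAP covers the
    non-capitalized-sentence case; whether these would be separate code
    points or flags on the word itself is left open.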

    And, of course, there would have to be an escape mechanism to ordinary
    character-based encoding, because such a system will never contain every
    word one might wish to encode, even just for English (think proper names
    again), and because "standard" punctuation and spacing rules don't
    always apply. This is similar to the situation with sign languages,
    which are word- and phrase-based but allow a fallback to fingerspelling.
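
    A minimal sketch of such an escape mechanism, assuming a one-byte
    code per known element and a reserved escape byte followed by a
    length byte and the UTF-8 bytes of the unknown element. The tiny
    VOCAB table, the ESCAPE value 0xFF, and the encode/decode functions
    are all invented for illustration:

```python
# Hypothetical, deliberately tiny word table; a real one would be far larger.
VOCAB = ["what", "means", "this", ",", "my", "lord", "?"]
CODE_OF = {word: code for code, word in enumerate(VOCAB)}

ESCAPE = 0xFF  # assumed reserved byte: "character-based data follows"


def encode(elements):
    """Encode known elements as one byte; escape everything else."""
    out = bytearray()
    for element in elements:
        code = CODE_OF.get(element)
        if code is not None:
            out.append(code)
        else:
            raw = element.encode("utf-8")
            if len(raw) > 255:
                raise ValueError("escaped element too long for this sketch")
            # Escape byte, then a length byte, then the raw characters.
            out += bytes([ESCAPE, len(raw)]) + raw
    return bytes(out)


def decode(data):
    """Reverse encode(): table lookups plus escaped character runs."""
    elements = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESCAPE:
            length = data[i + 1]
            elements.append(data[i + 2 : i + 2 + length].decode("utf-8"))
            i += 2 + length
        else:
            elements.append(VOCAB[byte])
            i += 1
    return elements


# "Ewell" is not in the table, so it falls back to spelled-out characters.
message = ["what", "means", "this", ",", "my", "Ewell", "?"]
encoded = encode(message)
print(len(encoded), decode(encoded))
```

    The six table entries cost one byte each, while the escaped name
    costs two bytes of overhead plus its five characters, which is the
    fingerspelling trade-off in miniature.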

    None of this, however interesting it may be, has anything to do with
    Unicode. Unicode is a system for encoding characters, not words or
    pictures or ideas.

    -Doug Ewell
     Fullerton, California
