Re: Unicode for words?

From: Ray Mullan (ray@mullan.net)
Date: Sun Dec 05 2004 - 05:44:47 CST

    I don't see how the one million available codepoints in the Unicode
    Standard could possibly accommodate a grammatically accurate vocabulary
    of all the world's languages. You're overlooking the question of which
    versions of words -- 'color' or 'colour' in English, for instance --
    would be used in such a system -- or shall we have all of them? There's
    also the matter of words that change depending on their grammatical
    usage: 'teach', meaning 'house' in Gaelic, becomes 'tí' in certain
    cases; 'cep', meaning 'pocket' in Turkish, becomes 'cebim' when it's my
    pocket, 'cebin' when it's your pocket, 'cebi' when it's a third person's
    pocket and 'cebimiz' when it's our pocket -- although heaven knows what
    sort of garment might accommodate 'our pocket' at this laboured stage of
    the point I'm making.

    Mind you, it was a nice idea that had me dreaming for a bit this morning
    -- until the caffeine kicked in, that is.

    Tim Finney wrote:

    > Dear All
    >
    > This is off topic, so feel free to ignore it.
    >
    > The other day I was telling a co-worker about Unicode and how the UTF-8
    > encoding system works. During the far-ranging discussions that followed
    > (we are public servants), my co-worker suggested encoding entire words
    > in Unicode.
    >
    > This sounds like heresy to all of us who know that Unicode is meant only
    > for characters. But wait a minute... Aren't there a whole lot of
    > codepoints that will never be used? 2^31 is a big number. I imagine that
    > it could contain all of the words of all of the languages as well as all
    > of their characters. According to Markus Kuhn's Unicode FAQ
    > (http://www.cl.cam.ac.uk/~mgk25/unicode.html), "Current plans are that
    > there will never be characters assigned outside the 21-bit code space
    > from 0x000000 to 0x10FFFF, which covers a bit over one million potential
    > future characters".
    >
    > So here is the idea: why not use the unused part (2^31 - 2^21 =
    > 2,145,386,496 code points) to encode all the words of all the languages
    > as well? You could then send any word with a few bytes. This would
    > reduce the bandwidth necessary to send text. (You need at most six
    > bytes to address all 2^31 code points, and with a knowledge of word
    > frequencies you could assign the most frequently used words to code
    > points that require smaller numbers of bytes.) Whether text represents a significant
    > proportion of bandwidth use is an important question, but because
    > bandwidth = money, this idea could save quite a lot, even if text only
    > represents a small proportion of the total bandwidth. Phone companies
    > could use encoded words for transmitting SMS messages, thereby saving
    > money on new mobile tower installations, although they are going to put
    > in 3G (video-capable) anyway.
    >
    > All of the machinery (Unicode, UTF-8, web crawlers that can work out
    > what words are used most often) is already there.
    >
    > Surely someone has already thought of this? If not, my co-worker, Zack
    > Alach, deserves the kudos.
    >
    > Best
    >
    > Tim Finney
    >

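    As a rough illustration of the scheme Tim describes above, here is a
    minimal Python sketch. It assumes a hypothetical "word" range starting
    just past U+10FFFF and the original six-byte UTF-8 length rules from
    RFC 2279; the word list, the code point assignments, and the names
    utf8_len, words_by_frequency and FIRST_WORD_CP are all invented for the
    example, not part of Unicode or of Tim's proposal.

        # The arithmetic from Tim's message: code points left over if the
        # original 31-bit UTF-8 space were used beyond the 21-bit Unicode range.
        assert 2**31 - 2**21 == 2_145_386_496

        def utf8_len(cp):
            # Bytes needed for a code point under the original six-byte UTF-8
            # (RFC 2279); modern UTF-8 stops at four bytes / U+10FFFF.
            if cp < 0x80: return 1
            if cp < 0x800: return 2
            if cp < 0x10000: return 3
            if cp < 0x200000: return 4
            if cp < 0x4000000: return 5
            return 6

        # Hypothetical frequency-ordered word list (made-up sample data).
        words_by_frequency = ["the", "of", "and", "bandwidth", "encoding"]

        # First code point past the real Unicode range (an assumption for
        # the sketch, not an assigned or assignable Unicode value).
        FIRST_WORD_CP = 0x110000
        word_to_cp = {w: FIRST_WORD_CP + i for i, w in enumerate(words_by_frequency)}

        for word, cp in word_to_cp.items():
            spelled = sum(utf8_len(ord(c)) for c in word)  # bytes, spelled out
            as_word = utf8_len(cp)                         # bytes, as one code point
            print(f"{word}: {spelled} bytes spelled out vs {as_word} as U+{cp:X}")

    Run as-is, it prints the UTF-8 byte cost of each sample word spelled out
    character by character versus sent as a single hypothetical word code
    point.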

