From: Ray Mullan (ray@mullan.net)
Date: Sun Dec 05 2004 - 05:44:47 CST
I don't see how the one million available codepoints in the Unicode
Standard could possibly accommodate a grammatically accurate vocabulary
of all the world's languages. You're overlooking the question of which
versions of words -- 'color' or 'colour' in English for instance --
would be used in such a system -- or shall we have all of them? There's
also the matter of words that change depending on their grammatical
usage: 'teach' meaning 'house' in Gaelic becomes 'tí' in certain cases,
'cep' meaning 'pocket' in Turkish becomes 'cebim' when it's my pocket,
'cebin' when it's your pocket, 'cebi' when it's a third person's pocket,
and 'cebimiz' when it's our pocket -- although heaven knows what sort
of garment might accommodate 'our pocket' at this laboured stage of the
point I'm making.
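Ray's point can be put in miniature with a short sketch (an editorial illustration, not part of the thread) that derives the four Turkish possessive forms he lists. The softening of final p to b before a vowel-initial suffix is a real Turkish mutation rule; the suffix table covers only the forms named above.

```python
# Sketch: the Turkish possessive forms of "cep" ("pocket") cited above.
# Before a vowel-initial suffix the final p softens to b, so every
# possessor yields a distinct surface word -- each of which would need
# its own code point under a word-encoding scheme.
stem = "cep"
softened = stem[:-1] + "b"  # p -> b before a vowel-initial suffix
possessive = {"my": "im", "your": "in", "his/her": "i", "our": "imiz"}
forms = {owner: softened + suffix for owner, suffix in possessive.items()}
print(forms)
# {'my': 'cebim', 'your': 'cebin', 'his/her': 'cebi', 'our': 'cebimiz'}
```

One stem, four surface words -- and that is before cases, plurals, and the rest of the paradigm multiply it further.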
Mind you, it was a nice idea that had me dreaming for a bit this morning
-- until the caffeine kicked in, that is.
Tim Finney wrote:
> Dear All
>
> This is off topic, so feel free to ignore it.
>
> The other day I was telling a co-worker about Unicode and how the UTF-8
> encoding system works. During the far-ranging discussions that followed
> (we are public servants), my co-worker suggested encoding entire words
> in Unicode.
>
> This sounds like heresy to all of us who know that Unicode is meant only
> for characters. But wait a minute... Aren't there a whole lot of
> codepoints that will never be used? 2^31 is a big number. I imagine that
> it could contain all of the words of all of the languages as well as all
> of their characters. According to Markus Kuhn's Unicode FAQ
> (http://www.cl.cam.ac.uk/~mgk25/unicode.html), "Current plans are that
> there will never be characters assigned outside the 21-bit code space
> from 0x000000 to 0x10FFFF, which covers a bit over one million potential
> future characters".
>
> So here is the idea: why not use the unused part (2^31 - 2^21 =
> 2,145,386,496) to encode all the words of all the languages as well. You
> could then send any word with a few bytes. This would reduce the
> bandwidth necessary to send text. (You need at most six bytes to address
> all 2^31 code points, and with a knowledge of word frequencies could
> assign the most frequently used words to code points that require
> fewer bytes.) Whether text represents a significant
> proportion of bandwidth use is an important question, but because
> bandwidth = money, this idea could save quite a lot, even if text only
> represents a small proportion of the total bandwidth. Phone companies
> could use encoded words for transmitting SMS messages, thereby saving
> money on new mobile tower installations, although they are going to put
> in 3G (video-capable) anyway.
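Tim's six-byte figure matches the original 31-bit UTF-8 design of RFC 2279, before the encoding was restricted to four bytes and U+10FFFF. A minimal sketch of that length rule (an editorial illustration, not part of the thread) shows why frequency-based assignment would matter:

```python
def utf8_len(cp):
    """Bytes needed to encode a code point under the original 31-bit
    UTF-8 scheme of RFC 2279 (since restricted to 4 bytes / U+10FFFF)."""
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for nbytes, limit in enumerate(limits, start=1):
        if cp <= limit:
            return nbytes
    raise ValueError("code point out of 31-bit range")

# A word assigned a value below 2^21 costs at most 4 bytes; the highest
# 31-bit values cost 6 -- hence the case for giving frequent words the
# low code points.
print(utf8_len(0x41), utf8_len(0x10FFFF), utf8_len(0x7FFFFFFF))  # 1 4 6
```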
>
> All of the machinery (Unicode, UTF-8, web crawlers that can work out
> what words are used most often) is already there.
>
> Someone must have already thought of this? If not, my co-worker, Zack
> Alach, deserves the kudos.
>
> Best
>
> Tim Finney
>
>
>
This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 05:52:29 CST