From: John D. Burger (john@mitre.org)
Date: Sun Dec 05 2004 - 20:59:19 CST
> So here is the idea: why not use the unused part (2^31 - 2^21 =
> 2,145,386,496) to encode all the words of all the languages as well.
> You could then send any word with a few bytes. This would reduce the
> bandwidth necessary to send text. (You need at most six bytes to
> address all 2^31 code points, and with a knowledge of word frequencies
> could assign the most frequently used words to code points that
> require smaller numbers of bytes.)
This is called text compression, and it already works pretty well -
better, I think, than the suggested scheme would, given where the
unused code points are (they sit in the high ranges that need the
longest byte sequences).
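To make the comparison concrete, here is a toy sketch - purely
illustrative, with a made-up eight-word vocabulary and whitespace
tokenization, neither of which is from the original proposal - of the
word-to-code-point idea next to plain zlib compression of the UTF-8
bytes:

    # Toy comparison only: a made-up frequency-ranked vocabulary, words
    # split on whitespace, and ranks serialized as LEB128-style
    # variable-length integers (1 byte for frequent words, more for rare).
    import zlib

    vocab = ["the", "of", "and", "to", "word", "encode", "bandwidth", "text"]
    rank = {w: i for i, w in enumerate(vocab)}

    def encode_rank(n):
        # Little-endian, 7 bits per byte; the high bit marks a continuation.
        out = bytearray()
        while True:
            out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
            n >>= 7
            if n == 0:
                return bytes(out)

    text = "the word and the text encode the bandwidth of the text"
    utf8 = text.encode("utf-8")
    word_coded = b"".join(encode_rank(rank[w]) for w in text.split())

    # Which of the last two is smaller depends entirely on the text;
    # on real documents a general-purpose compressor usually does well.
    print(len(utf8), len(word_coded), len(zlib.compress(utf8)))

Even the toy word coder is just a static dictionary compressor, which
is exactly the family of techniques ordinary compressors already
generalize.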
As to encoding all the words in all the languages, 2 billion code
points probably isn't enough - counting scientific terms, some
estimates range to 2 million words in English. Multiply by all the
languages, and you're getting to within a factor of two or so of the
available space.
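Roughly, and with a made-up figure of a couple of thousand written
languages (my number, purely for the arithmetic):

    # Back-of-envelope check of the estimate above; the language count
    # is an assumption, not a measured figure.
    available = 2**31 - 2**21        # 2,145,386,496 unused code points
    words_per_language = 2_000_000   # high-end estimate cited for English
    languages = 2_000                # assumed count of written languages
    print(words_per_language * languages / available)   # ~1.9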
This ignores the fact that languages grow much more quickly than you'd
imagine. I can't find the reference, but Ken Church, I think, did some
estimates using newswire data and found that vocabulary growth does not
seem to asymptote - even the growth =factor= doesn't asymptote.
Finally, this assumes that everyone could agree on what a word is.
Many languages have no explicit word segmentation, e.g., Chinese,
Japanese, Thai. Sorry, I can't find this reference either, but someone
had native speakers segment Chinese text for word boundaries, and there
was substantial disagreement. Even in English, I suspect there would
be some disagreement, e.g., "freeform" vs "free-form" vs "free form".
We can't even always agree on what a character is.
- John Burger
MITRE