From: John D. Burger (john@mitre.org)
Date: Sun Dec 05 2004 - 20:59:19 CST
> So here is the idea: why not use the unused part (2^31 - 2^21 =
> 2,145,386,496) to encode all the words of all the languages as well.
> You could then send any word with a few bytes. This would reduce the
> bandwidth necessary to send text. (You need at most six bytes to
> address all 2^31 code points, and with a knowledge of word frequencies
> could assign the most frequently used words to code points that
> require smaller numbers of bytes.)
This is called text compression, and it already works pretty well -
better, I think, than the suggested scheme would, given where the
unused code points are (they sit in the high ranges that need the
longest byte sequences).
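To make the comparison concrete, here is a toy sketch - purely
illustrative, with a made-up eight-word vocabulary and whitespace
tokenization, neither of which is from the original proposal - of the
word-to-code-point idea next to plain zlib compression of the UTF-8
bytes:

    # Toy comparison only: a made-up frequency-ranked vocabulary, words
    # split on whitespace, and ranks serialized as LEB128-style
    # variable-length integers (1 byte for frequent words, more for rare).
    import zlib

    vocab = ["the", "of", "and", "to", "word", "encode", "bandwidth", "text"]
    rank = {w: i for i, w in enumerate(vocab)}

    def encode_rank(n):
        # Little-endian, 7 bits per byte; the high bit marks a continuation.
        out = bytearray()
        while True:
            out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
            n >>= 7
            if n == 0:
                return bytes(out)

    text = "the word and the text encode the bandwidth of the text"
    utf8 = text.encode("utf-8")
    word_coded = b"".join(encode_rank(rank[w]) for w in text.split())

    # Which of the last two is smaller depends entirely on the text;
    # on real documents a general-purpose compressor usually does well.
    print(len(utf8), len(word_coded), len(zlib.compress(utf8)))

Even the toy word coder is just a static dictionary compressor, which
is exactly the family of techniques ordinary compressors already
generalize.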
As to encoding all the words in all the languages, 2 billion code
points probably isn't enough - counting scientific terms, some
estimates range to 2 million words in English. Multiply by all the
languages, and you're getting to within a factor of two or so of the
available space.
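Roughly, and with a made-up figure of a couple of thousand written
languages (my number, purely for the arithmetic):

    # Back-of-envelope check of the estimate above; the language count
    # is an assumption, not a measured figure.
    available = 2**31 - 2**21        # 2,145,386,496 unused code points
    words_per_language = 2_000_000   # high-end estimate cited for English
    languages = 2_000                # assumed count of written languages
    print(words_per_language * languages / available)   # ~1.9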
This ignores the fact that languages grow much more quickly than you'd
imagine. I can't find the reference, but Ken Church, I think, did some
estimates using newswire data and found that vocabulary growth does not
seem to asymptote - even the growth =factor= doesn't asymptote.
Finally, this assumes that everyone could agree on what a word is.
Many languages have no explicit word segmentation, e.g., Chinese,
Japanese, Thai. Sorry, I can't find this reference either, but someone
had native speakers segment Chinese text for word boundaries, and there
was substantial disagreement. Even in English, I suspect there would
be some disagreement, e.g., "freeform" vs "free-form" vs "free form".
We can't even always agree on what a character is.
- John Burger
MITRE