Re: Unicode for words?

From: John D. Burger (john@mitre.org)
Date: Sun Dec 05 2004 - 20:59:19 CST

    > So here is the idea: why not use the unused part (2^31 - 2^21 =
    > 2,145,386,496) to encode all the words of all the languages as well.
    > You could then send any word with a few bytes. This would reduce the
    > bandwidth necessary to send text. (You need at most six bytes to
    > address all 2^31 code points, and with a knowledge of word
    > frequencies could assign the most frequently used words to code
    > points that require smaller numbers of bytes.)

    This is called text compression, and it already works pretty well -
    better than the suggested scheme would, I think, given where the code
    points are: the proposed word code points would all sit in the high,
    currently unused part of the space, so each one would still cost
    several bytes on the wire.

    As to encoding all the words in all the languages, 2 billion code
    points probably isn't enough - counting scientific terms, some
    estimates range to 2 million words in English. Multiply by all the
    languages, and you're getting to within a factor of two or so of the
    available space.
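    To put rough numbers on that (the per-language figure and the language
    count below are only ballpark assumptions, not data):

        # Back-of-envelope check of the "factor of two or so" claim.
        words_per_language = 2_000_000   # high-end English estimate, incl. technical terms
        written_languages = 1_000        # order-of-magnitude guess
        available = 2**31 - 2**21        # the "unused part" cited above

        needed = words_per_language * written_languages
        print(f"needed    ~ {needed:,}")       # 2,000,000,000
        print(f"available = {available:,}")    # 2,145,386,496
        print(f"ratio     ~ {needed / available:.2f}")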

    This ignores the fact that languages grow much more quickly than you'd
    imagine. I can't find the reference, but Ken Church, I think, did some
    estimates using newswire data and found that vocabulary growth does not
    seem to asymptote - even the growth =factor= doesn't asymptote.
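    If you want to see the effect yourself, a crude sketch along these
    lines will do (Python, reading any large plain-text corpus on stdin;
    the tokenization is deliberately naive):

        import re
        import sys

        vocab = set()
        tokens = 0

        # Print the number of distinct word types every 100,000 tokens;
        # on large corpora the counts keep climbing rather than leveling off.
        for line in sys.stdin:
            for word in re.findall(r"[^\W\d_]+", line.lower()):
                tokens += 1
                vocab.add(word)
                if tokens % 100_000 == 0:
                    print(f"{tokens:>12,} tokens  {len(vocab):>10,} types")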

    Finally, this assumes that everyone could agree on what a word is.
    Many languages have no explicit word segmentation, e.g., Chinese,
    Japanese, Thai. Sorry, I can't find this reference either, but someone
    had native speakers segment Chinese text for word boundaries, and there
    was substantial disagreement. Even in English, I suspect there would
    be some disagreement, e.g., "freeform" vs "free-form" vs "free form".

    We can't even always agree on what a character is.

    - John Burger
       MITRE


