Re: Unicode & space in programming & l10n

From: Doug Ewell (
Date: Thu Sep 21 2006 - 22:39:21 CDT

    Hans Aberg <haberg at math dot su dot se> wrote:

    > So then, why not (if this is not what you are already doing) just take
    > a large English text body, and compute the statistics of the words in
    > it. Then sort the list, putting the more frequent words first, and
    > give the words the number they have in this list. Then apply UTF-8...

    This would be intended as a general-purpose scheme, of course, not for
    the specific purpose I cited, character names, whose vocabulary is
    nowhere near representative of English word frequency.
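
    For concreteness, here is a minimal sketch of the ranking scheme as
    described in the quoted text: count words, sort by frequency, number
    them, and encode each number as UTF-8. The function names, the
    tokenizing regex, and the choice to step over the surrogate range are
    my own assumptions for illustration, not part of any existing scheme.

```python
from collections import Counter
import re


def build_rank_table(corpus: str) -> dict[str, int]:
    """Map each word to its frequency rank (0 = most frequent)."""
    # Assumption: a "word" is a run of ASCII letters or apostrophes.
    words = re.findall(r"[a-z']+", corpus.lower())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(Counter(words).items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _count) in enumerate(ranked)}


def encode_rank(rank: int) -> bytes:
    """Encode a rank as the UTF-8 form of the code point with that number."""
    # UTF-8 cannot encode surrogates (U+D800..U+DFFF), so step over them.
    if rank >= 0xD800:
        rank += 0x800
    return chr(rank).encode("utf-8")


table = build_rank_table("the cat and the dog and the bird")
encoded = b"".join(encode_rank(table[w]) for w in ["the", "dog", "and"])
```

    With this layout the 128 most frequent words cost one byte each, ranks
    128 through 2047 cost two bytes, and later ranks cost three or more.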

    You bring up some interesting points, some of which I've already thought
    of -- particularly the ability to fall back to character-by-character
    spelling of rarer words, just as sign languages include a fallback to
    fingerspelling. One possible pitfall is the number of "common" words in
    English; the more words are assigned tokens, the greater the average (or
    longest) token size. You have to decide where to draw the line.
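
    The fallback and the cutoff trade-off can be sketched as follows. This
    is only an illustration under assumptions of my own: input words are
    ASCII and separated by single spaces, code points below U+0080 are
    reserved for spelling words out literally, and word tokens start at
    U+0080. The names FIRST_WORD_TOKEN, token_for, encode, and decode are
    hypothetical.

```python
FIRST_WORD_TOKEN = 0x80  # assumption: lower code points spell ASCII text literally


def token_for(rank: int) -> str:
    """Return the one-character token for the word with this frequency rank."""
    cp = FIRST_WORD_TOKEN + rank
    if cp >= 0xD800:  # step over the surrogate range, which UTF-8 cannot encode
        cp += 0x800
    return chr(cp)


def encode(text: str, vocab: list[str], cutoff: int) -> bytes:
    """Tokenize the first `cutoff` words of `vocab`; spell everything else."""
    ranks = {word: i for i, word in enumerate(vocab[:cutoff])}
    pieces = []
    for word in text.split():
        if word in ranks:
            pieces.append(token_for(ranks[word]))
        else:
            # Fallback: spell the word letter by letter, ended by a space.
            pieces.append(word + " ")
    return "".join(pieces).encode("utf-8")


def decode(data: bytes, vocab: list[str]) -> str:
    """Invert encode(); needs the same frequency-ordered vocabulary."""
    words, spelled = [], []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp >= FIRST_WORD_TOKEN:
            if cp >= 0xE000:  # undo the surrogate skip
                cp -= 0x800
            words.append(vocab[cp - FIRST_WORD_TOKEN])
        elif ch == " ":
            words.append("".join(spelled))  # end of a spelled-out word
            spelled = []
        else:
            spelled.append(ch)
    return " ".join(words)


vocab = ["the", "of", "and", "unicode"]  # toy list, most frequent first
text = "the names of unicode and zwnj"
small = encode(text, vocab, cutoff=2)
large = encode(text, vocab, cutoff=4)
```

    In this layout the first 1920 word tokens take two bytes each and later
    ones take three, so every extension of the token list past that point
    raises the cost of the words it adds.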

    This is really becoming OT for the Unicode list, but I'll be happy to
    discuss it further in private mail.

    Doug Ewell
    Fullerton, California, USA
    RFC 4645  *  UTN #14
