Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Fri Sep 22 2006 - 07:00:02 CDT


    On 22 Sep 2006, at 05:39, Doug Ewell wrote:

    >> So then, why not (if this is not what you are already doing) just
    >> take a large English text body, and compute the statistics of the
    >> words in it. Then sort the list, putting the more frequent words
    >> first, and give the words the number they have in this list. Then
    >> apply UTF-8...
    >
    > This would be intended as a general-purpose scheme, of course, not
    > for the specific purpose I cited of character names, which are
    > nowhere near representative of English word frequency.

    Well, any compression scheme is only effective on certain types of
    data, so more than one will be needed. One interesting example I once
    saw was that applying a typical compression method to DNA data gave
    0% compression, even though we know that DNA data is highly
    structured; the compression method simply doesn't recognize the
    structure.
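
    To make the quoted idea concrete, here is a rough Python sketch of
    building a frequency-ranked word table and giving each rank a UTF-8-
    style variable-length byte code, so that the more frequent words get
    the shorter codes. The cutoff of 4096 words and the names are just
    illustrative choices for this example, nothing more:

        from collections import Counter

        def build_word_table(corpus_text, cutoff=4096):
            # Rank words by frequency; keep only the `cutoff` most
            # frequent ones.
            counts = Counter(corpus_text.split())
            ranked = [w for w, _ in counts.most_common(cutoff)]
            return {w: i for i, w in enumerate(ranked)}

        def encode_rank(n):
            # UTF-8-style variable-length code: smaller ranks (more
            # frequent words) get fewer bytes.
            if n < 0x80:          # 0xxxxxxx
                return bytes([n])
            if n < 0x800:         # 110xxxxx 10xxxxxx
                return bytes([0xC0 | (n >> 6), 0x80 | (n & 0x3F)])
            if n < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
                return bytes([0xE0 | (n >> 12),
                              0x80 | ((n >> 6) & 0x3F),
                              0x80 | (n & 0x3F)])
            raise ValueError("rank too large for this sketch")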

    > You bring up some interesting points, some of which I've already
    > thought of -- particularly the ability to fall back to character-by-
    > character spelling of rarer words, just as sign languages include a
    > fallback to fingerspelling.

    Yes, you seem to be following the same line of thought.

    > One possible pitfall is the number of "common" words in English;
    > the more words are assigned tokens, the greater the average (or
    > longest) token size. You have to decide where to draw the line.

    Yes, I have thought about what this cutoff might be. I suspect that
    even though English may use hundreds of thousands of words, especially
    when derivations are counted, only a few thousand are frequent and
    long enough to be worth giving a special encoding. If the table is
    made fixed, its size will not matter much on today's and future
    computers; the main concern is then efficient decoding, since encoding
    a word as a token is never strictly necessary, as the character
    encoding can always be used instead.
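
    As a sketch of that fallback, here is some illustrative Python: words
    in the fixed table are emitted as their rank, and anything else is
    spelled out character by character behind a reserved escape token.
    The escape value, the length prefix and the +1 offset are arbitrary
    choices of mine for the example; the resulting token stream would
    then be serialized with a variable-length byte code as above:

        ESCAPE = 0   # reserved token announcing a character-spelled word

        def encode_words(words, table):
            # Frequent words become their rank (+1 keeps 0 free for
            # ESCAPE); everything else is "fingerspelled": ESCAPE,
            # length, then the individual character codes.
            out = []
            for w in words:
                if w in table:
                    out.append(table[w] + 1)
                else:
                    out.append(ESCAPE)
                    out.append(len(w))
                    out.extend(ord(c) for c in w)
            return out

        def decode_words(tokens, ranked_words):
            # Inverse of encode_words; ranked_words[i] is the word of
            # rank i.
            words, i = [], 0
            while i < len(tokens):
                if tokens[i] == ESCAPE:
                    n = tokens[i + 1]
                    words.append(''.join(chr(c)
                                         for c in tokens[i + 2:i + 2 + n]))
                    i += 2 + n
                else:
                    words.append(ranked_words[tokens[i] - 1])
                    i += 1
            return words

    A real scheme would of course also have to carry spaces, punctuation
    and capitalization, which I am glossing over here.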

    > This is really becoming OT for the Unicode list, ...

    The relevance to this list is that there seem to be frequent
    discussions about the compression properties of some of the official
    Unicode character encodings. So it seems to me that, if compression is
    an issue, one might as well provide a few methods that give
    considerably better results.

    > ...but I'll be happy to discuss it further in private mail.

    That is fine too. :-)

       Hans Aberg


