Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Fri Sep 22 2006 - 07:00:02 CDT


    On 22 Sep 2006, at 05:39, Doug Ewell wrote:

    >> So then, why not (if this is not what you are already doing) just
    >> take a large English text body, and compute the statistics of the
    >> words in it. Then sort the list, putting the more frequent words
    >> first, and give the words the number they have in this list. Then
    >> apply UTF-8...
    >
    > This would be intended as a general-purpose scheme, of course, not
    > for the specific purpose I cited of character names, which are
    > nowhere near representative of English word frequency.

    Well, any compression scheme is only effective on certain types of
    data, so more than one will be needed. One interesting example I once
    saw was that applying a typical compression method to DNA data gave
    0% compression, even though we know that DNA data is highly
    structured; the compression method simply doesn't recognize the
    structure.
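
    To make the quoted idea concrete, here is a rough Python sketch of
    building a frequency-ranked word table and giving each rank a UTF-8-
    style variable-length byte code, so that the more frequent words get
    the shorter codes. The cutoff of 4096 words and the names are just
    illustrative choices for this example, nothing more:

        from collections import Counter

        def build_word_table(corpus_text, cutoff=4096):
            # Rank words by frequency; keep only the `cutoff` most
            # frequent ones.
            counts = Counter(corpus_text.split())
            ranked = [w for w, _ in counts.most_common(cutoff)]
            return {w: i for i, w in enumerate(ranked)}

        def encode_rank(n):
            # UTF-8-style variable-length code: smaller ranks (more
            # frequent words) get fewer bytes.
            if n < 0x80:          # 0xxxxxxx
                return bytes([n])
            if n < 0x800:         # 110xxxxx 10xxxxxx
                return bytes([0xC0 | (n >> 6), 0x80 | (n & 0x3F)])
            if n < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
                return bytes([0xE0 | (n >> 12),
                              0x80 | ((n >> 6) & 0x3F),
                              0x80 | (n & 0x3F)])
            raise ValueError("rank too large for this sketch")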

    > You bring up some interesting points, some of which I've already
    > thought of -- particularly the ability to fall back to character-by-
    > character spelling of rarer words, just as sign languages include a
    > fallback to fingerspelling.

    Yes, you seem to be following the same line of thought.

    > One possible pitfall is the number of "common" words in English;
    > the more words are assigned tokens, the greater the average (or
    > longest) token size. You have to decide where to draw the line.

    Yes, I have thought about what this cutoff might be. I suspect that
    even though English may use hundreds of thousands of words, especially
    when derivations are counted, only a few thousand are frequent and
    long enough to be worth giving a special encoding. If the table is
    made fixed, its size will not matter much on today's and future
    computers; the main concern is then efficient decoding, since encoding
    a word as a token is never strictly necessary, as the character
    encoding can always be used instead.
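
    As a sketch of that fallback, here is some illustrative Python: words
    in the fixed table are emitted as their rank, and anything else is
    spelled out character by character behind a reserved escape token.
    The escape value, the length prefix and the +1 offset are arbitrary
    choices of mine for the example; the resulting token stream would
    then be serialized with a variable-length byte code as above:

        ESCAPE = 0   # reserved token announcing a character-spelled word

        def encode_words(words, table):
            # Frequent words become their rank (+1 keeps 0 free for
            # ESCAPE); everything else is "fingerspelled": ESCAPE,
            # length, then the individual character codes.
            out = []
            for w in words:
                if w in table:
                    out.append(table[w] + 1)
                else:
                    out.append(ESCAPE)
                    out.append(len(w))
                    out.extend(ord(c) for c in w)
            return out

        def decode_words(tokens, ranked_words):
            # Inverse of encode_words; ranked_words[i] is the word of
            # rank i.
            words, i = [], 0
            while i < len(tokens):
                if tokens[i] == ESCAPE:
                    n = tokens[i + 1]
                    words.append(''.join(chr(c)
                                         for c in tokens[i + 2:i + 2 + n]))
                    i += 2 + n
                else:
                    words.append(ranked_words[tokens[i] - 1])
                    i += 1
            return words

    A real scheme would of course also have to carry spaces, punctuation
    and capitalization, which I am glossing over here.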

    > This is really becoming OT for the Unicode list, ...

    The relevance to this list is that there seem to be frequent
    discussions about the compression properties of some of the official
    Unicode character encodings. So it seems to me that, if compression is
    an issue, one might as well provide a few methods that give
    considerably better results.

    > ...but I'll be happy to discuss it further in private mail.

    That is fine too. :-)

       Hans Aberg


