RE: length of text by different languages

From: Francois Yergeau (FYergeau@alis.com)
Date: Wed Mar 05 2003 - 21:09:41 EST


    ftang@netscape.com wrote:
    > I remember there was some study showing that although UTF-8 encodes each
    > Japanese/Chinese character in 3 bytes, Japanese/Chinese usually uses
    > FEWER characters in writing to communicate information than
    > alphabet-based languages.
    >
    > Can anyone point me to such research?

    I don't know of anything that is exactly what you want, but I vaguely
    remember a paper given at a Unicode conference long ago that compared
    various translations of the charter (or some such) of the Voice of America
    in two or three encodings. Hmmmm, let's see.... could be this:

    http://www.unicode.org/iuc/iuc9/Friday2.html#b3
    Reuters Compression Scheme for Unicode (RCSU)
    Misha Wolf

    No paper online, alas. I remember that Chinese was a clear winner in terms
    of number of characters. In fact, I kind of remember that Chinese was so
    much denser that it still won after RCSU (now SCSU) compression. Since SCSU
    typically stores a Latin letter in one byte and a Han character in two,
    that would mean that a Han character contains more than twice as much info
    on average as a Latin letter as used in (say) English.
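
    For anyone who wants to try this kind of comparison on their own parallel
    texts, here is a minimal sketch in Python. The two sample sentences and the
    helper name size_report are assumptions made up for illustration; they are
    not data from the paper, and one toy pair says nothing about the languages
    in general. SCSU is not covered, since Python's standard library has no
    SCSU codec.

```python
# Compare character count and encoded size for parallel texts.
# The sample sentences are illustrative assumptions, not data from the paper.

def size_report(text: str) -> dict:
    """Return the code point count and encoded byte lengths of text."""
    return {
        "chars": len(text),                      # Unicode code points
        "utf8": len(text.encode("utf-8")),       # 1 byte per ASCII, 3 per BMP Han
        "utf16": len(text.encode("utf-16-le")),  # 2 bytes per BMP character, no BOM
    }

samples = {
    "English": "I am a student.",
    "Chinese": "我是学生。",
}

for name, text in samples.items():
    print(name, size_report(text))
```

    In this toy pair the Chinese sentence uses a third as many characters, ties
    with English in UTF-8, and is a third the size in UTF-16. A real comparison
    would need long translated texts, such as the charter mentioned above.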

    This is all on pretty shaky ground, distant memories. Perhaps Misha still
    has the figures (if that's in fact the right paper).

    -- 
    François
    

