RE: length of text by different languages

From: Francois Yergeau
Date: Wed Mar 05 2003 - 21:09:41 EST

    > I remember there was some study showing that although UTF-8 encodes each
    > Japanese/Chinese character in 3 bytes, Japanese/Chinese usually use
    > FEWER characters in writing to communicate information than
    > alphabet-based languages.
    > Can anyone point me to such research?

    I don't know of exactly what you want, but I vaguely remember a paper given
    at a Unicode conference long ago that compared various translations of the
    charter (or some such) of the Voice of America in a couple or three
    encodings. Hmmmm, let's see.... could be this:
    Reuters Compression Scheme for Unicode (RCSU)
    Misha Wolf

    No paper online, alas. I remember that Chinese was a clear winner in terms
    of # of characters. In fact, I kind of remember that Chinese was so much
    denser that it still won after RCSU (now SCSU) compression, which would mean
    that a Han character contains more than twice as much info on average as a
    Latin letter as used in (say) English.
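
The "more than twice" inference rests on simple arithmetic: SCSU stores a Latin letter in roughly one byte and a Han character in roughly two, so Chinese can only stay smaller after compression if it needs fewer than half as many characters. A minimal sketch of that reasoning follows. The sample strings and the helper names (`size_report`, `han_wins_at_two_bytes`) are assumptions made up for illustration, not data from the paper, and the one-byte/two-byte model ignores SCSU's mode-switching tag bytes. Python's standard library has no SCSU codec, so only UTF-8 and UTF-16 sizes are actually measured.

```python
def size_report(text: str) -> dict:
    """Return the character count and encoded sizes of a string."""
    return {
        "chars": len(text),                             # Unicode code points
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-be")),   # big-endian, no BOM
    }


def han_wins_at_two_bytes(latin_chars: int, han_chars: int) -> bool:
    """Approximate SCSU model (an assumption): 1 byte per Latin letter,
    2 bytes per Han character, tag bytes ignored."""
    return 2 * han_chars < 1 * latin_chars


# Hypothetical sample pair, chosen only to illustrate the arithmetic.
english = "Hello, world"
chinese = "你好，世界"

en = size_report(english)
zh = size_report(chinese)

print("English:", en)
print("Chinese:", zh)
print("Chinese smaller under the 1-byte/2-byte model:",
      han_wins_at_two_bytes(en["chars"], zh["chars"]))
```

For this toy pair the Chinese string uses fewer characters but more UTF-8 bytes, while under the one-byte/two-byte model it comes out smaller because the English needs more than twice as many characters. A pair this short says nothing about real text; the figures from the paper would be needed for that.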

    This is all on pretty shaky ground, distant memories. Perhaps Misha still
    has the figures (if that's in fact the right paper).
