Re: length of text by different languages

From: Jon Babcock (jon@kanji.com)
Date: Thu Mar 06 2003 - 11:05:13 EST

    Yung-Fong Tang wrote:
    > I remember there was some study showing that although UTF-8 encodes
    > each Japanese/Chinese character in 3 bytes, Japanese/Chinese usually
    > use FEWER characters in writing to communicate information than
    > alphabet-based languages.

    For my commercial Japanese-to-English translation work, I
    estimate from 2.3 to 3.2 Japanese characters for one word of
    English, with an English word counted as 6 characters. It varies
    depending on the kanji-to-kana ratio in the source text.

    For commercial contemporary Chinese-to-English translation, I
    estimate 1.4 to 1.8 Chinese characters per English word,
    estimated at 6 characters. (I just asked about this on a mailing
    list devoted to C-E/E-C translation and the one translator who
    responded said he uses 1.62 Chinese characters per English word
    which agrees with my experience.)

    Since a "word" is probably about the smallest chunk of meaning
    you're going to find, this would suggest that where it takes 6
    bytes to encode a word of English at one byte per character, at
    3 bytes per character it will take from about 4.2 to 5.4 bytes
    (1.4 to 1.8 characters) to encode the equivalent in Chinese, I
    guess.
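
    A rough sketch of that arithmetic, for anyone who wants to plug in
    their own ratios. It assumes, as above, 6 one-byte characters per
    English word and 3 UTF-8 bytes per Chinese or Japanese character;
    the function and variable names are only illustrative.

```python
# Assumptions taken from the estimates in this message:
# an English word is counted as 6 one-byte characters, and each
# Chinese/Japanese character takes 3 bytes in UTF-8.
ENGLISH_BYTES_PER_WORD = 6
UTF8_BYTES_PER_CJK_CHAR = 3

# Characters needed per English word (low, high), as estimated above.
CHARS_PER_ENGLISH_WORD = {
    "Chinese": (1.4, 1.8),
    "Japanese": (2.3, 3.2),
}


def utf8_bytes_per_english_word(chars_per_word: float) -> float:
    """UTF-8 bytes needed to express one English word's worth of text."""
    return chars_per_word * UTF8_BYTES_PER_CJK_CHAR


if __name__ == "__main__":
    # Sanity check: a kanji and a kana really are 3 bytes each in UTF-8.
    assert len("漢".encode("utf-8")) == 3
    assert len("か".encode("utf-8")) == 3

    print(f"English: {ENGLISH_BYTES_PER_WORD} bytes per word")
    for language, (low, high) in CHARS_PER_ENGLISH_WORD.items():
        low_bytes = utf8_bytes_per_english_word(low)
        high_bytes = utf8_bytes_per_english_word(high)
        print(f"{language}: {low_bytes:.1f} to {high_bytes:.1f} bytes per word")
```

    On these figures Chinese comes out a little under the 6 bytes of
    English, while Japanese comes out somewhat over. Real English text
    would also need a byte for the space between words, which this
    sketch ignores.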

    The above applies to contemporary (modern) traditional Chinese.
    I don't know if there is a practical difference in efficiency
    between traditional and simplified. But from my experience with
    classical Chinese, I would guess that most classical Chinese is
    at least twice as efficient as modern Chinese. (This, plus its
    freedom from any tight dependence on sound, facilitated its
    great success as the language of culture throughout the
    traditional kanji culture realm --- China, Korea, Japan,
    Vietnam, etc., imo.)

    FWIW,

    Jon

    -- 
    Jon Babcock <jon@kanji.com>
    

