length of text by different languages

From: Yung-Fong Tang (ftang@netscape.com)
Date: Wed Mar 05 2003 - 18:55:18 EST


    I remember there was a study showing that, although UTF-8 encodes each
    Japanese/Chinese character in 3 bytes, Japanese and Chinese usually use
    FEWER characters in writing to communicate the same information than
    alphabet-based languages.

    Can anyone point me to such research? Martin, do you have a paper
    about that?

    I would like to find out the average ratio between

    in terms of the number of characters, and in terms of the bytes needed to
    encode in UTF-8.

    If no such research has been done, maybe one way to figure out the
    result is to take the translated Bibles for these languages from the
    SWORD project, strip out the XML tags to leave the pure text, and measure
    the size. Since all the Bible translations communicate the same
    information and the volume is large enough, that could be a good way to
    find out the result. Of course, the markup needs to be taken out to
    reduce the noise.
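
    The measurement could be sketched roughly as below, in Python. This is
    only an illustration under assumptions: the function names are made up,
    the regex is a crude tag stripper rather than a real XML parser, and the
    two sample strings are placeholders, not actual text from any Bible
    module. Note that len() counts Unicode code points, which is one possible
    reading of "number of characters".

```python
import re

# Assumption: anything between '<' and '>' is markup. A real module may need
# a proper XML parser (entities, comments, CDATA are not handled here).
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Remove tag-like spans, then collapse runs of whitespace to one space."""
    text = TAG_RE.sub("", markup)
    return " ".join(text.split())


def measure(markup: str) -> tuple[int, int]:
    """Return (code point count, UTF-8 byte count) of the stripped text."""
    text = strip_tags(markup)
    return len(text), len(text.encode("utf-8"))


def ratios(markup_a: str, markup_b: str) -> tuple[float, float]:
    """Return (character ratio, UTF-8 byte ratio) of text A relative to text B."""
    chars_a, bytes_a = measure(markup_a)
    chars_b, bytes_b = measure(markup_b)
    return chars_a / chars_b, bytes_a / bytes_b


if __name__ == "__main__":
    # Placeholder samples only; a real comparison needs full parallel texts.
    english = "<verse>In the beginning</verse>"
    japanese = "<verse>初めに</verse>"
    print(measure(english))
    print(measure(japanese))
    print(ratios(japanese, english))
```

    Running the same functions over whole translations, rather than a single
    phrase, would give the average ratios asked about above.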
