Re: length of text by different languages

From: Ram Viswanadha (ram@jtcsv.com)
Date: Thu Mar 06 2003 - 18:23:21 EST

Next message: John Hudson: "Re: The display of *kholam* on PCs"

Previous message: John H. Jenkins: "Re: The display of *kholam* on PCs"
In reply to: Yung-Fong Tang: "Re: length of text by different languages"
Next in thread: Yung-Fong Tang: "Re: length of text by different languages"
Reply: Yung-Fong Tang: "Re: length of text by different languages"

There is also some information at
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results

Not sure if this is what you are looking for.

Regards,

Ram Viswanadha
  ----- Original Message -----
  From: Yung-Fong Tang
  To: Francois Yergeau
  Cc: unicode@unicode.org
  Sent: Thursday, March 06, 2003 2:33 PM
  Subject: Re: length of text by different languages

Francois Yergeau wrote:

ftang@netscape.com wrote:

I remember there was a study showing that, although UTF-8 encodes each
Japanese/Chinese character in 3 bytes, Japanese/Chinese usually uses
FEWER characters in writing to communicate information than
alphabet-based languages.

Can anyone point me to such research?
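
For anyone who wants to check this on their own parallel texts, here is a minimal sketch in Python. The helper name `text_lengths` and the two sample strings are illustrative assumptions, not taken from any study; a meaningful comparison would need long, carefully matched translations.

```python
def text_lengths(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# Illustrative pair only (an assumption for this sketch, not study data).
samples = {
    "English": "Hello, world",
    "Chinese": "你好，世界",
}

for language, text in samples.items():
    chars, utf8_bytes = text_lengths(text)
    print(f"{language}: {chars} characters, {utf8_bytes} UTF-8 bytes")
```

Note that `len()` counts code points, which is close to but not always the same as what a reader would call a character (combining sequences, for example, count more than once).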

I don't know of exactly what you want, but I vaguely remember a paper given
at a Unicode conference long ago that compared various translations of the
charter (or some such) of the Voice of America in a couple or three
encodings. Hmmmm, let's see.... could be this:

http://www.unicode.org/iuc/iuc9/Friday2.html#b3
Reuters Compression Scheme for Unicode (RCSU)
Misha Wolf
Yes, that could be it. I have a hard copy, and it looks like Fig. 2 is the one I am looking for.

No paper online, alas. I remember that Chinese was a clear winner in terms
of # of characters. In fact, I kind of remember that Chinese was so much
denser that it still won after RCSU (now SCSU) compression, which would mean
that a Han character contains more than twice as much info on average as a
Latin letter as used in (say) English.
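
The "more than twice" inference can be sketched as simple arithmetic. This assumes, roughly, that compressed Latin text costs about 1 byte per character and compressed Han text about 2 bytes per character; those per-character costs, the function names, and the sample counts below are assumptions for illustration, not figures from the paper.

```python
# Assumed approximate costs after SCSU-style compression (not measured data).
LATIN_BYTES_PER_CHAR = 1  # Latin text in a single-byte mode
HAN_BYTES_PER_CHAR = 2    # Han text in a two-byte mode


def han_smaller_after_compression(latin_chars: int, han_chars: int) -> bool:
    """True if the Han version is still smaller in bytes once compressed."""
    return han_chars * HAN_BYTES_PER_CHAR < latin_chars * LATIN_BYTES_PER_CHAR


def latin_letters_per_han_char(latin_chars: int, han_chars: int) -> float:
    """How many Latin letters carry the same content as one Han character."""
    return latin_chars / han_chars


# Hypothetical counts for the same text in two languages.
latin_chars, han_chars = 1000, 400
print(han_smaller_after_compression(latin_chars, han_chars))
print(latin_letters_per_han_char(latin_chars, han_chars))
```

Under these assumptions, the Han version wins on compressed size exactly when the Latin version needs more than twice as many characters, which is where the "more than twice as much info" estimate comes from.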

This is all on pretty shaky ground, distant memories. Perhaps Misha still
has the figures (if that's in fact the right paper).

This archive was generated by hypermail 2.1.5 : Thu Mar 06 2003 - 19:34:47 EST