Re: Devanagari

From: David Starner (starner@okstate.edu)
Date: Mon Jan 21 2002 - 00:19:28 EST


On Sun, Jan 20, 2002 at 10:44:00PM -0500, Aman Chawla wrote:
> For sites providing archives
> of documents/manuscripts (in plain text) in Devanagari, this factor could be
> as high as approx. 3 using UTF-8 and around 1 using ISCII.

Uncompressed, yes. Compressed - with gzip, zip, bzip2, or whatever your
favorite tool is - it shouldn't be nearly as bad. You could also use
UTF-16 or SCSU, which bring the factor down to about 2 or about 1,
respectively.
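A quick Python sketch makes the factors concrete. The sample word below is my own illustration, not from the thread; Devanagari code points (U+0900-U+097F) take three bytes each in UTF-8 and two in UTF-16, versus one byte per character in ISCII:

```python
# Compare per-character storage cost of a Devanagari string in
# UTF-8 vs UTF-16. Sample word: "नमस्ते" (namaste), 6 code points.
text = "नमस्ते"
n_chars = len(text)

utf8_bytes = text.encode("utf-8")        # 3 bytes per code point here
utf16_bytes = text.encode("utf-16-le")   # 2 bytes per code point (BMP)

print(len(utf8_bytes) / n_chars)   # 3.0 -- roughly 3x an ISCII byte count
print(len(utf16_bytes) / n_chars)  # 2.0 -- roughly 2x
```

Running a general-purpose compressor like gzip over the UTF-8 text closes most of the remaining gap, since the redundancy of the three-byte sequences compresses well.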

What's your point in continuing this? Most of the people on this list
already know how UTF-8 can expand the size of non-English text. There's
nothing we can do about it. Even if you had brought it up when UTF-8
was being designed, there's not much anyone could have done about it.
There is no simple encoding scheme that will encode Indic text in
Unicode in one byte per character.

It's the pigeonhole principle in action - if you need to encode 150,000
characters, you can't encode each one in one or two bytes, and while you
can write encodings that approach that for normal text, they aren't
going to be simple or pretty.
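The arithmetic behind the pigeonhole argument is easy to check (the 150,000 figure is the one cited above):

```python
# A fixed-width one- or two-byte code cannot address ~150,000
# distinct characters: there simply aren't enough code values.
one_byte_codes = 2 ** 8    # 256 possible values
two_byte_codes = 2 ** 16   # 65,536 possible values

assert two_byte_codes < 150_000  # even two bytes fall short
```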

-- 
David Starner - starner@okstate.edu, dvdeug/jabber.com (Jabber)
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - "Freakin' Friends"



This archive was generated by hypermail 2.1.2 : Sun Jan 20 2002 - 23:44:54 EST