Problems using Unicode to display Indic scripts?

From: Feather Simon (Simon.Feather@icl.com)
Date: Thu Mar 02 2000 - 06:00:29 EST

Next message: Michael Everson: "Updated Southeast Asian scripts page"
Previous message: Dan Oscarsson: "Re: Rationale wanted for Unicode identifier rules"
Next in thread: Marco.Cimarosti@icl.com: "RE: Problems using Unicode to display Indic scripts?"
Maybe reply: Marco.Cimarosti@icl.com: "RE: Problems using Unicode to display Indic scripts?"
Maybe reply: F. Avery Bishop: "RE: Problems using Unicode to display Indic scripts?"

I have a problem using Unicode to display Indic scripts (Bengali, Gujarati,
etc.). Can anyone please help explain what is happening, and what the correct
behaviour *should* be?

The symptom:

We have some text in various Indic scripts that has been prepared in
Unitype's Globalwriter, and then exported to HTML. The resultant HTML
document contains UTF-8 encoded Unicode data. We are displaying the result
on web browsers, using Microsoft's Arial Universal MT Unicode font.

What we find is that a number of characters are coming out with their
component glyphs reversed. An example: in Globalwriter, we type the Bengali
Dha followed by Bengali vowel sign I; the software switches the two symbols
round on screen and kerns them together so that the vowel sign comes first.
This results in the following sequence of code points (in hexadecimal) in the
UTF-8 encoded data:

09A6 09BF

Displaying this in IE shows Dha first then the vowel sign - which of course
is nonsense. If I use a hex editor to patch the data so that the sequence
is instead 09BF 09A6, then the glyphs are kerned together in the right order
to form the correct character.
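
To make the two values concrete, here is a minimal Python sketch that lists
each code point with its name and its UTF-8 bytes. It assumes only Python's
standard unicodedata module; the helper name describe() is made up for this
illustration. Note that the Unicode character database names U+09A6 as
BENGALI LETTER DA, so the letter called Dha above may be catalogued under a
different name.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes as hex) per character."""
    rows = []
    for ch in text:
        utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
        rows.append((f"U+{ord(ch):04X}", unicodedata.name(ch), utf8_hex))
    return rows

# The sequence as stored in the exported HTML: consonant, then vowel sign.
stored = "\u09A6\u09BF"

for code_point, name, utf8_hex in describe(stored):
    print(code_point, name, utf8_hex, sep="  ")
```

Each of the two code points occupies three bytes in the file, so a hex editor
shows six bytes for the pair rather than the four hex digits per character
quoted above.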

We see the same thing happening with quite a number of characters in several
of the Indic scripts.

Questions:

a) are these "combining characters" or "graphemes"?
b) if they are combining characters, then according to my understanding of
combination, we should be seeing different canonical combining classes for
these two characters, but the tables give both of them combining class zero,
i.e. they are both starters - surely this cannot be right?
c) should it matter which order they appear in the UTF-8 data? If they
are combining characters, then again according to my understanding AB does
not equal BA, which is I think what we're seeing here.
d) and if the order matters, then is it Unitype that has it wrong or the
display mechanism?
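
The properties behind questions (a) to (c) can be checked directly. The
following is only a sketch, assuming Python's standard unicodedata module;
the helper names properties() and same_after_normalization() are invented
here. It reports the general category and canonical combining class of each
character, and tests whether normalization treats the two orderings as
equivalent. It says nothing about how a particular font or browser renders
the pair.

```python
import unicodedata

def properties(ch):
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)

def same_after_normalization(a, b, form="NFC"):
    """True if the two strings are identical once normalized to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

consonant = "\u09A6"    # the Bengali consonant from the example
vowel_sign = "\u09BF"   # Bengali vowel sign I

typed = consonant + vowel_sign     # order produced by the editor
patched = vowel_sign + consonant   # order after patching with a hex editor

print(properties(consonant))    # category and combining class of the letter
print(properties(vowel_sign))   # category and combining class of the sign
print(same_after_normalization(typed, patched))
```

On the Unicode data shipped with Python, the vowel sign is reported as a
spacing combining mark (category Mc) with combining class 0. Because both
characters have class 0, canonical reordering never swaps them, so the two
orderings remain distinct strings under normalization.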

Any help gratefully received

Many thanks

Regards

Simon Feather
ICL
Phone: 01344 472003
Fax: 01344 473008
Mobile: 07867 824477
Email: simon.feather@icl.com

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT