I have a problem using Unicode to display Indic scripts (Bengali, Gujarati,
etc.). Can anyone please help explain what is happening and what the correct
behaviour *should* be?
We have some text in various Indic scripts that has been prepared in
Unitype's Globalwriter, and then exported to HTML. The resultant HTML
document contains UTF-8 encoded Unicode data. We are displaying the result
on web browsers, using Microsoft's Arial Unicode MS font.
What we find is that a number of characters are coming out with their
component glyphs reversed. An example: in Globalwriter, we type the Bengali
Dha followed by Bengali vowel sign I; the software switches the two symbols
round on screen and kerns them together so that the vowel sign comes first.
This results in the following code sequence in the UTF-8 data:
Displaying this in IE shows Dha first then the vowel sign - which of course
is nonsense. If I use a hex editor to patch the data so that the sequence
is instead 09BF 09A6, then the glyphs are kerned together in the right order
to form the correct character.
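
For anyone wanting to reproduce this, here is a minimal Python sketch that prints the code points and UTF-8 bytes for both orderings. It assumes the two code points are 09A6 and 09BF, taken from the patched sequence quoted above, and that the exported data holds them in the opposite order. The helper names are purely illustrative. Note that the Unicode character database names U+09A6 "BENGALI LETTER DA".

```python
import unicodedata

# Code points taken from the patched sequence quoted in the post (assumption:
# the exported file contains the same two code points in the opposite order).
CONSONANT = "\u09A6"      # named BENGALI LETTER DA in the Unicode database
VOWEL_SIGN_I = "\u09BF"   # named BENGALI VOWEL SIGN I

def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

as_typed = CONSONANT + VOWEL_SIGN_I   # consonant first, as typed
patched = VOWEL_SIGN_I + CONSONANT    # vowel sign first, as hex-patched

for label, text in (("as typed", as_typed), ("patched", patched)):
    print(label, describe(text), utf8_hex(text))
```

Comparing this output against a hex dump of the exported HTML shows which of the two orderings the file actually contains.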
We see the same thing happening with quite a number of characters in several
of the Indic scripts.
a) are these "combining characters" or "graphemes"?
b) if they are combining characters, then according to my understanding of
combination, we should be seeing different canonical combining classes for
these two characters, but the tables have them both with combining class
zero, i.e. they are both starters - surely this cannot be right?
c) should it matter which order they appear in the UTF-8 data? If they
are combining characters, then again according to my understanding AB does
not equal BA, which is I think what we're seeing here.
d) and if the order matters, then is it Unitype that has it wrong or the
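
The properties behind questions b) and c) can be checked directly with Python's standard unicodedata module. This is only a sketch: it assumes the same two code points as in the patched sequence above (09A6 and 09BF), and the function names are made up for illustration. It reports each character's general category and canonical combining class, then tests whether the two orderings are canonically equivalent under normalization.

```python
import unicodedata

CONSONANT = "\u09A6"      # code point from the post (assumed to be the consonant)
VOWEL_SIGN_I = "\u09BF"   # code point from the post (assumed to be the vowel sign)

def properties(ch):
    """Return the general category and canonical combining class of ch."""
    return {
        "category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
    }

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

consonant_first = CONSONANT + VOWEL_SIGN_I
vowel_first = VOWEL_SIGN_I + CONSONANT

print(properties(CONSONANT))
print(properties(VOWEL_SIGN_I))
print(canonically_equivalent(consonant_first, vowel_first))
```

Canonical reordering only moves marks whose combining class is non-zero, so when both characters report class zero, normalization leaves each ordering as it is and the two sequences stay distinct.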
Any help gratefully received
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT