RE: Problems using Unicode to display Indic scripts?

From: F. Avery Bishop (averyb@Exchange.Microsoft.com)
Date: Thu Mar 02 2000 - 11:48:13 EST


Microsoft Internet Explorer 5.0 is enabled for only Devanagari and Tamil, so
this is expected behavior when you input Bengali, Malayalam, Oriya, Telugu,
Kannada, or any Indian script other than the two supported scripts.

There are plans to support other Indian scripts over time, but I don't know
the schedule.

F. Avery Bishop
Program Manager, Speech API
averyb@microsoft.com

-----Original Message-----
From: Marco.Cimarosti@icl.com [mailto:Marco.Cimarosti@icl.com]
Sent: Thursday, March 02, 2000 4:23 AM
To: Unicode List
Subject: RE: Problems using Unicode to display Indic scripts?

My colleague Simon Feather wrote:

> I have a problem using Unicode to display Indic scripts (Bengali, Gujerati,
> etc.). Can anyone please help explain what is happening, what the correct
> behaviour *should* be?

See "ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html": it has
all the answers to your questions, although in a very succinct way.

> a) are these "combining characters" or "graphemes"?

U+09A6 (BENGALI LETTER DA) is not a combining character, because its general
category is "Lo" (Other Letter). U+09BF (BENGALI VOWEL SIGN I) is a combining
character, because its general category is "Mc" (Spacing Combining Mark).
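
As a quick way to check this, here is a minimal sketch in Python (my choice of
language, not something used in the original thread). It assumes the standard
unicodedata module, which reflects a later Unicode version than 3.0, and the
helper name is_combining is hypothetical. It treats any general category
starting with "M" (Mn, Mc, Me) as combining:

```python
import unicodedata

def is_combining(ch: str) -> bool:
    """Hypothetical helper: True if the general category is a mark (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

da = "\u09A6"      # BENGALI LETTER DA
sign_i = "\u09BF"  # BENGALI VOWEL SIGN I

for ch in (da, sign_i):
    # Print code point, character name, general category, and the verdict.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{unicodedata.category(ch)}, combining={is_combining(ch)}")
```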

By "grapheme" I mean "any written symbol", so of course they both are. Am I
perhaps missing a special technical meaning of the term here?
 
> b) if they are combining characters, then according to my understanding of
> combination, we should be seeing different canonical values for these two
> characters, but the tables have them both with canonical order zero i.e.
> they are both starters - surely this cannot be right?

Whether a character is combining or not is determined by its general
category, not by its canonical combining class.

The canonical combining class "0" means several different things: "spacing,
split, enclosing, reordrant, and Tibetan subjoined". U+09A6 (BENGALI LETTER
DA) is a "spacing" character, while U+09BF (BENGALI VOWEL SIGN I) falls in
the "reordrant" case.
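
The difference between the two properties can be sketched in Python (an
assumed language for illustration), using unicodedata.combining() from the
standard library. U+0300 (COMBINING GRAVE ACCENT) is not part of the question;
it is added here only as a contrast, since it has a non-zero combining class:

```python
import unicodedata

da = "\u09A6"      # BENGALI LETTER DA
sign_i = "\u09BF"  # BENGALI VOWEL SIGN I
grave = "\u0300"   # COMBINING GRAVE ACCENT, added only for contrast

# Map each code point to its canonical combining class.
classes = {f"U+{ord(ch):04X}": unicodedata.combining(ch)
           for ch in (da, sign_i, grave)}
print(classes)
```

Both Bengali characters report class 0 even though only one of them is a
letter, which is why the combining class alone cannot tell you whether a
character is combining.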

> c) should it matter which order they appear in the UTF-8 data? If they
> are combining characters, then again according to my understanding AB does
> not equal BA, which is I think what we're seeing here.

In UTF-8 (or in any other form of Unicode) the Bengali syllable "di" should
be spelled as:

        U+09A6 U+09BF

and displayed in reverse order, as:

        [glyph for U+09BF] [glyph for U+09A6]

This is why U+09BF (BENGALI VOWEL SIGN I) is called "reordrant".
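
A small Python sketch (language and variable names are assumptions for
illustration) shows the stored order in UTF-8. It also shows that Unicode
normalization leaves this sequence alone, so any reordering has to happen in
the rendering engine and not in the data:

```python
import unicodedata

da = "\u09A6"      # BENGALI LETTER DA
sign_i = "\u09BF"  # BENGALI VOWEL SIGN I

# Logical (stored) order: consonant first, then the vowel sign.
di = da + sign_i

# UTF-8 bytes of the syllable, as space-separated hex.
utf8_hex = " ".join(f"{b:02x}" for b in di.encode("utf-8"))
print(utf8_hex)  # e0 a6 a6 e0 a6 bf

# Both characters have combining class 0, so no normalization form swaps them.
unchanged = all(unicodedata.normalize(form, di) == di
                for form in ("NFC", "NFD", "NFKC", "NFKD"))
print(unchanged)  # True

# The visually ordered string (vowel sign first) is different data.
print(sign_i + da == di)  # False
```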

> d) and if the order matters, then is it Unitype that has it wrong or the
> display mechanism?

I would say that Unitype Globalwriter behaves itself properly, while your
browser is the naughty guy.

Ciao.
        Marco


