RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Oct 08 2003 - 03:58:11 CST


Gautam Sengupta wrote:
> Is there any reason (apart from trying to be
> ISCII-conformant) why the Bangla word /ki/ "what"
> cannot be encoded as [KA][ZWJ][I]? Do we really need
> combining forms of vowels to encode Indian scripts?

Perhaps you are right that it *would* have been a cleaner design to have
only one set of vowels.

But notice that <KA><+><I> is one character longer than <KA><+I>. Maybe
storage space is not a big problem these days, but it still costs 2 to 4
extra bytes (depending on the encoding form) for each consonant not followed
by the inherent vowel /a/.
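
To make the arithmetic concrete, here is a small Python sketch that measures the overhead in the three Unicode encoding forms. It assumes that <+> would be ZWJ (U+200D), as in Gautam's proposal, and that <I> and <+I> are BENGALI LETTER I (U+0987) and BENGALI VOWEL SIGN I (U+09BF); the variable names are mine, chosen for illustration only.

```python
# Code points from the Bengali block, plus ZWJ.
KA = "\u0995"        # BENGALI LETTER KA
LETTER_I = "\u0987"  # BENGALI LETTER I (independent vowel)
SIGN_I = "\u09BF"    # BENGALI VOWEL SIGN I (combining vowel)
ZWJ = "\u200D"       # ZERO WIDTH JOINER

current = KA + SIGN_I            # <KA><+I>: how /ki/ is encoded today
proposed = KA + ZWJ + LETTER_I   # <KA><+><I>: the hypothetical alternative


def extra_bytes(encoding):
    """Bytes the proposed sequence needs beyond the current one."""
    return len(proposed.encode(encoding)) - len(current.encode(encoding))


# The "-le" codecs are used so that no byte order mark is counted.
overhead = {enc: extra_bytes(enc) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(overhead)  # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
```

So the cost is 2 bytes in UTF-16, 3 in UTF-8 and 4 in UTF-32 per affected syllable.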

Perhaps it *would* have been better to have only the combining vowels, and
to form independent vowels with a "mute consonant" (actually, the
independent vowel "a").

> Also, why not use [CONS][ZWJ][CONS] instead of
> [CONS][VIRAMA][CONS]? One could then use [VIRAMA] only
> where it is explicit/visible.

OK. But what happens when the font does not have a glyph for the ligature
<cons><ZWJ><cons>, nor for the half consonant <cons><ZWJ>, nor for the
subjoined consonant <ZWJ><cons>?

As <ZWJ>, per se, is an invisible character, what happens is that your
string displays as <cons><cons>, which is clearly semantically incorrect. If
you want the explicit virama to be visible, you need to encode it as
<cons><VIRAMA><cons>.

And this means that you (the author of the text) are forced to choose between
<ZWJ> and <VIRAMA> based on the availability of glyphs in the *particular*
font that you are using while typing. And this is a big no-no, because it
would prevent you from changing the font without re-typing part of the text.

What happens with the current Unicode scheme is that, if the font does not
have a glyph for the ligature <cons><VIRAMA><cons>, nor for the half
consonant <cons><VIRAMA>, nor for the subjoined consonant <VIRAMA><cons>,
the virama is *automatically* displayed visibly, so that the semantics of
the text is always safe, even if rendered with the most stupid of fonts.
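
The difference between the two schemes can be sketched with a toy model in Python. This is not how any real rendering engine is implemented: the function `render`, the glyph names such as `lig(...)` and the consonant labels are all hypothetical, and the only thing the model captures is the fallback order described above.

```python
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def render(c1, joiner, c2, font_glyphs):
    """Return the glyph names a toy renderer would draw for c1 + joiner + c2.

    font_glyphs is the set of glyph names a (hypothetical) font provides.
    """
    ligature = f"lig({c1}+{c2})"
    half = f"half({c1})"
    subjoined = f"sub({c2})"
    if ligature in font_glyphs:
        return [ligature]
    if half in font_glyphs:
        return [half, c2]
    if subjoined in font_glyphs:
        return [c1, subjoined]
    # No conjunct glyph of any kind: the joiner is shown as itself.
    # VIRAMA has a visible glyph of its own; ZWJ has none.
    if joiner == VIRAMA:
        return [c1, "virama", c2]
    return [c1, c2]


poor_font = set()                 # no conjunct glyphs at all
rich_font = {"lig(KA+TA)"}        # has the ligature

print(render("KA", VIRAMA, "TA", rich_font))  # ['lig(KA+TA)']
print(render("KA", ZWJ, "TA", rich_font))     # ['lig(KA+TA)']
print(render("KA", VIRAMA, "TA", poor_font))  # ['KA', 'virama', 'TA']
print(render("KA", ZWJ, "TA", poor_font))     # ['KA', 'TA']
```

With the rich font both encodings look the same. With the poor font only the <VIRAMA> encoding still shows that the first consonant has no vowel; the <ZWJ> encoding silently degrades to two full consonants.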

> Surely, [A/E][ZWJ][Y][ZWJ][AA] is more "natural" and
> intuitively acceptable than any encoding in which a
> vowel is followed by a [VIRAMA]?

Maybe. But I see no reason why being natural or intuitive should be seen as
a key feature for an encoding system. That might be the case for an encoding
system designed to be used by humans, but Unicode is designed to be used by
computers, so I don't see the problem.

I assume that in a well-designed Bengali input method, yaphala would be a
key on its own, so, from the user's point of view, it is just a
"character": they don't need to know that when they press that key the
sequence of codes <VIRAMA><YA> will actually be inserted, so they won't
notice the apparent nonsense of the sequence <vowel><VIRAMA> and, as we say
in Italy, "If eye doesn't see, heart doesn't hurt".
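
As an illustration, here is a minimal Python sketch of such a keymap. The key names (`"a"`, `"yaphala"`, `"aa-kar"`) and the function `type_keys` are hypothetical and do not describe any existing keyboard layout; the code points are the ones Unicode assigns to BENGALI LETTER A, SIGN VIRAMA, LETTER YA and VOWEL SIGN AA.

```python
# Hypothetical keymap: each key inserts one or more code points.
KEYMAP = {
    "a": "\u0985",              # BENGALI LETTER A
    "yaphala": "\u09CD\u09AF",  # <VIRAMA><YA>, inserted by a single key
    "aa-kar": "\u09BE",         # BENGALI VOWEL SIGN AA
}


def type_keys(keys):
    """Return the text produced by pressing the given keys in order."""
    return "".join(KEYMAP[key] for key in keys)


pressed = ["a", "yaphala", "aa-kar"]
text = type_keys(pressed)
codes = [f"U+{ord(ch):04X}" for ch in text]
print(len(pressed), codes)  # 3 ['U+0985', 'U+09CD', 'U+09AF', 'U+09BE']
```

The user presses three keys and sees three "characters"; the <vowel><VIRAMA> pair exists only in the stored sequence of four code points.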

_ Marco


