RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Gautam Sengupta
Date: Wed Oct 08 2003 - 13:38:26 CST

--- Marco Cimarosti wrote:
> Gautam Sengupta wrote:
> > I am no programmer, but surely the rendering engine
> > could be tweaked to display a halant/hashant in the
> > aforementioned situations? I understand that it won't
> > happen *automatically* if we were to use <ZWJ> instead
> > of <VIRAMA>. But if you were to take the trouble to do
> > the tweaking, you'd then have a completely *intuitive*
> > encoding for vowel yaphala sequences,
> > <vowel><ZWJ><Y>, instead of oddities like
> > <vowel><VIRAMA><Y>.
> OK but, then, your <ZWJ> becomes exactly what Unicode's
> <VIRAMA> has always been: a character that is normally
> invisible, because it merges into a ligature with
> adjacent characters, but occasionally becomes visible
> when a font does not have a glyph for that combination.

You are absolutely right. I am suggesting that the
existing script-specific viramas be retained as
*explicit* viramas that never disappear. In addition,
let's have a script-specific ZWJ which behaves in the
way you describe in the preceding paragraph. The
explicit virama (or rather, the ONLY virama) will never
appear after a vowel, but the script-specific ZWJ will,
as in <A><ZWJ><Y><AA> encoding A+YAPHALA+AA. The cost is
just one additional code point for each script. Note
that we will no longer need the combining vowels or an
additional code point for YAPHALA.
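
To make the comparison concrete, here is a minimal Python sketch that spells out the sequence in use today next to the proposed one. The current sequence uses real Bengali code points. The proposed sequence is kept symbolic, because the script-specific ZWJ does not exist in Unicode; the token "<BENGALI ZWJ>" is a hypothetical placeholder, not a real character name.

```python
import unicodedata

# Current practice for A + yaphala + AA: independent vowel A,
# virama, letter YA, vowel sign AA (all real Bengali code points).
current = "\u0985\u09CD\u09AF\u09BE"

# Proposed sequence, written with the notation used in this message.
# "<BENGALI ZWJ>" is a hypothetical placeholder for the suggested
# script-specific joiner; no such code point is assigned.
proposed = ["A", "<BENGALI ZWJ>", "Y", "AA"]


def describe(text):
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(current):
    print(code_point, name)
print(" ".join(proposed))
```

Both sequences have the same length; only the second element differs in what it is called and how it is expected to behave.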

> But there is one detail which makes your approach much
> more complicated: what we have been calling <VIRAMA> is
> *not* a single character. Every Indic script has its
> own: <DEVANAGARI SIGN VIRAMA>, and so on.
> Each one of these characters, when displayed visibly,
> has a distinct glyph: a Bangla hashant is a small "/"
> under the letter, a Tamil virama is a dot over the
> letter, etc.
> With your approach, the single character <ZWJ> is
> overloaded with a dozen different glyphs depending on
> which script the adjacent letters belong to. Plus, it
> still has to be invisible when used in a non-Indic
> script, such as Arabic.
> Implementing all this is certainly possible, but would
> result in bigger look-up tables, for no advantage at all.
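
The look-up table in question can be pictured with a short Python sketch. It assumes a renderer that picks a visible fallback glyph for a single shared joiner by inspecting the script of the neighbouring letter. The virama code points are real; the helper names `script_of` and `fallback_glyph_for_zwj`, and the trick of reading the script from the first word of the character name, are illustrative assumptions and not how any actual rendering engine works.

```python
import unicodedata

# Real per-script virama characters, keyed by script name.
VIRAMAS = {
    "Devanagari": "\u094D",
    "Bengali": "\u09CD",
    "Gurmukhi": "\u0A4D",
    "Gujarati": "\u0ACD",
    "Oriya": "\u0B4D",
    "Tamil": "\u0BCD",
    "Telugu": "\u0C4D",
    "Kannada": "\u0CCD",
    "Malayalam": "\u0D4D",
}


def script_of(ch):
    """Guess the script from the first word of the character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0].capitalize() if name else None


def fallback_glyph_for_zwj(neighbor):
    """Visible sign to draw for a shared joiner next to `neighbor`.

    Returns None when the neighbour is not in a listed Indic script,
    in which case the joiner stays invisible.
    """
    return VIRAMAS.get(script_of(neighbor))


print(fallback_glyph_for_zwj("\u0995") == "\u09CD")  # Bengali KA
print(fallback_glyph_for_zwj("\u0628"))              # Arabic BEH
```

With a separate virama per script, this table is implicit in the code point itself; with one shared joiner, every implementation has to carry it.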

See my previous paragraph.

> > Perhaps there isn't a *problem* as such, and perhaps
> > naturalness and intuitive acceptability aren't *key*
> > features of the system, but surely other factors being
> > equal they ought to be taken into consideration in
> > choosing one method of encoding over another?
> Yes. But the flaws that I see in the ISCII/Unicode model
> are much smaller than you imply. E.g., I agree that
> it would have been more logical if:
> - independent and dependent vowels were the same
>   characters;
> - each script was encoded in its natural alphabetical
>   order;
> - there were no precomposed and decomposed alternatives
>   for the same graphemes.
> And others, on which perhaps a linguist won't agree,
> but which would have made life much easier for
> programmers:
> - all vowels were encoded in visual order, so that
>   no vowel reordering was necessary;
> - "repha ra" were encoded as a separate character,
>   so that no reordering at all was necessary.
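
The vowel reordering mentioned in the last two points can be illustrated with a deliberately simplified Python sketch. Bengali text is stored in logical order (consonant, then vowel sign), but a few vowel signs are drawn to the left of the consonant. The sketch assumes a single consonant before the sign; a real shaping engine moves the sign in front of the whole consonant cluster and also handles two-part vowels and reph, none of which is attempted here. The function name `to_visual_order` is made up for this example.

```python
# Bengali vowel signs that are displayed to the left of the consonant:
# VOWEL SIGN I, VOWEL SIGN E, VOWEL SIGN AI.
PREBASE = {"\u09BF", "\u09C7", "\u09C8"}


def to_visual_order(text):
    """Move each pre-base vowel sign in front of the preceding character.

    Simplification: only the single character before the sign is
    considered, so conjuncts are not handled correctly.
    """
    out = []
    for ch in text:
        if ch in PREBASE and out:
            out.insert(len(out) - 1, ch)
        else:
            out.append(ch)
    return "".join(out)


logical = "\u0995\u09BF"  # KA + VOWEL SIGN I, as stored
visual = to_visual_order(logical)
print([f"U+{ord(c):04X}" for c in visual])
```

Had the vowels been encoded in visual order, this step would not be needed at all, which is the point being made above.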

I agree with you on all of these issues. You have in
fact summed up my critique of the ISCII/Unicode model.
The only point I'd like to add here is that these
mistakes were avoidable and should have been avoided.
There can be no excuses for placing the Assamese r and
v the way they are currently placed. The same goes for
the long syllabic R and L.

> But, all summed up, living with these little flaws is
> *much* simpler than trying to change the rules of a
> standard a dozen years after people started
> implementing it.

Take a second look. My suggestion amounts to:

1. retaining the script-specific virama as it is. Its
existing behavior remains unchanged. I rename it as
"(script-specific) ZWJ" merely for my convenience and
conceptual clarity.

2. extending the role of this script-specific ZWJ to
encode combining forms of vowels in CV sequences,
entirely in line with the way it is used to encode CC
sequences.

[1 and 2 may sound somewhat different from what I have
suggested above, but they are in effect the same].

3. introducing a script-specific explicit virama,
which we can very well afford after getting rid of all
the combining forms of vowels.

4. getting rid of *all* precomposed forms including
the recent innovations in Devanagari that are used
only for transliteration. These not only fill up the
code space of Devanagari but also put constraints on
the placement of characters in the code spaces of
other Indian scripts.

How much recoding would these changes involve? Would
the cost be really unacceptable?
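
One rough way to get a feel for the recoding involved is sketched below in Python, for Bengali only. It rests on one reading of points 1 and 2: the existing virama code point keeps its joining behaviour, and each dependent vowel sign is rewritten as that joiner followed by the corresponding independent vowel. That reading, and the function names `vowel_sign_map` and `recode`, are assumptions made for illustration. The new explicit virama of point 3 is not modelled, since it has no code point.

```python
import unicodedata

BENGALI_BLOCK = range(0x0980, 0x0A00)
VIRAMA = "\u09CD"  # existing BENGALI SIGN VIRAMA, reused as the joiner


def vowel_sign_map():
    """Map each Bengali vowel sign to its independent vowel letter.

    The pairing is derived from the character names, e.g.
    BENGALI VOWEL SIGN AA -> BENGALI LETTER AA.
    """
    mapping = {}
    for cp in BENGALI_BLOCK:
        name = unicodedata.name(chr(cp), "")
        if "VOWEL SIGN" in name:
            letter_name = name.replace("VOWEL SIGN", "LETTER")
            try:
                mapping[chr(cp)] = unicodedata.lookup(letter_name)
            except KeyError:
                pass  # no matching independent vowel in this database
    return mapping


def recode(text, mapping):
    """Rewrite vowel signs as joiner + independent vowel (assumed scheme)."""
    return "".join(VIRAMA + mapping[ch] if ch in mapping else ch for ch in text)


mapping = vowel_sign_map()
print(len(mapping), "vowel signs would need rewriting")
print([f"U+{ord(c):04X}" for c in recode("\u0995\u09BF", mapping)])
```

The mapping itself is mechanical; the larger cost would lie in existing data, fonts, and input methods that assume the current model.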

Best, Gautam


This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST