RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Unicode (public) (
Date: Thu Oct 09 2003 - 10:14:01 CST


        [Gautam]: Well, too bad. I guess we still have an obligation to
explore the extent of sub-optimal solutions that are being imposed upon
South-Asian scripts for the sake of *backward compatibility* or simply
because they are "fait accomplis". (See Peter Kirk's posting on this
issue). However, I am by no means suggesting that the fault lies with
the Unicode Consortium.

I'm a little confused by this statement. What would be the difference
between sticking with a suboptimal solution because it's a fait accompli
and sticking with it out of the need for backward compatibility? The
need for backward compatibility exists because the suboptimal solution
is a fait accompli. Or are you stating that backward compatibility is a
specious argument because the encoding is so broken nobody's actually
using it?

        [Gautam]: This is again the "fait accompli" argument. We need to
*know* whether adopting an alternative model WOULD HAVE BEEN PREFERABLE,
even if the option to do so is no longer available to us.

I don't understand. If the option to go to an alternative model is not
available, why is it important to know that the alternative model would
have been preferable?

        [Gautam]: I think there is a slight misunderstanding here. The
ZWJ I am proposing is script-specific (each script would have its own),
call it "ZWJ PRIME" or even "JWZ" (in order to avoid confusion with
ZWJ). It doesn't exist yet and hence has no semantics.

Okay. Maybe I'm dense, but this wasn't clear to me from your other
emails. You're not proposing that U+200D be used to join Indic
consonants together; you're basically arguing for virama-like
functionality that goes far enough beyond what the virama does that
you're not comfortable calling it a virama anymore.

         JWZ is a piece of formalism. Its meaning would be precisely
what we chose to assign to it. It behaves like the existing
(script-specific) VIRAMA's except that it also occurs between a
consonant and an independent vowel, forcing the latter to show up in its
combining form.

Aha! This is what I wasn't parsing out of your previous emails. It was
there, but I somehow didn't grok it. To summarize:

Tibetan deals with consonant clusters by encoding each of the consonants
twice: One series of codes is to be used for the first consonant in a
cluster, and the other series is to be used for the others. The Indian
scripts don't do this; they use a single series of codes for the
consonants and cause consonants to form clusters by adding a VIRAMA code
between them. But the Indian scripts still have two series of VOWELS
more or less analogous to the two series of consonants in Tibetan. When
you want a non-joining vowel, you use one series, and when you want a
joining vowel, you use the other.

You want to have one series of vowels and extend the virama model to
combining vowels. Thus, you'd represent KI as KA + VIRAMA + I; KA + I
would represent two syllables: KA-I. Since a real virama never does
this, you're using a different term ("JWZ" in your most recent message)
for the character that causes the joining to happen. You're not
proposing any difference in how consonants are treated, other than
having this new character serve the sticking-together function that the
VIRAMA now serves and changing the existing VIRAMA to always display
visibly.

Now do I understand you? Sorry for my earlier misunderstandings.
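
To make the two models concrete, here is a minimal Python sketch using
Bengali code points. The "proposed" sequence is only my reading of your
suggestion, not anything Unicode defines today, and the helper name
describe is made up for this illustration.

```python
import unicodedata

KA = "\u0995"        # BENGALI LETTER KA
VIRAMA = "\u09CD"    # BENGALI SIGN VIRAMA
LETTER_I = "\u0987"  # BENGALI LETTER I (independent vowel)
SIGN_I = "\u09BF"    # BENGALI VOWEL SIGN I (dependent vowel)


def describe(text):
    """List each code point in text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Current model: the syllable KI is KA followed by the dependent vowel sign.
current_ki = KA + SIGN_I

# Proposed model (hypothetical): KA + joiner + independent vowel.
# VIRAMA stands in here for the "JWZ" character, which does not exist.
proposed_ki = KA + VIRAMA + LETTER_I

# Two syllables KA-I: KA followed directly by the independent vowel.
ka_then_i = KA + LETTER_I

for label, seq in [("current KI", current_ki),
                   ("proposed KI", proposed_ki),
                   ("KA-I", ka_then_i)]:
    print(label, describe(seq))
```

The point of the comparison is that the proposed model needs only the
independent-vowel series, at the cost of one extra code point per
consonant-vowel syllable.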

        Now that we have freed up all those code points occupied by the
combining forms of vowels by introducing the VIRAMA with extended
function, let us introduce an explicit (always visible) VIRAMA. That's

As far as Unicode is concerned, you can't "free up" any code points.
Once a code point is assigned, it's always assigned. You can deprecate
code points, but that doesn't free them up to be reused; it only (with
luck) keeps people from continuing to use them.

It seems to me that a system could support the usage you want and the
old usage at the same time. I could be wrong, but I'm guessing that KA
+ VIRAMA + I isn't a sequence that makes any sense with current
implementations and isn't being used. It would be possible to extend
the meaning of the current VIRAMA to turn the independent vowels into
dependent vowels. Future use of the dependent-vowel code points could
be discouraged in favor of VIRAMA plus the independent-vowel code
points. Old documents would continue to work, but new documents could
use the model you're after. (You get the explicit virama the same way
you do now: VIRAMA + ZWNJ.) This solution would involve encoding no new
characters and no removal of existing characters, but just a change in
the semantics of the VIRAMA.
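
As a rough sketch of what that extended interpretation might look like,
the Python below folds VIRAMA + independent vowel into the existing
dependent-vowel code point, again using Bengali. The function name
to_current_model is my own invention, and the whole behavior is an
assumption about a possible future semantics, not something any current
implementation does.

```python
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

# Bengali independent vowel -> corresponding dependent vowel sign.
# LETTER A (U+0985) is omitted: the inherent vowel has no sign.
INDEPENDENT_TO_DEPENDENT = {
    "\u0986": "\u09BE",  # AA
    "\u0987": "\u09BF",  # I
    "\u0988": "\u09C0",  # II
    "\u0989": "\u09C1",  # U
    "\u098A": "\u09C2",  # UU
    "\u098B": "\u09C3",  # VOCALIC R
    "\u098F": "\u09C7",  # E
    "\u0990": "\u09C8",  # AI
    "\u0993": "\u09CB",  # O
    "\u0994": "\u09CC",  # AU
}


def to_current_model(text):
    """Rewrite VIRAMA + independent vowel as the dependent vowel sign.

    Everything else passes through unchanged, including consonant
    conjuncts (C + VIRAMA + C) and the explicit virama (VIRAMA + ZWNJ).
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == VIRAMA and nxt in INDEPENDENT_TO_DEPENDENT:
            out.append(INDEPENDENT_TO_DEPENDENT[nxt])
            i += 2  # consume both the VIRAMA and the vowel
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Old documents contain no VIRAMA + independent vowel sequences to
rewrite, so they would pass through such a function untouched, which is
the backward-compatibility property I'm after.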

That said, I'm not sure this is a good idea. If what you're really
concerned about is typing and editing of text, you can have that work
the way you want without changing the underlying encoding model. It
involves somewhat more complicated keyboard handling, but I'm pretty
sure all the major operating systems allow this. The basic idea is that
you have one set of vowel keys that normally generate the
independent-vowel code points, but if one of them is preceded by the
VIRAMA key, the two keystrokes map to a single character: the
dependent-vowel code point. This is a simple solution that can be
implemented today with very little fuss and involves no changes to
Unicode or to the various fonts and rendering engines that would be
required if the VIRAMA code point took on a new meaning. From a user's
point of view, things work the way they're supposed to, and they work
that way sooner than if Unicode is changed. Only programmers have to
worry about the actual encoding details, and unless keeping the existing
model makes THEIR jobs significantly harder, the encoding itself
shouldn't change.
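
Here is a small Python sketch of the keyboard behavior I have in mind.
It is not tied to any real operating-system input API; the class name
KeyboardSketch and the three-vowel key table are assumptions made up
for illustration, using Bengali code points.

```python
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA

# Illustrative subset: vowel key (independent vowel) -> dependent sign.
VOWEL_KEYS = {
    "\u0986": "\u09BE",  # AA
    "\u0987": "\u09BF",  # I
    "\u0989": "\u09C1",  # U
}


class KeyboardSketch:
    """Toy input handler: VIRAMA key + vowel key yields one dependent vowel."""

    def __init__(self):
        self.buffer = []             # code points emitted so far
        self.virama_pending = False  # True only right after the VIRAMA key

    def press(self, key):
        """Handle one keystroke and return the text typed so far."""
        if self.virama_pending and key in VOWEL_KEYS:
            # Replace the VIRAMA just typed with the dependent vowel sign.
            self.buffer[-1] = VOWEL_KEYS[key]
        else:
            self.buffer.append(key)
        self.virama_pending = (key == VIRAMA)
        return "".join(self.buffer)


kb = KeyboardSketch()
for key in ["\u0995", VIRAMA, "\u0987"]:  # KA key, VIRAMA key, I key
    typed = kb.press(key)
```

After those three keystrokes the buffer holds KA + VOWEL SIGN I, the
same two code points the current encoding model already uses, so fonts
and rendering engines see nothing new.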

I hope this makes sense...
--Rich Gillam
  Language Analysis Systems, Inc.
