RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Gautam Sengupta (
Date: Thu Oct 09 2003 - 00:07:23 CST

--- "Unicode (public)" <> wrote:
> Two of the most basic Unicode stability policies dictate that character
> assignments, once made, are never removed and character names can never
> change. Step 4 cannot happen; the best that can happen is that the code
> points in question can be deprecated. The renaming you suggest in 1
> cannot happen either.
[Gautam]: Well, too bad. I guess we still have an obligation to explore the extent of the sub-optimal solutions that are being imposed upon South-Asian scripts for the sake of *backward compatibility* or simply because they are "faits accomplis". (See Peter Kirk's posting on this issue.) However, I am by no means suggesting that the fault lies with the Unicode Consortium.

> The change in the encoding model for the virama can't happen either;
> there are too many implementations based on it, and there are too many
> documents out there that use the current encoding model. Your
> suggestion wouldn't make them unreadable when opened with software that
> did things the way you're suggesting, but it would change their
> appearance in ways that are unlikely to be acceptable.
[Gautam]: This is again the "fait accompli" argument. We need to *know* whether adopting an alternative model WOULD HAVE BEEN PREFERABLE, even if the option to do so is no longer available to us. The model I am proposing is precisely the one that has been in use for centuries in the Indian grammatical tradition (/ki/ = k+virama+i). I don't think there are too many South-Asian documents out there encoded in Unicode. At any rate, converting them would be a rather simple matter of searching for combining forms of vowels and replacing them with the [VIRAMA][VOWEL] sequence. The TDIL corpora are very small by current standards, and they require extensive reworking anyway.
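
The search-and-replace conversion described above can be sketched roughly as follows. This is a minimal illustration for Bengali only, assuming Python's standard unicodedata module; the names SIGN_TO_SEQUENCE and to_virama_vowel are made up for this sketch, and the output follows the hypothetical [VIRAMA][VOWEL] model, not the way current Unicode implementations interpret such sequences.

```python
import unicodedata

VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA

def build_sign_table():
    """Map each Bengali dependent vowel sign to VIRAMA + independent vowel."""
    table = {}
    prefix = "BENGALI VOWEL SIGN "
    for cp in range(0x0980, 0x0A00):  # Bengali block
        name = unicodedata.name(chr(cp), "")
        if not name.startswith(prefix):
            continue
        try:
            # e.g. "BENGALI VOWEL SIGN I" -> "BENGALI LETTER I"
            letter = unicodedata.lookup("BENGALI LETTER " + name[len(prefix):])
        except KeyError:
            continue  # no matching independent vowel; leave the sign alone
        table[cp] = VIRAMA + letter
    return table

SIGN_TO_SEQUENCE = build_sign_table()

def to_virama_vowel(text):
    """Rewrite combining vowel forms as [VIRAMA][VOWEL] (hypothetical model)."""
    return text.translate(SIGN_TO_SEQUENCE)

# /ki/ in the current encoding: KA + VOWEL SIGN I
ki = "\u0995\u09BF"
print([f"U+{ord(c):04X}" for c in to_virama_vowel(ki)])
```

Under these assumptions, /ki/ comes out as KA + VIRAMA + I, i.e. the k+virama+i spelling mentioned above. Text without combining vowel forms passes through unchanged.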

> [I preface what follows with the observation that I'm not by any stretch
> of the imagination an expert on Indic scripts, but I do fancy myself an
> expert on Unicode.]
> I'm also pretty sure that using ZWJ as a virama won't work and isn't
> intended to work. KA + ZWJ + KA means something totally different from
> KA + VIRAMA + KA, and I, for one, wouldn't expect them to be drawn the
> same. U+0915 represents the letter KA with its inherent vowel sound;
> that is, it represents the whole syllable KA. Two instances of U+0915
> in a row would thus represent "KAKA", completely irrespective of how
> they're drawn. Introducing a ZWJ in the middle would allow the two
> SYLLABLES to ligate, but there's no ligature that represents "KAKA", so
> you should get the same appearance as you do without the ZWJ. The
> virama, on the other hand, cancels the vowel sound on the KA, turning it
> into K: The sequence KA + VIRAMA + KA represents the syllable KKA, again
> irrespective of how it is drawn.
> In other words, ZWJ is intended to change the APPEARANCE of a piece of
> text without changing its MEANING (there are exceptions in the Arabic
> script, but this is the general rule). Having KA + ZWJ + KA render as
> the syllable KKA would break this rule: the ZWJ would be changing the
> MEANING of the text.
> Whether the syllable KKA gets drawn with a virama, a half-form, or a
> ligature is the proper province of ZWJ and ZWNJ, and this is what
> they're documented in TUS to do. But ZWJ can't (and shouldn't) be used
> to turn KAKA into KKA.
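
To make the distinction in the quoted passage concrete, here is a minimal Python sketch using the standard unicodedata module. It only inspects character properties. The helper names (describe, drop_format_controls) are made up for illustration, and dropping format controls is just a crude stand-in for "ignoring appearance-only characters", not an operation defined by Unicode.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

def describe(text):
    """List (code point, name, general category) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
            for ch in text]

def drop_format_controls(text):
    """Hypothetical helper: remove characters of general category Cf."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

kaka = KA + KA
ka_zwj_ka = KA + ZWJ + KA
kka = KA + VIRAMA + KA

for label, seq in [("KA KA", kaka), ("KA ZWJ KA", ka_zwj_ka), ("KA VIRAMA KA", kka)]:
    print(label, describe(seq))
```

Under these assumptions, drop_format_controls(ka_zwj_ka) equals kaka, while kka keeps its virama. That is one way to see that ZWJ is a format control (category Cf) whereas the virama is a combining mark (category Mn) that is part of the text's content.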
[Gautam]: I think there is a slight misunderstanding here. The ZWJ I am proposing is script-specific (each script would have its own); call it "ZWJ PRIME" or even "JWZ" (in order to avoid confusion with ZWJ). It doesn't exist yet and hence has no semantics. JWZ is a piece of formalism. Its meaning would be precisely what we choose to assign to it. It behaves like the existing (script-specific) VIRAMAs except that it also occurs between a consonant and an independent vowel, forcing the latter to show up in its combining form. In this respect, it is in fact *closer* or *more faithful* to the classical VIRAMA model. Call it VIRAMA if you will. The only reason why I don't wish to call it "VIRAMA" is that I plan to use it after a vowel as well, as in: <A><JWZ><Y><JWZ><AA> encoding A+YOPHOLA+AA. If YOPHOLA is assigned an independent code point then this move would be unnecessary and my JWZ would just be the usual VIRAMA with an extended function that would, in fact, make it more compliant with the classical VIRAMA model.
Now that we have freed up all those code points occupied by the combining forms of vowels by introducing the VIRAMA with extended function, let us introduce an explicit (always visible) VIRAMA. That's all.
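
The behaviour proposed for JWZ can be sketched as a small decoder that maps JWZ-model text back onto today's Bengali encoding. Everything here is an assumption made for illustration: JWZ does not exist, so the private-use code point U+E000 stands in for it, and the function names are invented. The sketch does not model the explicit (always visible) VIRAMA.

```python
import unicodedata

JWZ = "\uE000"     # hypothetical: private-use placeholder for the proposed JWZ
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA

def vowel_sign_for(ch):
    """Return the combining form of an independent Bengali vowel, else None."""
    prefix = "BENGALI LETTER "
    name = unicodedata.name(ch, "")
    if not name.startswith(prefix):
        return None
    try:
        return unicodedata.lookup("BENGALI VOWEL SIGN " + name[len(prefix):])
    except KeyError:
        return None  # consonants, and the vowel A, have no combining form

def jwz_to_current(text):
    """Decode JWZ-model text into the current virama-plus-vowel-sign encoding."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == JWZ:
            sign = vowel_sign_for(text[i + 1]) if i + 1 < len(text) else None
            if sign is not None:
                out.append(sign)  # JWZ + independent vowel -> combining form
                i += 2
                continue
            out.append(VIRAMA)    # otherwise JWZ acts like the usual VIRAMA
        else:
            out.append(ch)
        i += 1
    return "".join(out)

# <A><JWZ><Y><JWZ><AA> from the text above
a_yophola_aa = "\u0985" + JWZ + "\u09AF" + JWZ + "\u0986"
print([f"U+{ord(c):04X}" for c in jwz_to_current(a_yophola_aa)])
```

With these assumptions, KA + JWZ + I decodes to KA + VOWEL SIGN I, KA + JWZ + KA decodes to KA + VIRAMA + KA, and the A+YOPHOLA+AA example decodes to A + VIRAMA + YA + VOWEL SIGN AA.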

> Maybe it was unfortunate to call U+094D a "virama," since it doesn't
> necessarily get drawn as a virama (or, indeed, as anything), but it's
> too late to revisit that decision.
[Gautam]: No, the decision is not unfortunate because of that, but rather because U+094D doesn't behave like a virama in all respects; hence my proposal to extend its functions.
> For that matter, it may have been a mistake to use the virama model to encode
> conjunct forms in Bengali, ...
[Gautam]: Not really. But once adopted, the model should have been implemented in full, eliminating the need for combining forms of vowels. Thanks a lot, Rich.

This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST