RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Gautam Sengupta (gsghyd@yahoo.com)
Date: Thu Oct 09 2003 - 22:22:22 CST


--- "Unicode (public)" <Unicode-mail@las-inc.com> wrote:
> Gautam--
>
> ...
>
> I don't understand. If the option to go to an alternative model is not available, why is it
> important to know that the alternative model would have been preferable?
 
[Gautam]: Just for the sake of knowing, I guess. "... ripeness is all".

> [Gautam]: I think there is a slight misunderstanding here. The ZWJ I am proposing is
> script-specific (each script would have its own), call it "ZWJ PRIME" or even "JWZ"
> (in order to avoid confusion with ZWJ). It doesn't exist yet and hence has no
> semantics.
>
> Okay. Maybe I'm dense, but this wasn't clear to me from your other emails.
 
[Gautam]: Heavens, no! It must be my non-native English that's creating all these communication gaps.
 
> You're not proposing that U+200D be used to join Indic consonants together; you're
> basically arguing for virama-like functionality that goes far enough beyond what the
> virama does that you're not comfortable calling it a virama anymore.
 
[Gautam]: Indeed. You got it just right. Let us introduce the term "Ind VIRAMA" to refer to the virama used in Sanskrit and other Indic languages, and "Uni VIRAMA" to refer to the virama in Unicode. The two are *not* identical. Uni VIRAMA lacks the full functionality of Ind VIRAMA. I am proposing two extensions to Uni VIRAMA:
 
1. extension of its functionality to allow cons+combining vowel to be encoded as <Cons><VIRAMA><full Vowel>, and
 
2. extension of its functionality further to allow vowel+yophola to be encoded as <Vowel><VIRAMA><full Y>
 
(1) merely confers on Uni VIRAMA the full functionality of Ind VIRAMA, making the two functionally identical.
 
(2) is a hack, a crude ad hoc solution to the problem of how to encode Bangla vowel+yophola sequences. It is THIS latter extension that would make Uni VIRAMA un-VIRAMA-like, and hence my discomfort with the name "VIRAMA". But (2) can be avoided if we can find some other solution to the YOPHOLA problem, such as assigning a code point to YOPHOLA in addition to the one already assigned to Y. And this (that is, the addition of a distinct YOPHOLA to the code chart), by the way, would also disambiguate <R><Y> sequences in Bangla. (See Paul Nelson, "Bengali Script: Formation of the Reph and use of the ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER"). I now feel that it is better to avoid extension 2 for the sake of keeping the model clean. Let us say we find some other acceptable solution to the problems raised by combinations involving YOPHOLA.
 
> To summarize:
>
> Tibetan deals with consonant clusters by encoding each of the consonants twice: One
> series of codes is to be used for the first consonant in a cluster, and the other series is
> to be used for the others. The Indian scripts don't do this; they use a single series
> of codes for the consonants and cause consonants to form clusters by adding a
> VIRAMA code between them. But the Indian scripts still have two series of VOWELS
> more or less analogous to the two series of consonants in Tibetan. When you want a
> non-joining vowel, you use one series, and when you want a joining vowel, you use the
> other.
 
[Gautam]: In Unicode, Indic CV and CC sequences are treated differently. Unicode uses the VIRAMA model for CC clusters, but the Tibetan model for CVs. I am suggesting the use of the VIRAMA model for BOTH.

> You want to have one series of vowels and extend the virama model to combining
> vowels. Thus, you'd represent KI as KA + VIRAMA + I; KA + I would represent two
> syllables: KA-I.
 
[Gautam]: Yes.
 
> Since a real virama never does this, you're using a different term ("JWZ" in your most
> recent message) for the character that causes the joining to happen.
 
[Gautam]: No, the *real* Ind VIRAMA does exactly this. Hence with this extension only (that is, as long as extension 2 is not implemented) I feel no compulsion to rename VIRAMA.

> You're not proposing any difference in how consonants are treated, other than having
> this new character serve the sticking-together function that the VIRAMA now serves
> and changing the existing VIRAMA to always display explicitly.

> Now do I understand you? Sorry for my earlier misunderstandings.
 
[Gautam]: Yes, but note the clarifications provided in the preceding paragraphs.

> Now that we have freed up all those code points occupied by the combining forms of
> vowels by introducing the VIRAMA with extended function, let us introduce an explicit
> (always visible) VIRAMA. That's all.
>
> As far as Unicode is concerned, you can't "free up" any code points. Once a code
> point is assigned, it's always assigned. You can deprecate code points, but that
> doesn't free them up to be reused; it only (with luck) keeps people from continuing to
> use them.
 
[Gautam]: This is just too bad.

> It seems to me that a system could support the usage you want and the old usage at
> the same time. I could be wrong, but I'm guessing that KA + VIRAMA + I isn't a
> sequence that makes any sense with current implementations and isn't being used. It
> would be possible to extend the meaning of the current VIRAMA to turn the
> independent vowels into dependent vowels. Future use of the dependent-vowel code
> points could be discouraged in favor of VIRAMA plus the independent-vowel code
> points. Old documents would continue to work, but new documents could use the
> model you're after. (You get the explicit virama the same way you do now: VIRAMA +
> ZWNJ.) This solution would involve encoding no new characters and no removal of
> existing characters, but just a change in the semantics of the VIRAMA.
 
[Gautam]: That sounds good. I would prefer an independent code point for the explicit VIRAMA, but on second thought VIRAMA+ZWNJ is not too bad either.
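
As a rough illustration of the migration path suggested in the quoted text, here is a minimal Python sketch that rewrites text in the current encoding (dependent vowel signs) into the extended-VIRAMA form (VIRAMA + independent vowel). The code points are those of the Bengali block; the table name and the function to_virama_model are hypothetical, and the sketch assumes a simple one-to-one pairing of dependent and independent vowels.

```python
# Hypothetical sketch: rewrite dependent vowel signs as VIRAMA + independent
# vowel, following the extended-VIRAMA model discussed in this thread.
# Code points are from the Bengali block; all names here are illustrative.

VIRAMA = "\u09CD"

# dependent vowel sign -> independent vowel letter
DEPENDENT_TO_INDEPENDENT = {
    "\u09BE": "\u0986",  # AA
    "\u09BF": "\u0987",  # I
    "\u09C0": "\u0988",  # II
    "\u09C1": "\u0989",  # U
    "\u09C2": "\u098A",  # UU
    "\u09C3": "\u098B",  # VOCALIC R
    "\u09C7": "\u098F",  # E
    "\u09C8": "\u0990",  # AI
    "\u09CB": "\u0993",  # O
    "\u09CC": "\u0994",  # AU
}


def to_virama_model(text: str) -> str:
    """Replace each dependent vowel sign with VIRAMA + independent vowel."""
    out = []
    for ch in text:
        independent = DEPENDENT_TO_INDEPENDENT.get(ch)
        if independent is None:
            out.append(ch)  # consonants, full vowels, VIRAMA, ZWNJ: unchanged
        else:
            out.append(VIRAMA + independent)
    return "".join(out)
```

Under this sketch, KA + vowel sign I becomes KA + VIRAMA + I, while KA + I (two syllables) and VIRAMA + ZWNJ (explicit virama) pass through untouched.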
>
> That said, I'm not sure this is a good idea.
 
Here comes the punch line!
 
> If what you're really concerned about is typing and editing of text,
 
[Gautam]: No, that's certainly not my primary concern.
 
> you can have that work the way you want without changing the underlying encoding
> model. It involves somewhat more complicated keyboard handling, but I'm pretty
> sure all the major operating systems allow this. The basic idea is that you have one
> set of vowel keys that normally generate the independent-vowel code points, but if one
> of them is preceded by the VIRAMA key, the two keystrokes map to a single
> character: the dependent-vowel code point. This is a simple solution that can be
> implemented today with very little fuss and involves no changes to Unicode or to the
> various fonts and rendering engines that would be required if the VIRAMA code point
> took on a new meaning. From a user's point of view, things work the way they're
> supposed to, and they work that way sooner than if Unicode is changed.
 
[Gautam]: I have been aware of this solution all along since my corpus and language related work often involves keyboard remapping. This solution was also highlighted by Marco Cimarosti in a recent posting on this list. (Marco, I hope you are reading this). But that is NOT what I am after.
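
For reference, the keyboard-level approach described in the quoted text could be sketched roughly as follows in Python. This is only an illustration: the function keystrokes_to_text and the mapping table are hypothetical, and a real input method would be written against the operating system's keyboard-layout facilities rather than as a plain function over a list of keys.

```python
# Hypothetical sketch of the keyboard handling described above: vowel keys
# normally produce independent vowels, but VIRAMA followed by a vowel key
# collapses into the single dependent-vowel code point.
# Code points are from the Bengali block; all names here are illustrative.

VIRAMA = "\u09CD"

# independent vowel letter -> dependent vowel sign
INDEPENDENT_TO_DEPENDENT = {
    "\u0986": "\u09BE",  # AA
    "\u0987": "\u09BF",  # I
    "\u0988": "\u09C0",  # II
    "\u0989": "\u09C1",  # U
    "\u098A": "\u09C2",  # UU
    "\u098B": "\u09C3",  # VOCALIC R
    "\u098F": "\u09C7",  # E
    "\u0990": "\u09C8",  # AI
    "\u0993": "\u09CB",  # O
    "\u0994": "\u09CC",  # AU
}


def keystrokes_to_text(keys):
    """Turn a sequence of key outputs into text in the current encoding."""
    out = []
    for key in keys:
        if out and out[-1] == VIRAMA and key in INDEPENDENT_TO_DEPENDENT:
            # VIRAMA + vowel key: replace the pending VIRAMA with the sign
            out[-1] = INDEPENDENT_TO_DEPENDENT[key]
        else:
            out.append(key)
    return "".join(out)
```

Typing KA, VIRAMA, I would then store KA + vowel sign I, whereas KA, VIRAMA, RA is left alone as an ordinary conjunct. The stored text remains in the existing encoding, which is exactly why this does not address the processing concern raised here.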
 
> Only programmers have to worry about the actual encoding details, and unless
> keeping the existing model makes THEIR jobs significantly harder, the encoding itself
> shouldn't change.
 
Yes, but that covers not just the programmers concerned with how a Unicode text should be encoded, but also those who will have to process these texts for various purposes. Let us first introduce a small notational convention and then consider a rather minor example.
 
Let the lowercase vowels henceforth denote *combining* vowels. In Bangla K+R+i and J+aa+I mean "I do" and "I go" respectively. Given these two forms as input, a morphological analyzer should ideally yield the following analyses: KRi = KR<VIRAMA> + I, JaaI = Jaa + I. (I am assuming orthographic - not phonemic/phonetic - input-output). In other words, the analyzer would have to insert an explicit virama after KR and somehow recognize the final <i> in KRi as <I>.
 
Now let's consider the same pair of inputs in *my* representation. They would be K+R+VIRAMA+I and J+VIRAMA+AA+I. All that the morphological analyzer would have to do is chop off the rightmost <I>. The leftovers are exactly what we need: K+R+VIRAMA and J+VIRAMA+AA. Isn't it amazing how evidence from diverse fields of inquiry seems to converge on the *correct* solution?
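
The contrast between the two analyses can be sketched in a few lines of Python. This is a toy illustration only: tokens follow the notation of this message (uppercase for consonants and full vowels, lowercase for combining vowels), the function names are hypothetical, and only the suffix <I> from the example is handled.

```python
# Toy sketch of stripping the suffix <I> under the two representations.
# Tokens use this message's notation; function names are illustrative.


def split_current(tokens):
    """Current model: the suffix may surface as full I or combining i."""
    *stem, last = tokens
    if last == "I":
        return stem, "I"
    if last == "i":
        # The analyzer must recognize i as I and insert an explicit VIRAMA.
        return stem + ["VIRAMA"], "I"
    raise ValueError("no final I/i suffix")


def split_proposed(tokens):
    """Proposed model: the suffix is always the full vowel I; just chop it."""
    *stem, last = tokens
    if last != "I":
        raise ValueError("no final I suffix")
    return stem, "I"
```

With these definitions, split_current(["K", "R", "i"]) and split_proposed(["K", "R", "VIRAMA", "I"]) both yield the stem K+R+VIRAMA, but only the first needs the extra repair step.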
>
> I hope this makes sense...
 
-Gautam
 



