Re: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Gautam Sengupta
Date: Sat Oct 11 2003 - 06:37:39 CST

--- Peter Kirk wrote:
> On 09/10/2003 21:22, Gautam Sengupta wrote:
> > ...
> >
> > Yes, but not just programmers who are concerned with how a Unicode
> > text should be encoded, but also those who are going to have to
> > process these texts for various purposes. Let us first introduce a
> > small notational convention and then consider a rather minor example.
> >
> > Let the lowercase vowels henceforth denote *combining* vowels. In
> > Bangla K+R+i and J+aa+I mean "I do" and "I go" respectively. Given
> > these two forms as input, a morphological analyzer should ideally
> > yield the following analyses: KRi = KR<VIRAMA> + I, JaaI = Jaa + I. (I
> > am assuming orthographic - not phonemic/phonetic - input-output). In
> > other words, the analyzer would have to insert an explicit virama
> > after KR and somehow recognize the final <i> in KRi as <I>.
> >
> > Now let's consider the same pair of inputs in *my* representation.
> > They would be K+R+VIRAMA+I and J+VIRAMA+AA+I. All that the
> > morphological analyzer would have to do is chop off the rightmost <I>.
> > The leftovers are exactly what we need: K+R+VIRAMA and J+VIRAMA+AA.
> > Isn't it amazing how evidence from diverse fields of inquiry seems to
> > converge on the *correct* solution?
> > >
> > > I hope this makes sense...
> >
> > -Gautam
> >
> It would surely be trivial for any morphological analyser to understand
> i as a ligature or contraction of <VIRAMA, I>, split it into the
> sequence, and then analyse the version with the sequence. Any
> morphological analyser is going to have to deal with ligatures and
> contractions. It could be programmed as a morphophonemic contraction,
> even if that is not technically linguistically correct.

[Gautam]: I did hedge my claim by saying that I was
going to cite a rather minor example. But why would I
want to do this extra bit of computing - however
trivial - when I could have avoided it by adopting a
more "appropriate" encoding in the first place? After
all, what I am suggesting is that the VIRAMA model,
once adopted, ought to have been implemented in full.
Is there any particular reason why it should be
adopted for CC but not for CV sequences?
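
The difference in "extra computing" between the two encodings can be
sketched in Python. This is only an illustration under stated
assumptions: characters are symbolic tokens as written in this thread
(not real Bengali code points), the function names and the INDEPENDENT
table are hypothetical, and only the single suffix <I> is handled.

```python
VIRAMA = "VIRAMA"

# Hypothetical table: combining vowel token -> independent vowel token.
INDEPENDENT = {"i": "I", "aa": "AA"}


def split_suffix_current(tokens, suffix="I"):
    """Existing encoding: the suffix may surface as a combining vowel."""
    if not tokens:
        return None
    last = tokens[-1]
    if last == suffix:
        # Suffix is already an independent vowel: just chop it off.
        return tokens[:-1], suffix
    if INDEPENDENT.get(last) == suffix:
        # Suffix surfaced as a combining vowel: restore the independent
        # form and insert an explicit virama on the stem.
        return tokens[:-1] + [VIRAMA], suffix
    return None


def split_suffix_proposed(tokens, suffix="I"):
    """Proposed encoding: the suffix is always the independent vowel."""
    if tokens and tokens[-1] == suffix:
        return tokens[:-1], suffix
    return None


print(split_suffix_current(["K", "R", "i"]))
print(split_suffix_current(["J", "aa", "I"]))
print(split_suffix_proposed(["K", "R", "VIRAMA", "I"]))
print(split_suffix_proposed(["J", "VIRAMA", "AA", "I"]))
```

In the first function the extra branch is the step Peter calls
trivial; in the second there is no such branch to write.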

Encoding /ki/ as <K><i> (using lowercase vowels to
denote combining forms and letters within slashes to
denote phonemes rather than characters) is also
semantically inappropriate. <K> stands for /ka/ not
/k/, and <i> being a combining form of <I> simply
stands for /i/. So <K><i> should stand for /kai/
rather than /ki/ unless a VIRAMA is inserted between
the <K> and the <i> to remove the default inherent
vowel /a/ from <K>.
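
The semantic point can also be made concrete with a small Python
sketch of a strictly compositional reading, in which every consonant
letter carries the inherent /a/, VIRAMA deletes it, and a vowel simply
contributes its own value. The token names, the tables and the function
name are assumptions made for this illustration; it models the argument
above and is not a rendering or transliteration algorithm.

```python
INHERENT = "a"

# Hypothetical tables for a handful of symbolic tokens.
CONSONANTS = {"K": "k", "R": "r", "J": "j"}
# Combining forms (lowercase) carry the same vowel value as the
# independent forms (uppercase).
VOWELS = {"I": "i", "i": "i", "AA": "aa", "aa": "aa"}


def read_compositionally(tokens):
    """Return the phoneme string implied by reading each token literally."""
    phonemes = []
    inherent_pending = False  # True right after a consonant letter
    for tok in tokens:
        if tok in CONSONANTS:
            # A consonant letter stands for consonant + inherent vowel.
            phonemes.extend([CONSONANTS[tok], INHERENT])
            inherent_pending = True
        elif tok == "VIRAMA":
            # VIRAMA removes the inherent vowel of the preceding consonant.
            if inherent_pending:
                phonemes.pop()
            inherent_pending = False
        elif tok in VOWELS:
            phonemes.append(VOWELS[tok])
            inherent_pending = False
        else:
            raise ValueError(f"unknown token: {tok}")
    return "".join(phonemes)


print(read_compositionally(["K", "i"]))            # kai
print(read_compositionally(["K", "VIRAMA", "I"]))  # ki
```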

I hope this makes sense. Best, Gautam.


This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST