Re: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Gautam Sengupta (gsghyd@yahoo.com)
Date: Wed Oct 08 2003 - 01:56:29 CST

Next message: Gautam Sengupta: "Re: Bangla: [ZWJ], [VIRAMA] and CV sequences"
Previous message: Gautam Sengupta: "Re: Bangla: [ZWJ], [VIRAMA] and CV sequences"
In reply to: Deepayan Sarkar: "Re: Bangla: [ZWJ], [VIRAMA] and CV sequences"
Next in thread: Gautam Sengupta: "Re: Bangla: [ZWJ], [VIRAMA] and CV sequences"

--- Deepayan Sarkar <deepayan@stat.wisc.edu> wrote:
> On Tuesday 07 October 2003 21:44, Gautam Sengupta wrote:
> > > I don't know what the original motivations were, but one thing about the
> > > current (ISCII-based) encoding scheme that appeals to me is that on average
> > > it requires fewer characters than other more natural schemes. Bangla has a
> > > high percentage of 'vowel signs', each of which would require two characters
> > > in your scheme as opposed to one in the current one.
> >
> > There is a trade-off here between file size and the number of code points
> > used. File size could be further reduced, for example, if combining forms of
> > consonants were introduced. But that would be a step in the wrong direction
> > for various reasons that I will not discuss here. I am not sure that the
> > right thing to do is to economize on file size rather than code points.
>
> That's a matter of opinion, and as I said, I don't know the motivations of the
> original designers.

No, it's also a matter of uniformity and elegance. If consonant clusters are [C][ZWJ or VIRAMA][C], there's no reason why CV clusters shouldn't be treated the same way.
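
To make the size trade-off concrete, here is a minimal Python sketch that rewrites each Bengali vowel sign as [ZWJ] plus the corresponding independent vowel letter and compares code point counts. The function name `to_uniform_scheme` and the mapping table are illustrative assumptions; they are one reading of the proposed [C][ZWJ][V] scheme, not an existing standard or library.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Bengali dependent vowel signs mapped to their independent vowel letters.
SIGN_TO_VOWEL = {
    "\u09be": "\u0986",  # AA
    "\u09bf": "\u0987",  # I
    "\u09c0": "\u0988",  # II
    "\u09c1": "\u0989",  # U
    "\u09c2": "\u098a",  # UU
    "\u09c3": "\u098b",  # VOCALIC R
    "\u09c7": "\u098f",  # E
    "\u09c8": "\u0990",  # AI
    "\u09cb": "\u0993",  # O
    "\u09cc": "\u0994",  # AU
}

def to_uniform_scheme(text):
    """Hypothetical converter: vowel sign -> ZWJ + independent vowel."""
    return "".join(
        ZWJ + SIGN_TO_VOWEL[ch] if ch in SIGN_TO_VOWEL else ch
        for ch in text
    )

# KA + AA-sign + KA + I-sign in the current encoding.
word = "\u0995\u09be\u0995\u09bf"
alt = to_uniform_scheme(word)

print(len(word), len(alt))  # 4 code points versus 6
```

Each CV syllable costs one extra code point under this reading, which is the file-size cost Deepayan points to; what is gained is that CV and CC clusters share one pattern.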

> In any case, I wouldn't dwell too much on this for 2 reasons. First, it's very unlikely
> that you are going to be able to influence people enough to induce changes at such a
> fundamental level (especially at this late stage when there are already fully
> functional rendering implementations based on the current scheme).

I agree with you on this. But we have an obligation to explore and figure out the alternatives that would have been better for Bangla and other Indian scripts had they been proposed and accepted, if only for the sake of a better understanding of our scripts.

> Second, why does it matter what one particular encoding scheme does?
> If you think something is better, use it, along with some mechanism for
> converting from unicode to your scheme and vice versa.

That sounds like a very ad hoc solution. We should be able to do better than that.

> Of course, this assumes that it is possible to represent all reasonable features of
> Bengali in Unicode, which it should be. If you think there's something that's not
> possible, I believe there's a formal mechanism via which you can submit
> requests/proposals to the Unicode consortium.

It's not just a matter of whether all the relevant features of a script can be encoded using a particular mechanism. If there is more than one such encoding, we have to choose between them, and our choice will have to be guided by considerations of economy, uniformity and elegance. In the present scheme there is no elegant solution to the problem of encoding [Vowel-A][J-PHOLA][AA-kar]. You have to do something ad hoc. In my scheme you'd do exactly what you do elsewhere, namely, [Vowel-A][ZWJ][Y][ZWJ][AA].
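
As a rough illustration, the proposed sequence can be spelled out code point by code point in Python next to one candidate Unicode sequence. The Unicode-side sequence used here (A, VIRAMA, YA, vowel sign AA) is an assumption about which equivalent is meant; the message does not spell it out. The helper `describe` is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Proposed: [Vowel-A][ZWJ][Y][ZWJ][AA], using independent letters throughout.
proposed = "\u0985" + ZWJ + "\u09af" + ZWJ + "\u0986"

# Assumed Unicode counterpart: A, VIRAMA, YA, VOWEL SIGN AA.
unicode_seq = "\u0985\u09cd\u09af\u09be"

def describe(s):
    """Return the Unicode character name of every code point in s."""
    return [unicodedata.name(ch) for ch in s]

for label, seq in (("proposed", proposed), ("unicode", unicode_seq)):
    print(label, len(seq), describe(seq))
```

The proposed form uses the same joiner for both the consonant and the vowel, at the cost of one more code point than the assumed Unicode form.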

> > > > Also, why not use [CONS][ZWJ][CONS] instead of
> > > > [CONS][VIRAMA][CONS]? One could then use [VIRAMA]
> > > > only where it is explicit/visible.
> > > But this would not reflect the fact that the *glyph* [CONS][ZWJ][CONS] is
> > > actually the same thing as the *sequence of characters* [CONS][VIRAMA][CONS],
> >
> > But, it is not, certainly not in writing; and that's the whole point.
> > [CONS][ZWJ][CONS] and [CONS][(EXPLICIT)VIRAMA][CONS] are "identical" at a
> > level of linguistic abstraction that need not be reflected in text encoding.
> > Consider [C][L] and [C][L][VIRAMA]. They represent the same words, they
> > are the "same" at some level of representation, but that is irrelevant for
> > the task at hand.
>
> What exactly are [C] and [L] here ?

The letters [CA] and [LA], as in Bangla /cOl/ "come!" which can be written both with and without a final [VIRAMA]. The author's choice in this matter has to be respected.
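
A small Python sketch of the two spellings, assuming the standard Bengali code points for CA, LA and VIRAMA. It only shows that the two forms are distinct code point sequences and that NFC normalization leaves the final virama in place, so the author's choice survives in the encoded text.

```python
import unicodedata

CA, LA, VIRAMA = "\u099a", "\u09b2", "\u09cd"

plain = CA + LA                 # written without a final virama
with_virama = CA + LA + VIRAMA  # written with a final virama

# The spellings differ as encoded text.
print(plain == with_virama)

# NFC normalization does not drop or merge the trailing virama.
print(unicodedata.normalize("NFC", with_virama) == with_virama)
```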

> > > This latter decision is one that should be taken (normally) by the
> > > rendering mechanism (loosely speaking, the font), not the author.
> >
> > I disagree. If an author chooses to write a word with an explicit virama,
> > you have to respect that and let it be reflected in the encoding. Leaving
> > such decisions to the rendering engine would destroy the character and
> > flavor of certain texts. Furthermore there are metalinguistic uses of the
> > explicit virama that need to be kept distinct from forms with conjoined
> > characters.
>
> I did qualify my statement by saying that this should be the normal behaviour.
> An author would usually not bother about whether her 'da + ukaar' or 'sa +
> yaphala' is written by a distinct ligated glyph. The explicit virama-s you
> mention are definitely a common feature where the author's control is
> important, and that's what ZWNJ is for.

But the encoding that uses [ZWNJ] to encode an explicit [VIRAMA] is much less intuitive than the one I am suggesting. The [ZWNJ] in the former encoding merely acts as a flag to alert us that something very ad hoc is going on here! Our encodings should not only be adequate for the job of representing written texts, they should also *mean* something to us.
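
The two conventions can be put side by side in a short Python sketch. The function names `unicode_cluster` and `proposed_cluster` are hypothetical, and the second function is only one reading of the proposal ([ZWJ] for conjoined forms, [VIRAMA] only where it is visible).

```python
ZWJ, ZWNJ, VIRAMA = "\u200d", "\u200c", "\u09cd"
KA, TA = "\u0995", "\u09a4"

def unicode_cluster(c1, c2, explicit_virama=False):
    """Current convention: VIRAMA joins; VIRAMA + ZWNJ requests a visible virama."""
    return c1 + VIRAMA + (ZWNJ if explicit_virama else "") + c2

def proposed_cluster(c1, c2, explicit_virama=False):
    """Proposal as read here: ZWJ joins; VIRAMA is encoded only when visible."""
    return c1 + (VIRAMA if explicit_virama else ZWJ) + c2

for explicit in (False, True):
    u = unicode_cluster(KA, TA, explicit)
    p = proposed_cluster(KA, TA, explicit)
    print(explicit, [hex(ord(ch)) for ch in u], [hex(ord(ch)) for ch in p])
```

In the first function the explicit form is the conjoined form plus an extra control character; in the second, each written form has its own three-code-point sequence and [VIRAMA] appears in the text exactly when it appears on the page.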

> As for the 'flavor' of the texts you mention, if you are talking about visual
> appearance, then that's the purpose of the font you are using.

No, I am NOT talking about visual appearance. I am talking about writing a word with an explicit virama vs. writing it with a conjoined character. Recall /choToder pattaRi/ in Jugaantar.

> You will have a valid point if you can show an example where
> there's some text that you cannot reproduce with (1) unicode + (2) a properly
> implemented renderer + (3) a properly implemented font. Do you have any such
> example ?

No, this is going back to the claim that all is well as long as everything can be given an unambiguous representation, no matter how ad hoc or counterintuitive. This approach has already done a lot of harm to the system: look at the placement of diirgha RI and LI, or the Assamese RA and VA (a revision for the latter has been suggested and appears to be entirely on the right track), or even the proposal to assign a code point to KSH in Bangla, Hindi etc. (The fact that sorting/collation is often language-specific should not be misused to justify random assignment of code points to characters, or of encodings to character strings.) Compare, for example, [Vowel-A][ZWJ][Y][ZWJ][AA] in my scheme of encoding with the equivalent one in Unicode.

Best, Gautam

Gautam Sengupta
Professor of Applied Linguistics
Director, School of Linguistics & Language Technology
Jadavpur University
Kolkata 700 032, INDIA
Email: gsghyd@icqmail.com

This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST