Re: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Deepayan Sarkar (deepayan@stat.wisc.edu)
Date: Tue Oct 07 2003 - 23:58:48 CST


On Tuesday 07 October 2003 21:44, Gautam Sengupta wrote:
> > I don't know what the original motivations were, but
> > one thing about the
> > current (ISCII-based) encoding scheme that appeals
> > to me is that on average
> > it requires fewer characters than other more natural
> > schemes. Bangla has a
> > high percentage of 'vowel signs', each of which
> > would require two characters
> > in your scheme as opposed to one in the current one.
>
> There is a trade-off here between file size and the
> number of code points used. File size could be further
> reduced, for example, if combining forms of consonants
> were introduced. But that would be a step in the wrong
> direction for various reasons that I will not discuss
> here. I am not sure that the right thing to do is to
> economize on file size rather than code points.

That's a matter of opinion, and as I said, I don't know the motivations of the
original designers. In any case, I wouldn't dwell too much on this, for two
reasons. First, it's very unlikely that you will be able to influence people
enough to induce changes at such a fundamental level (especially at this late
stage, when there are already fully functional rendering implementations based
on the current scheme). Second, why does it matter what one particular
encoding scheme does? If you think something is better, use it, along with
some mechanism for converting from Unicode to your scheme and vice versa. Of
course, this assumes that it is possible to represent all reasonable features
of Bengali in Unicode, which it should be. If you think there's something
that's not possible, I believe there's a formal mechanism via which you can
submit requests/proposals to the Unicode Consortium.
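
As a rough illustration of such a conversion mechanism, here is a minimal
Python sketch. It assumes a purely hypothetical alternative scheme in which
each Bangla vowel sign is written as a marker followed by the corresponding
independent vowel; the marker (a private-use code point, U+E000) and the
function names to_alt/from_alt are invented for this example and are not
part of any proposal in this thread. It also shows the size trade-off
mentioned above: every vowel sign costs two code points instead of one.

```python
# Unicode Bengali vowel signs mapped to their independent vowel letters.
SIGN_TO_VOWEL = {
    "\u09BE": "\u0986",  # AA
    "\u09BF": "\u0987",  # I
    "\u09C0": "\u0988",  # II
    "\u09C1": "\u0989",  # U
    "\u09C2": "\u098A",  # UU
    "\u09C3": "\u098B",  # VOCALIC R
    "\u09C7": "\u098F",  # E
    "\u09C8": "\u0990",  # AI
    "\u09CB": "\u0993",  # O
    "\u09CC": "\u0994",  # AU
}
VOWEL_TO_SIGN = {vowel: sign for sign, vowel in SIGN_TO_VOWEL.items()}

# Hypothetical marker for the alternative scheme (an assumption, not a standard).
MARKER = "\uE000"


def to_alt(text):
    """Unicode -> hypothetical scheme: vowel sign becomes MARKER + vowel."""
    return "".join(
        MARKER + SIGN_TO_VOWEL[ch] if ch in SIGN_TO_VOWEL else ch
        for ch in text
    )


def from_alt(text):
    """Hypothetical scheme -> Unicode: MARKER + vowel becomes a vowel sign."""
    out = []
    i = 0
    while i < len(text):
        if (text[i] == MARKER and i + 1 < len(text)
                and text[i + 1] in VOWEL_TO_SIGN):
            out.append(VOWEL_TO_SIGN[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# ka + aa-sign + ka + i-sign: 4 code points in Unicode.
word = "\u0995\u09BE\u0995\u09BF"
alt = to_alt(word)
print(len(word), len(alt), from_alt(alt) == word)
```

The round trip is lossless as long as the marker never occurs in ordinary
text, which is why a private-use code point is used here.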

> > > Also, why not use [CONS][ZWJ][CONS] instead of
> > > [CONS][VIRAMA][CONS]? One could then use [VIRAMA]
> > > only where it is explicit/visible.
> >
> > But this would not reflect the fact that the *glyph*
> > [CONS][ZWJ][CONS] is
> > actually the same thing as the *sequence of
> > characters* [CONS][VIRAMA][CONS],
>
> But, it is not, certainly not in writing; and that's
> the whole point. [CONS][ZWJ][CONS] and
> [CONS][(EXPLICIT)VIRAMA][CONS] are "identical" at a
> level of linguistic abstraction that need not be
> reflected in text encoding. Consider [C][L] and
> [C][L][VIRAMA]. They represent the same words, they
> are the "same" at some level of representation, but
> that is irrelevant for the task at hand.

What exactly are [C] and [L] here?

> > This latter decision is one that should be taken
> > (normally) by the rendering mechanism (loosely
> > speaking, the font), not the author.
>
> I disagree. If an author chooses to write a word with
> an explicit virama, you have to respect that and let
> it be reflected in the encoding. Leaving such
> decisions to the rendering engine would destroy the
> character and flavor of certain texts. Furthermore
> there are metalinguistic uses of the explicit virama
> that need to be kept distinct from forms with
> conjoined characters.

I did qualify my statement by saying that this should be the normal behaviour.
An author would usually not bother about whether her 'da + ukaar' or 'sa +
yaphala' is written with a distinct ligated glyph. The explicit viramas you
mention are definitely a common case where the author's control is important,
and that's what ZWNJ is for: placed right after the virama, it asks the
renderer to keep the virama visible instead of forming a conjunct.
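
To make the distinction concrete, here is a small Python sketch of the two
code point sequences involved. The choice of ka and ta is arbitrary, and the
helper name describe is made up for this example. Whether the first sequence
is actually drawn as a conjunct depends on the renderer and the font; the
code only shows what is stored in the text.

```python
import unicodedata

KA = "\u0995"      # BENGALI LETTER KA
TA = "\u09A4"      # BENGALI LETTER TA
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA (hasanta)
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

# Normal case: the renderer is free to choose a conjunct glyph.
conjunct = KA + VIRAMA + TA

# Author's explicit choice: ZWNJ after the virama requests a visible virama.
explicit = KA + VIRAMA + ZWNJ + TA


def describe(text):
    """List the code points in text with their Unicode character names."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, seq in (("conjunct", conjunct), ("explicit", explicit)):
    print(label)
    for line in describe(seq):
        print("   ", line)
```

The two strings differ only by the ZWNJ, so stripping it recovers the plain
[CONS][VIRAMA][CONS] sequence when the author's distinction is not needed,
for example when comparing words.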

As for the 'flavor' of the texts you mention, if you are talking about visual
appearance, then that's the purpose of the font you are using. You will have
a valid point if you can show an example where there's some text that you
cannot reproduce with (1) Unicode + (2) a properly implemented renderer + (3)
a properly implemented font. Do you have any such example?

Deepayan


