Re: Encoding Bengali Vowel forms (again)

From: Peter Constable
Date: Fri Apr 28 2000 - 11:33:08 EDT

       Marco>> As usual, I cannot stop spitting my little word :-|

       Antoine>I believe I am as bad as you are. :-|.

       OK, I'll go along. :-|

       I'm very much inclined to agree with Marco that nothing *new*
       is needed, and also with Antoine that interested parties should
       discuss alternatives and agree on what will be done.

       Marco said:
>In general, viramas are just characters as any other,
>and can occur *everywhere*. And this is a general
>feature of Unicode: with few reasonable exceptions
>(e.g. unpaired surrogates), Unicode does not have a
>"syntax" that stipulates which sequences of
>characters are legal and which are not.
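
       To make Marco's point concrete, here is a minimal sketch in
       Python. It assumes that Python's str type is a fair stand-in
       for "a sequence of Unicode characters"; the helper name
       encodes_to_utf8 is made up for illustration. A virama after an
       independent vowel serializes without complaint, while a lone
       surrogate is one of the few sequences that is rejected.

```python
# Illustrative only: Python's str stands in for a Unicode character sequence.
VOWEL_A = "\u0985"  # BENGALI LETTER A (independent vowel)
VIRAMA = "\u09CD"   # BENGALI SIGN VIRAMA
YA = "\u09AF"       # BENGALI LETTER YA
AA = "\u09BE"       # BENGALI VOWEL SIGN AA


def encodes_to_utf8(text: str) -> bool:
    """Return True if the sequence can be serialized as UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Vowel + virama + YA + AA: unusual, but nothing forbids the sequence.
vowel_plus_virama = VOWEL_A + VIRAMA + YA + AA

# An unpaired surrogate: one of the few genuinely ill-formed cases.
lone_surrogate = "\ud800"

print(encodes_to_utf8(vowel_plus_virama))  # True
print(encodes_to_utf8(lone_surrogate))     # False
```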

       Marco's general comment about Unicode not having a syntax
       (apart from things like surrogates) is, in my understanding,
       mostly but not 100% true. For example, the standard does
       indicate that Devanagari dependent vowels are to be encoded
       after their consonant (in logical order) while Thai vowels are
       encoded in visual order (which sometimes means before the
       consonant). It's necessary to mandate some things of this sort
       so that the standard will get implemented in software, and
       implemented in a consistent manner such that data interchange
       is possible (and that's the purpose for a character encoding
       standard). It would be a big problem for data interchange if
       Devanagari dependent vowels were sometimes encoded before and
       sometimes after the consonant at the whim of individual
       implementers.
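
       The ordering difference can be seen by listing the stored
       code points. This is a small sketch using Python's standard
       unicodedata module; the two sample syllables and the helper
       name describe are chosen here purely for illustration.

```python
import unicodedata


def describe(text):
    """List each stored character as 'U+XXXX NAME', in storage order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Devanagari "ki": consonant first, then the vowel sign, although the
# vowel sign is drawn to the left of the consonant (logical order).
devanagari_ki = "\u0915\u093F"

# Thai "ke": the vowel is stored first because it is written first
# (visual order).
thai_ke = "\u0E40\u0E01"

for label, text in (("Devanagari", devanagari_ki), ("Thai", thai_ke)):
    print(label, describe(text))
```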

       In my mind, more of this is actually needed. Several months
       ago, we were working on our Yi font, and the samples that our
       clients showed us had occasional use of a middle dot as
       punctuation. Now, how many choices might there be for encoding
       this? I never made a thorough count, but it's more than one. I
       inquired on this list and with UTC to see if anyone could tell
       me what this punctuation character is and how it should be
       encoded, and nobody gave a definitive answer, probably because
       nobody had considered it before. We ended up using U+30FB
       KATAKANA MIDDLE DOT since this would have the
       fullwidth/monowidth properties needed for Yi. But what if
       another implementer chose to use one of the other characters
       with a similar visual appearance? The result would be a
       hindrance to successful interchange.
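
       As a rough illustration of how many look-alikes there are,
       the sketch below lists a few dot-like characters with their
       East Asian Width property, using Python's unicodedata module.
       The candidate list is my own assumption and is not
       exhaustive; the helper name survey is made up.

```python
import unicodedata

# A few visually similar candidates (an illustrative, incomplete list).
CANDIDATES = ["\u00B7", "\u2022", "\u2219", "\u22C5", "\u30FB", "\uFF65"]


def survey(chars):
    """Return (code point, name, East Asian Width) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.east_asian_width(ch))
        for ch in chars
    ]


for code, name, width in survey(CANDIDATES):
    print(f"{code}  {width:<2} {name}")
```

       In this listing U+30FB is reported as wide ("W"), which
       matches the fullwidth behaviour wanted for Yi text.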

       But I'm rambling. My point is that it is important for this
       issue to be discussed and that implementers agree on a
       solution. But, what Marco said about nothing prohibiting
       combining virama in new ways is absolutely true, as far as I
       know.

       Now, Apurva wrote:

>The semantics of Ya in conjunct formation and for
>use with LetterA /LetterE is very different.

       Semantics are different in what sense? Do you mean that they
       would represent different things phonologically/linguistically,
       or that different Unicode semantics would be required? If it's
       just a matter of different linguistic significance, that is a
       non-issue. The letter "g" has a different phonological meaning
       in "rag" and in "rough"; "e" has a different phonological
       meaning in "feet" and in "fate". But that doesn't mean
       different encodings are needed for these.

       There is nothing about the Unicode semantics of Bengali
       characters that prohibits using what is already there. All
       that's needed is to abandon certain assumptions, which Marco
       has already discussed. (I'll forward that message to the
       OpenType list for the benefit of people on that list who aren't
       on Unicode.) If you want to propose adding new characters to
       Unicode, you need to have good reasons why an implementation
       using the existing characters is inadequate *in terms of text
       processing issues* (not in terms of how speakers/writers think
       of the orthography - that is essentially irrelevant).

       As far as using the PUA is concerned, yes, that's an option.
       It becomes problematic, however, if you want all implementers
       to agree on particular PUA characters. Let's say everybody
       interested in Bengali gets together and agrees that E000 and
       E001 will be used for Vowel A_zophola_AA and Vowel
       E_zophola_AA, and let's suppose further that Apurva and co
       implement Uniscribe and some OT fonts based on this. In the
       meantime, somebody else has (as they are free to do) defined
       for their use E000 and E001 for a couple of Ethiopic characters
       that are being considered for future addition to Unicode.
       (That's a real situation - we're currently doing some work on
       Ethiopic, and we have made a number of such PUA assignments.)
       Now, that person has an Ethiopic font, and they want to display
       some text using MS software. They'll be pretty upset if
       Uniscribe munges their PUA characters. It's a legal use of
       Unicode for MS to define PUA characters for particular uses
       (though they are encouraged to do so near the top of the PUA
       range, and they really ought to publicly document what they
       do so that users will know what to expect of their software).
       But if they want to be concerned about what end users may want
       to do with their software, they need to think very carefully
       about any PUA assignments they make. As far as encouraging a
       widespread pseudo-standard use of the PUA, that is potentially
       counter to the intention of Unicode, particularly if you are
       trying to get a number of major software developers to go along.
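
       The collision described above can be sketched in a few lines
       of Python. Both tables are hypothetical private agreements
       invented for illustration (the Ethiopic labels in particular
       are placeholders), and the helper names are made up.

```python
import unicodedata

# Hypothetical private agreements; neither is a published assignment.
BENGALI_AGREEMENT = {
    0xE000: "Vowel A_zophola_AA",
    0xE001: "Vowel E_zophola_AA",
}
ETHIOPIC_AGREEMENT = {
    0xE000: "Ethiopic candidate 1 (placeholder)",
    0xE001: "Ethiopic candidate 2 (placeholder)",
}


def is_private_use(cp: int) -> bool:
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(chr(cp)) == "Co"


def conflicts(a: dict, b: dict) -> dict:
    """Code points that both agreements claim for different purposes."""
    return {cp: (a[cp], b[cp]) for cp in sorted(a.keys() & b.keys()) if a[cp] != b[cp]}


for cp, (first, second) in conflicts(BENGALI_AGREEMENT, ETHIOPIC_AGREEMENT).items():
    print(f"U+{cp:04X} private use={is_private_use(cp)}: {first!r} vs {second!r}")
```

       Both assignments are legal uses of the PUA; nothing in the
       data itself tells the software which agreement a given text
       follows.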

       I have no problem with a couple of PUA characters being used by
       a group of people interested in Bengali as an interim solution
       for the potential characters. Getting some particular support
       for that in Uniscribe would be, I think, not a good thing, and
       I'd be very surprised if MS would entertain that possibility.
       (But then, if you use the PUA, you don't need any smart font
       behaviour for these characters.)

       But I'd argue, along with Marco, in favour of your other proposed
       interim solution, and I'd argue that it shouldn't be just an
       interim solution but rather the permanent solution.

       Peter Constable

       From: <> AT Internet on 04/27/2000 07:07

       To: Peter Constable/IntlAdmin/WCT, <> AT
       cc: <> AT Internet@Ccmail, <>
             AT Internet@Ccmail
       Subject: Re: Encoding Bengali Vowel forms (again)

       wrote:
> As usual, I cannot stop spitting my little word :-|

       I believe I am as bad as you are. :-|.

> Abdul Malik wrote in his report:
> > Conclusion
> > "Vowel A_zophola_AA" and "Vowel E_zophola_AA" need to be
> > included in the Bengali Unicode range as separate vowels.
> > [...]
> I have no opinions about accepting or not this proposal.

       Neither do I. However, on OpenType, Apurva Joshi (who I believe
       is also on this list) did comment that this would be a much
       solution than the existing state of affairs (i.e. using
       after a vowel).

> What I think, however, is that it is wrong to say that such a
> change is *needed* for encoding Bengali.

       As you, I do not believe this is *needed*. BUT, I believe the
       should be sorted out, in order to provide correct rendering
       for Bengali, adapted to the chosen solution. Whether the
       for A-ya will be [\u0985\u09CD\u09AF\u09BE or \u0991 as
       proposed by Abdul or \u098D as used (indirectly) in CDAC
       products, or \u09Fx as suggested by Apurva], the current
       products need to be adjusted anyway.
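
       For reference, the candidate code points mentioned above can
       be inspected with Python's unicodedata module. This is only a
       sketch: the results depend on the Unicode version bundled
       with the Python build, and the helper name describe is made
       up. In the versions I am aware of, U+0991 and U+098D are
       unassigned positions in the Bengali block.

```python
import unicodedata


def describe(cp: int):
    """Return (code point, general category, name or a placeholder)."""
    ch = chr(cp)
    return f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch, "<unassigned>")


# The four-character sequence that uses only existing characters.
SEQUENCE = [0x0985, 0x09CD, 0x09AF, 0x09BE]

# The single code points mentioned as alternatives.
SINGLE_CANDIDATES = [0x0991, 0x098D]

for cp in SEQUENCE + SINGLE_CANDIDATES:
    print(*describe(cp))
```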


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT