Re: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Oct 09 2003 - 19:02:22 CST


Gautam asked:

> I stand corrected. Long syllabic /r l/ as well as
> Assamese /r v/ are indeed additions beyond the ISCII
> code chart. My objection, however, was not against
> their inclusion but against their placement. I
> understand why long syllabic /r l/ could not be placed
> with the vowels, but why were Assamese /r v/ assigned
> U+09F0 and U+09F1 instead of U+09B1 and U+09B5
> respectively?

Because the 7th and 8th rows in each of these Indic
scripts are where additions beyond the ISCII repertoire
were placed.

> > In the case of the Assamese letters, these
> > additions separate out the *distinct* forms for
> > Assamese /r/ and /v/ from the Bangla forms, and
> > *enable* correct sorting, rather than inhibiting it.
>
> I fail to understand why Assamese /r v/ wouldn't be
> correctly sorted if placed in U+09F0 and U+09F1.

I presume you mean U+09B1 and U+09B5.

The answer is that no Indic script is correctly sorted
simply by using code point order, anyway. You need
a more sophisticated algorithm. And since such an
algorithm will have weight tables, it doesn't *matter*
where a particular character is in the code chart.

See:

http://www.unicode.org/notes/tn1/

for a discussion of these issues.
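
To make that concrete, here is a minimal sketch in Python. It is an
illustration only, not the Unicode Collation Algorithm: the table
TOY_WEIGHTS and the function toy_key are made up for this example, and
the weights are arbitrary ranks rather than values from any real
collation table. It uses Latin letters, where the same point holds.

```python
# Toy weight table: arbitrary ranks for a-z (an assumption for this
# sketch, not real collation data).
TOY_WEIGHTS = {ch: rank for rank, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
# Give U+00E9 (e with acute) the same weight as plain "e".
TOY_WEIGHTS["\u00e9"] = TOY_WEIGHTS["e"]


def toy_key(word):
    """Map each character to its table weight; code points are never compared."""
    return [TOY_WEIGHTS[ch] for ch in word]


words = ["zebra", "\u00e9cole", "apple"]

# Plain code point order puts U+00E9 after "z".
by_code_point = sorted(words)
# Weight-table order puts it next to "e", wherever it sits in the chart.
by_weight = sorted(words, key=toy_key)

print(by_code_point)  # ['apple', 'zebra', 'école']
print(by_weight)      # ['apple', 'école', 'zebra']
```

Only the table decides the result. Moving a character to a different
code point would change the first list but not the second.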

> Why
> do they need to be separated out from the Bangla forms
> in order to enable correct sorting?

So that a tailored sorting for Assamese can be based
on Assamese letters, and a tailored sorting for Bangla
can be based on Bangla letters.
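
A rough sketch of what that separation allows, in Python. The two
letter orders below are written out by hand purely for illustration and
should be read as an assumption, not as data from any published
tailoring; make_weights and tailored_key are hypothetical helper names.

```python
def make_weights(alphabet):
    """Assign each letter a rank from its position in the given order."""
    return {ch: rank for rank, ch in enumerate(alphabet)}


def tailored_key(weights):
    """Build a sort key: tailored letters first by rank, others by code point."""
    def key(word):
        return [(0, weights[ch]) if ch in weights else (1, ord(ch)) for ch in word]
    return key


# Illustrative fragments only (assumed orders for this sketch).
BANGLA_ORDER = ["\u09AF", "\u09B0", "\u09B2", "\u09B6"]              # YA, RA, LA, SHA
ASSAMESE_ORDER = ["\u09AF", "\u09F0", "\u09B2", "\u09F1", "\u09B6"]  # YA, RA, LA, WA, SHA

letters = ["\u09B6", "\u09F1", "\u09B0", "\u09F0", "\u09AF", "\u09B2"]

bangla_sorted = sorted(letters, key=tailored_key(make_weights(BANGLA_ORDER)))
assamese_sorted = sorted(letters, key=tailored_key(make_weights(ASSAMESE_ORDER)))

print([f"{ord(c):04X}" for c in bangla_sorted])
print([f"{ord(c):04X}" for c in assamese_sorted])
```

Because U+09B0 and U+09F0 are distinct characters, each table can
place its own letter in the RA position without affecting the other.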

>
> > The addition of the long syllabic /r/ and /l/
> > *enables* the representation of Sanskrit
> > material in the Bengali script, and the code
> > position in the charts is immaterial.
>
> As stated earlier, my objection is not against their
> inclusion, but against their positioning on the code
> chart. Why is their relative position in the chart
> immaterial for sorting?

See the above technical note. If it will help you visualize
the answer in some way, here is an excerpt from the
Default Unicode Collation Element Table for the
Unicode Collation Algorithm (Version 4.0), showing the
default weight assignments for the relevant portion of the
Bengali script:

09AA ; [.15C4.0020.0002.09AA] # BENGALI LETTER PA
09AB ; [.15C5.0020.0002.09AB] # BENGALI LETTER PHA
09AC ; [.15C6.0020.0002.09AC] # BENGALI LETTER BA
09AD ; [.15C7.0020.0002.09AD] # BENGALI LETTER BHA
09AE ; [.15C8.0020.0002.09AE] # BENGALI LETTER MA
09AF ; [.15C9.0020.0002.09AF] # BENGALI LETTER YA
09DF ; [.15C9.0020.0002.09AF][.0000.00FD.0002.09BC] # BENGALI LETTER YYA; QQCM
09B0 ; [.15CA.0020.0002.09B0] # BENGALI LETTER RA
09F0 ; [.15CB.0020.0002.09F0] # BENGALI LETTER RA WITH MIDDLE DIAGONAL <---
09B2 ; [.15CC.0020.0002.09B2] # BENGALI LETTER LA
09F1 ; [.15CD.0020.0002.09F1] # BENGALI LETTER RA WITH LOWER DIAGONAL <---
09B6 ; [.15CE.0020.0002.09B6] # BENGALI LETTER SHA
09B7 ; [.15CF.0020.0002.09B7] # BENGALI LETTER SSA
09B8 ; [.15D0.0020.0002.09B8] # BENGALI LETTER SA
         ^^^^
         primary weights, in sorted order

As you can see, the two additional letters in question,
in the default table, sort in exactly the order you
are suggesting, and as I said, the position in the
*code chart* doesn't matter.
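
If you want to check this mechanically, the following Python sketch
parses a few lines of the excerpt above and sorts by the weights. It is
a simplification and an assumption on my part, not a conformant
implementation: it handles only single code point entries, ignores the
fourth field, and skips normalization, contractions, and variable
weighting. The names parse_table and simple_key are invented here.

```python
import re

EXCERPT = """\
09AF ; [.15C9.0020.0002.09AF] # BENGALI LETTER YA
09DF ; [.15C9.0020.0002.09AF][.0000.00FD.0002.09BC] # BENGALI LETTER YYA; QQCM
09B0 ; [.15CA.0020.0002.09B0] # BENGALI LETTER RA
09F0 ; [.15CB.0020.0002.09F0] # BENGALI LETTER RA WITH MIDDLE DIAGONAL
09B2 ; [.15CC.0020.0002.09B2] # BENGALI LETTER LA
09F1 ; [.15CD.0020.0002.09F1] # BENGALI LETTER RA WITH LOWER DIAGONAL
09B6 ; [.15CE.0020.0002.09B6] # BENGALI LETTER SHA
"""

# One collation element: [.primary.secondary.tertiary.fourth]
ELEMENT = re.compile(
    r"\[\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.[0-9A-F]{4,6}\]"
)


def parse_table(text):
    """Map each character to a list of (primary, secondary, tertiary) weights."""
    table = {}
    for line in text.splitlines():
        entry = line.split("#", 1)[0]  # drop the trailing comment
        if ";" not in entry:
            continue
        code, elements = entry.split(";", 1)
        char = chr(int(code.strip(), 16))
        table[char] = [
            tuple(int(w, 16) for w in m.groups())
            for m in ELEMENT.finditer(elements)
        ]
    return table


def simple_key(text, table):
    """Compare all primaries first, then secondaries, then tertiaries.

    Zero weights are skipped at each level.
    """
    elements = [ce for ch in text for ce in table[ch]]
    return tuple(
        tuple(ce[level] for ce in elements if ce[level] != 0)
        for level in range(3)
    )


TABLE = parse_table(EXCERPT)
letters = list(TABLE)

by_code_point = [f"{ord(c):04X}" for c in sorted(letters)]
by_weight = [f"{ord(c):04X}" for c in sorted(letters, key=lambda c: simple_key(c, TABLE))]

print(by_code_point)  # ['09AF', '09B0', '09B2', '09B6', '09DF', '09F0', '09F1']
print(by_weight)      # ['09AF', '09DF', '09B0', '09F0', '09B2', '09F1', '09B6']
```

The second list follows the table, so U+09F0 lands right after U+09B0
and U+09F1 right after U+09B2, even though both sit at the far end of
the code chart.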

> If it is merely because there
> are script-specific sorting mechanisms already in
> place, then it's just a bad excuse for a sloppy job. I
> sincerely hope there is more to it than just that.

It truly does not matter. *No* script in the Unicode
Standard is encoded completely in a collation order.
*All* scripts must be handled via weight tables in
order to produce desired sorting behavior. That is
true for Latin, Greek, Cyrillic, ..., as well as Devanagari,
Bengali, Gujarati, ..., so there is nothing particularly
different about the encoding of Bengali in this respect.

>
> > But be that as it may, they (TDIL) have nothing to
> > do with the code point choices in the range
> > U+09E0..U+09FF ...
>
> If this is indeed the case, then I must say it's
> rather unfortunate. As a full corporate member
> representing the Republic of India, the Ministry of
> Information Technology should have had a BIG say in
> the matter. Were they ever consulted on the issue?

Of course, once they got involved. And they have been
making suggestions ever since. But you need to recognize
that the particular characters you are concerned about
were standardized and published by ISO in 1993 (based,
it is true, on charts published by Unicode even earlier,
which in turn were based on the ISCII standard),
well before the Government of India became a member of
the Unicode Consortium.

--Ken

> Did
> they try to intervene suo moto? Will a Unicode
> official kindly let us know? Best, -Gautam.


