RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 08 2003 - 15:12:37 CST


Gautam suggested:

> You are absolutely right. I am suggesting that the
> language-specific viramas be retained as
> script-specific *explicit* viramas that never
> disappear. In addition, let's have a script-specific
> ZWJ which behaves in the way you describe in the
> preceding paragraph. The explicit virama (rather the
> ONLY virama) will never appear after a vowel, but the
> language-specific ZWJ will, as in <A><ZWJ><Y><AA>
> encoding A+YOPHOLA+AA. The cost is just one additional
> code point for each script.

The "cost" is not measured in code points, but in change
of model, change of implementations, normalization of
data, mismatching and failures of searches on data
represented differently, and on and on...
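
To make the search-mismatch cost concrete, here is a minimal
sketch. It assumes one reading of the proposal: a CV sequence
such as KA + AA would be written as consonant + (renamed)
virama + independent vowel instead of consonant + vowel sign.
That mapping is an assumption made for illustration, not
something spelled out in the proposal.

```python
import unicodedata

KA = "\u0995"             # BENGALI LETTER KA
VOWEL_SIGN_AA = "\u09BE"  # BENGALI VOWEL SIGN AA
VIRAMA = "\u09CD"         # BENGALI SIGN VIRAMA (the proposed "script-specific ZWJ")
LETTER_AA = "\u0986"      # BENGALI LETTER AA (independent vowel)

# Existing encoding of the syllable "kaa".
existing = KA + VOWEL_SIGN_AA

# Assumed re-encoding under the proposal (illustrative only).
proposed = KA + VIRAMA + LETTER_AA


def nfc(text):
    """Return the NFC-normalized form of text."""
    return unicodedata.normalize("NFC", text)


# A document stored before the change, searched with both spellings.
document = "x " + existing + " y"
found_existing = existing in document
found_proposed = proposed in document

# Normalization does not reconcile the two spellings.
same_after_nfc = nfc(existing) == nfc(proposed)

print(found_existing, found_proposed, same_after_nfc)
```

The old spelling is found, the new one is not, and NFC leaves
the two sequences distinct, so every store of existing data
would need its own conversion pass.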

> Note that we will no
> longer need the combining vowels or an additional code
> point for YAPHOLA.

> I agree with you on all of these issues. You have in
> fact summed up my critique of the ISCII/Unicode model.
> The only point I'd like to add here is that these
> mistakes were avoidable and should have been avoided.
> There can be no excuses for placing the Assamese r and
> v the way they are currently placed. The same goes for
> the long syllabic R and L.

Placement in the code charts is, however, irrelevant to the
correct ordering of strings represented using those
code points. That is done by a collation algorithm with
weight tables -- not by the presumed mechanism of binary
ordering implied by ISCII.
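
A toy illustration of the difference, using the Assamese ra
(U+09F0), which sits after the digits in the Bengali block.
The weight values below are hypothetical, chosen only for this
sketch; they are not taken from any real collation table.

```python
KA = "\u0995"           # BENGALI LETTER KA
RA = "\u09B0"           # BENGALI LETTER RA
LA = "\u09B2"           # BENGALI LETTER LA
ASSAMESE_RA = "\u09F0"  # BENGALI LETTER RA WITH MIDDLE DIAGONAL

# Hypothetical weight table: Assamese ra is weighted right after RA,
# regardless of where it sits in the code chart.
WEIGHTS = {KA: 10, RA: 20, ASSAMESE_RA: 21, LA: 30}


def collation_key(text):
    """Map each character to its weight; unknown characters sort last."""
    return [WEIGHTS.get(ch, 1000 + ord(ch)) for ch in text]


words = [LA, ASSAMESE_RA, KA, RA]

# Binary (code point) order puts Assamese ra after LA.
binary_order = sorted(words)

# Weight-table order puts it next to RA, where a reader expects it.
collated_order = sorted(words, key=collation_key)

print([f"U+{ord(w):04X}" for w in binary_order])
print([f"U+{ord(w):04X}" for w in collated_order])
```

The same code points produce two different orders; only the
second depends on the weight table, and changing the table
requires no change to the encoded data.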

> > But, all summed up, leaving with these little flaws
> > is *much* simpler than
> > trying to change the rules of a standard a dozen
> > years after people started
> > implementing it.
>
> Take a second look.

Marco is, however, absolutely correct in his overall assessment
here.

> My suggestion amounts to:
>
> 1. retaining the script-specific virama as it is. Its
> existing behavior remains unchanged. I rename it as
> "(script-specific) ZWJ" merely for my convenience and
> conceptual clarity.
>
> 2. extending the role of this script-specific ZWJ to
> encode combining forms of vowels in CV sequences,
> entirely in line with the way it is used to encode CC
> ligatures.
>
> [1 and 2 may sound somewhat different from what I have
> suggested above, but they are in effect the same].
>
> 3. introducing a script-specific explicit virama,
> which we can very well afford after getting rid of all
> the combining forms of vowels.

"Affording" this has nothing to do with available code
points. The problem is the reconstruction of the text
model. And "getting rid of all the combining forms of
vowels" would be a radical reconstruction of the text
model -- something which the Unicode Standard simply
cannot accommodate.

>
> 4. getting rid of *all* precomposed forms including
> the recent innovations in Devanagari that are used
> only for transliteration. These not only fill up the
> code space of Devanagari but also put constraints on
> the placement of characters in the code spaces of
> other Indian scripts.

Again, "filling up the code space" has nothing to do
with the assessment.

>
> How much recoding would these changes involve?

Extensive. And *any* recoding of Unicode characters is
simply disallowed by the stability guarantees associated
with the standard:

http://www.unicode.org/standard/stability_policy.html

> Would
> the cost be really unacceptable?

Yes. Absolutely.

--Ken

>
> Best, Gautam


