RE: Bangla: [ZWJ], [VIRAMA] and CV sequences

From: Unicode (public) (Unicode-mail@las-inc.com)
Date: Wed Oct 08 2003 - 15:26:47 CST


Gautam--

>Take a second look. My suggestion amounts to:
>
>1. retaining the script-specific virama as it is. Its
>existing behavior remains unchanged. I rename it as
>"(script-specific) ZWJ" merely for my convenience and conceptual
clarity.
>
>2. extending the role of this script-specific ZWJ to
>encode combining forms of vowels in CV sequences,
>entirely in line with the way it is used to encode CC ligatures.
>
>[1 and 2 may sound somewhat different from what I have suggested above,
>but they are in effect the same].
>
>3. introducing a script-specific explicit virama,
>which we can very well afford after getting rid of all
>the combining forms of vowels.
>
>4. getting rid of *all* precomposed forms including
>the recent innovations in Devanagari that are used
>only for transliteration. These not only fill up the
>code space of Devanagari but also put constraints on
>the placement of characters in the code spaces of
>other Indian scripts.
>
>How much recoding would these changes involve? Would
>the cost be really unacceptable?

Yes, the cost is really unacceptable.

Two of the most basic Unicode stability policies dictate that character
assignments, once made, are never removed and character names can never
change. Step 4 cannot happen; the best that can happen is that the code
points in question can be deprecated. The renaming you suggest in 1
cannot happen either.

The change in the encoding model for the virama can't happen either;
there are too many implementations based on it, and there are too many
documents out there that use the current encoding model. Your
suggestion wouldn't make them unreadable when opened with software that
did things the way you're suggesting, but it would change their
appearance in ways that are unlikely to be acceptable.

[I preface what follows with the observation that I'm not by any stretch
of the imagination an expert on Indic scripts, but I do fancy myself an
expert on Unicode.]

I'm also pretty sure that using ZWJ as a virama won't work and isn't
intended to work. KA + ZWJ + KA means something totally different from
KA + VIRAMA + KA, and I, for one, wouldn't expect them to be drawn the
same. U+0915 represents the letter KA with its inherent vowel sound;
that is, it represents the whole syllable KA. Two instances of U+0915
in a row would thus represent "KAKA", completely irrespective of how
they're drawn. Introducing a ZWJ in the middle would allow the two
SYLLABLES to ligate, but there's no ligature that represents "KAKA", so
you should get the same appearance as you do without the ZWJ. The
virama, on the other hand, cancels the vowel sound on the KA, turning it
into K: The sequence KA + VIRAMA + KA represents the syllable KKA, again
irrespective of how it is drawn.
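
To put that in concrete code-point terms, here's a minimal Python
sketch (it uses the Devanagari code points already cited above; the
Bengali equivalents would be U+0995 KA and U+09CD VIRAMA):

    KA     = "\u0915"   # DEVANAGARI LETTER KA -- the whole syllable "ka"
    VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA -- kills the inherent vowel
    ZWJ    = "\u200D"   # ZERO WIDTH JOINER -- a rendering hint only

    kaka = KA + ZWJ + KA      # still means the two syllables "kaka"
    kka  = KA + VIRAMA + KA   # means the single syllable "kka"

    print([f"U+{ord(c):04X}" for c in kaka])  # ['U+0915', 'U+200D', 'U+0915']
    print([f"U+{ord(c):04X}" for c in kka])   # ['U+0915', 'U+094D', 'U+0915']

The two strings differ in meaning, not just in appearance, which is
exactly why ZWJ can't stand in for the virama.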

In other words, ZWJ is intended to change the APPEARANCE of a piece of
text without changing its MEANING (there are exceptions in the Arabic
script, but this is the general rule). Having KA + ZWJ + KA render as
the syllable KKA would break this rule: the ZWJ would be changing the
MEANING of the text.

Whether the syllable KKA gets drawn with a virama, a half-form, or a
ligature is the proper province of ZWJ and ZWNJ, and this is what
they're documented in TUS to do. But ZWJ can't (and shouldn't) be used
to turn KAKA into KKA.
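
In those documented sequences the virama character stays in the text
and ZWJ/ZWNJ merely steer the display, roughly like this (again just a
sketch, using the Devanagari code points):

    KA, VIRAMA, ZWJ, ZWNJ = "\u0915", "\u094D", "\u200D", "\u200C"

    kka_default  = KA + VIRAMA + KA         # font picks ligature/half-form/virama
    kka_halfform = KA + VIRAMA + ZWJ + KA   # requests the half-form of the first KA
    kka_visible  = KA + VIRAMA + ZWNJ + KA  # requests an explicit, visible virama

All three strings spell the same syllable KKA; only the preferred glyph
shape differs, and a font that can't honor the request simply falls
back to whatever form it can produce.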

Maybe it was unfortunate to call U+094D a "virama," since it doesn't
necessarily get drawn as a virama (or, indeed, as anything), but it's
too late to revisit that decision. For that matter, it may have been a
mistake to use the virama model to encode conjunct forms in Bengali, but
it's too late to change that now. Real users generally shouldn't have
to care, though; this is an issue for programmers and font designers.
Their lives may be harder than they should have been, but unless it's
horribly hard for them to produce the right effects for their users, it
isn't worth it to reopen the issue of Unicode encoding of Indic scripts,
especially the ones that have been in Unicode for more than a decade
now.

There are lots of things that suck about Unicode, but on the whole, it's
way better than what came before and solves more problems than it
creates. Backward compatibility is a pain in the butt, and it forces us
to live with a lot of mistakes and suboptimal solutions we wish we
didn't have to live with. But backward compatibility is also good-- it
means the solution was good enough in the first place that people are
using it.

--Rich Gillam
  Language Analysis Systems, Inc.
  "Unicode Demystified"


