L2/99-026
UTC/1999-001

Subject: 3.0 Devanagari--Eyelash RA (3rd version)

                                           Wednesday, January 27, 1999

PREFACE

     This is a purely personal proposal from a personal member of the
Unicode Consortium.  It starts with a description of how Devanagari script
works for those unfamiliar with it, and its present encoding.  It then
describes certain problems and proposes solutions.  It closes with
discussion of some alternative solutions.

DEVANAGARI SCRIPT

     Devanagari script is basically syllabic.  It is based on analysis of
the sound system of Sanskrit done in the 4th century B.C.  Western
phonetic and grammatical analysis only surpassed it in the 19th century
after linguists learned of Indian efforts.

     1. Devanagari is basically a syllabic script.  Unlike English, the
spoken and written forms are well synchronized.

     2. A syllable consists of either a vowel alone or one or more 
consonants and a vowel.  Syllables are generally 'open'--like Italian,
they tend not to have any final consonant.  The vowel of a syllable with a
consonant is either implicit or explicit. The implicit vowel sounds like
the 'a' or 'o' in 'another'.  (A syllable can also have 'other signs' but
all but one, virama, are outside the scope of this proposal.)  A syllable
containing a vowel alone is called an independent vowel; there are 18 at
U+0905 to U+0914 and U+0960 to U+0961. There are 17 dependent vowel signs
written above, below or beside a consonant to override the implicit vowel;
they are encoded at U+093E to U+094C and U+0962 to U+0963.  The one extra
independent vowel represents the implicit vowel when it begins a syllable. The
45 consonants with their implicit vowel are encoded at U+0915 to U+0939 and
U+0958 to U+095F--note that both their names and their pronunciation are
'ka', 'kha' etc. Looking at the Devanagari chart, one also notes that
these 45 can be graphically categorized: a. 27 have a vertical bar on the
right (U+0916 to U+0918, U+091A, U+091C to U+091E, U+0923 to U+0925,
U+0927 to U+092A, U+092C to U+092F, U+0932, U+0935 to U+0938, U+0959 to
U+095B and U+095F); b. four have a vertical bar in the middle (U+0915,
U+092B, U+0958 and U+095E); c. 14 'round bottom consonants' do not have a
vertical bar (U+0919, U+091B, U+091F to U+0922, U+0926, U+0930 to U+0931,
U+0933 to U+0934, U+0939 and U+095C to U+095D).      
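
     The counts in item 2 can be checked mechanically.  The following is
a minimal Python sketch, not part of any Unicode data file: the names
(span, RIGHT_BAR, shape_class and so on) are invented here for
illustration, and the ranges are copied from the text above, with the last
right-bar range taken as U+0959 to U+095B.

```python
def span(first, last=None):
    """Return the set of code points from first to last, inclusive."""
    if last is None:
        last = first
    return set(range(first, last + 1))

# Ranges copied from item 2; all names here are illustrative only.
INDEPENDENT_VOWELS = span(0x0905, 0x0914) | span(0x0960, 0x0961)
VOWEL_SIGNS = span(0x093E, 0x094C) | span(0x0962, 0x0963)
CONSONANTS = span(0x0915, 0x0939) | span(0x0958, 0x095F)

RIGHT_BAR = (span(0x0916, 0x0918) | span(0x091A) | span(0x091C, 0x091E)
             | span(0x0923, 0x0925) | span(0x0927, 0x092A)
             | span(0x092C, 0x092F) | span(0x0932) | span(0x0935, 0x0938)
             | span(0x0959, 0x095B) | span(0x095F))
MIDDLE_BAR = {0x0915, 0x092B, 0x0958, 0x095E}
ROUND_BOTTOM = (span(0x0919) | span(0x091B) | span(0x091F, 0x0922)
                | span(0x0926) | span(0x0930, 0x0931)
                | span(0x0933, 0x0934) | span(0x0939)
                | span(0x095C, 0x095D))

def shape_class(ch):
    """Name the graphic category of one consonant character."""
    cp = ord(ch)
    if cp in RIGHT_BAR:
        return "right bar"
    if cp in MIDDLE_BAR:
        return "middle bar"
    if cp in ROUND_BOTTOM:
        return "round bottom"
    raise ValueError(f"U+{cp:04X} is not one of the 45 consonants")

def is_partition():
    """True if the three categories cover the consonants without overlap."""
    groups = (RIGHT_BAR, MIDDLE_BAR, ROUND_BOTTOM)
    covered = set().union(*groups)
    return covered == CONSONANTS and sum(map(len, groups)) == len(CONSONANTS)
```

     With these sets the three categories have 27, 4 and 14 members and
together cover the 45 consonants exactly once, which agrees with the
figures given in item 2.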

     3. When a syllable consists of more than one consonant, e.g., 'sta'
the first consonant loses its implicit vowel and is called a dead
consonant.  These combinations are called conjunct consonants or a
conjunct cluster. Consonants in a cluster can be rendered in several ways:
     a. With a character beneath them called a virama or halant (U+094D)
that signals removal of the implicit vowel.  This is considered the least
desirable typographically.
     b. For consonants with a right vertical bar, the vertical bar
signaling the implicit vowel is removed and the left portion, called
the half-form, is shifted right against the other consonant; see Unicode
2.0 table 6-10 for some half forms.  These are also called fused forms. 
     c. For consonants with a center bar, the part to the right of the bar
is removed and the remainder is shifted against the other consonant; see
table 6-10 again.
     d. The two or more consonants are combined into a ligature which may
or may not resemble its parts, see 2.0 table 6-11 for sample ligatures.
Note that in 24 of these 37 sample ligatures the first consonant has a
round bottom so methods b and c above cannot be used.  (The three items in
the table with nothing in the second column are consonant + vowel sign
combinations where the position of the vowel sign is abnormal.)   
     e. One consonant is transformed to a combining character and 
superimposed above or below another character.  This occurs mainly
with the consonant RA.  When RA is the first consonant in a conjunct
cluster it generally becomes a hook called reph above the last consonant.
See 2.0 R2 on page 6-39. When RA is the last consonant of a conjunct it
appears as a combining character below the preceding consonant. The shape
of this combining character depends on whether the preceding consonant has
a vertical bar or not: if there is a vertical bar the combining RA appears
as a prop on the left side of the preceding consonant (see R7 on page
6-40); otherwise the combining RA appears as a flattened inverted 'v'
beneath the round bottom consonant (see R6 on same page).
     f.  When a conjunct cluster has more than two consonants, more than
one of the above may be relevant.

     4.  When encoded in Unicode all but the last consonant of a conjunct
cluster have a virama following them, leaving rendering of the conjunct
to display software. This is unambiguous, consistent with ISCII practice
and greatly reduces the number of codes needed.  Better Devanagari fonts
can have several hundred elements with which software builds the various
conjuncts.
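
     As an illustration of item 4, the sketch below builds the encoded
form of a conjunct cluster by placing a virama after every consonant
except the last.  It is a minimal Python example; the function name
encode_cluster is hypothetical, and the 'sta' cluster from item 3 is
assumed to be SA (U+0938) followed by TA (U+0924).

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def encode_cluster(consonants, vowel_sign=""):
    """Encode one syllable: consonants joined by virama, then a vowel sign.

    consonants is a sequence of single consonant characters in spoken
    order; vowel_sign is an optional dependent vowel sign (empty string
    keeps the implicit vowel of the last consonant).
    """
    if not consonants:
        raise ValueError("a conjunct cluster needs at least one consonant")
    # Every consonant but the last is 'dead', so it is followed by virama.
    return VIRAMA.join(consonants) + vowel_sign

def code_points(text):
    """Format a string as a list of U+XXXX labels for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

SA, TA = "\u0938", "\u0924"
STA = encode_cluster([SA, TA])
```

     Here code_points(STA) gives "U+0938 U+094D U+0924".  Which of the
renderings in item 3 is actually displayed is left to the font and the
display software.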

THE PROBLEM

     The above phonetic encoding works well except: 1. When the writing
system is not phonetic--orthographic exceptions occur; or 2. When one
wants to represent other than plain text--for example, a treatise *about*
Devanagari that shows constituent parts of certain combinations separately.

     In Unicode 2.0 there are two occurrences of the first condition.
The first is the fourth example in figure 6-12, which claims to represent
a dead RA followed by the independent (nominal) form of the vocalic R
(U+090B), which together are rendered as the independent form of the vowel
sign R with the superscript (reph) form of RA.  It actually
represents the nominal form of RA followed by the dependent vowel sign R
(U+0943) which usually displays as a small 'c' beneath a consonant but in
this very rare case displays as shown--the independent vowel vocalic R with
the superscript RA (reph) above it. The second is our friend the 'eyelash
RA'. In 3.e above I say that dead RA as the first character of a conjunct
*generally* appears as a hook (reph) above the next consonant.  There is
an exception:  In Marathi, when the combination of dead RA followed by YA
or HA is due to grammatical inflection the RA displays as a smile or
eyelash before the YA or HA; when the dead RA and YA or HA are semantically
part of the word in all contexts the dead RA displays as the usual
RAsup--reph.  I quote from 'Marathi Language Course' by H.M.Lambert
(Oxford University Press, 1943, p.122): "The letter [RA virama]
immediately preceding another consonant is combined with it in two
different ways. When the combined letter consisting of [RA] and [YA]
occurs in all forms of the word, as in the examples in Lesson 5, [RA] is
written as [reph], for example ... When however, [RA] is followed
immediately by [YA] as a result of the inflection of a word which does not
have the combined consonant [RA]-[YA] in all its forms, the sign for [RA]
is written before, not above the consonant which follows." Later in her
'Introduction to the Devanagari Script' (Oxford University Press, 1953,
p.125): "A special form of reph is written to represent [RA virama]
preceding [YA] or [HA] in Marathi words.  This form of reph is sometimes
written with [HA] in Sanskrit loan words, but it is not usual to write
this form in a Sanskrit text ... The writing of reph in this form with
[YA] is usually restricted to Marathi words in which the combination of
[RA] and [YA] arises from grammatical processes." See R5 (2.0 page 6-39) for
current Unicode procedures (use of ZWJ) and display of the eyelash RA.
For rendering, some special procedure is needed, since even knowing the text
was Marathi would not suffice: one would need to consult a Marathi
dictionary to try to learn whether or not the word's root contained the RA
+ virama + (YA or HA) combination.
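
     Because the choice cannot be computed from the text, the writer has
to record it in the encoded string.  The Python sketch below shows one
reading of the Unicode 2.0 convention in R5, on the assumption that ZWJ
after a dead RA requests the eyelash form and its absence requests reph.
The function name dead_ra_forms is hypothetical.

```python
import re

RA, YA, HA = "\u0930", "\u092F", "\u0939"
VIRAMA = "\u094D"
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Dead RA, an optional ZWJ, then YA or HA.
DEAD_RA = re.compile(f"{RA}{VIRAMA}({ZWJ}?)([{YA}{HA}])")

def dead_ra_forms(text):
    """List the form requested for each dead RA that precedes YA or HA.

    Returns (form, following_consonant) pairs, where form is "eyelash"
    if a ZWJ follows the virama and "reph" otherwise.
    """
    return [("eyelash" if m.group(1) else "reph", m.group(2))
            for m in DEAD_RA.finditer(text)]

WITH_REPH = RA + VIRAMA + YA
WITH_EYELASH = RA + VIRAMA + ZWJ + YA
```

     dead_ra_forms(WITH_REPH) gives [("reph", YA)] and
dead_ra_forms(WITH_EYELASH) gives [("eyelash", YA)].  The sketch only
reads what the writer encoded; it does not decide which form a given
Marathi word requires.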
  
     The second problem arises when a text is *about* Devanagari 
characters rather than plain text in Devanagari.  Figure 6-13 shows use of
ZWNJ to get the dead form of a consonant rather than the customary fused form
shown as the third example of figure 6-12.  I believe KAn + Virama + space
would be an easier way to get the same result--one that avoided embedding a
control character in the text string.  Figure 6-14 shows use of ZWJ to get
the half form of a consonant when rendering would otherwise get the
special fused form from the third example in figure 6-12--ZWJ functioning 
as an invisible letter to which the dead KA connects. In figure 6-15 ZWJ
is used to get an independent half form of GA, but the figure doesn't show what
follows the ZWJ, so one is left wondering what the next character is.  Glenn Adams
has said that like figure 6-15, rule R5 shows how to evoke the eyelash RA
as an independent half form only.  His logic is impeccable: an eyelash RA
cannot connect to the invisible character ZWJ and the YA or HA which
presumably would follow it.  But this would leave Unicode in the
anomalous situation of having a way to evoke the rare occurrence of an
independent eyelash RA, but no means to produce the more common use of
eyelash RA as part of a conjunct cluster.       
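
     The character sequences discussed in this paragraph can be listed
side by side.  The Python sketch below is illustrative only.  It assumes
that the second consonant in figures 6-12 to 6-14 is SSA (U+0937), which
the text above does not state, and the dictionary keys are descriptive
labels invented here.

```python
KA, SSA = "\u0915", "\u0937"   # SSA is an assumed second consonant
VIRAMA = "\u094D"
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

SEQUENCES = {
    # Plain dead KA + consonant: rendering picks the fused form.
    "fused form (figure 6-12)": KA + VIRAMA + SSA,
    # ZWNJ after the virama: dead KA shown with a visible virama.
    "visible virama (figure 6-13)": KA + VIRAMA + ZWNJ + SSA,
    # ZWJ after the virama: half form of KA.
    "half form (figure 6-14)": KA + VIRAMA + ZWJ + SSA,
    # Alternative suggested in the text: no control character at all.
    "visible virama via space": KA + VIRAMA + " ",
}

def code_points(text):
    """Format a string as a list of U+XXXX labels for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

def has_format_control(text):
    """True if the string embeds ZWJ or ZWNJ."""
    return ZWJ in text or ZWNJ in text
```

     Only the two middle entries embed a control character, which is the
point of the KA + virama + space alternative.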

     This approach has two difficulties:  1. Frequently a letter can have
more than one half form.  Especially for round bottom consonants the half
form depends on the preceding or following letter, so there is no one
independent half form and getting the desired half form by use of ZWJ
is unpredictable.  2. In R5 (2.0 page 6-39) ZWJ is assigned to signal
that the eyelash RA is wanted, which means there is no way to force
software to create the independent RAsup (reph).

PROPOSED SOLUTIONS

     1. It is proposed that R5 be changed to use RRA (U+0931) and
virama before YA or HA instead of RA (U+0930) and virama and ZWJ for the
eyelash RA, and note the practice of earlier versions of Unicode.  The
text might read: "If the dead consonant RRA(d) precedes YA or HA, then the
half-consonant form RRA(h), known as eyelash RA, is used.  This glyph is
commonly used in writing Marathi.  Use of this convention follows ISCII
practice; earlier versions of Unicode used ZWJ to signal the eyelash
RA." The RRA(h) should be added to the section on notation.
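
     If proposal 1 were adopted, existing text that uses the ZWJ
convention could be converted mechanically.  The following Python sketch
shows one way this might be done; the function name to_proposed is
hypothetical, and the conversion is limited to a dead RA followed by YA
or HA, as in the proposed wording.

```python
import re

RA, RRA = "\u0930", "\u0931"
YA, HA = "\u092F", "\u0939"
VIRAMA = "\u094D"
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Earlier convention: RA + virama + ZWJ, looked for only before YA or HA.
LEGACY_EYELASH = re.compile(f"{RA}{VIRAMA}{ZWJ}(?=[{YA}{HA}])")

def to_proposed(text):
    """Rewrite RA + virama + ZWJ before YA or HA as RRA + virama."""
    return LEGACY_EYELASH.sub(RRA + VIRAMA, text)

def code_points(text):
    """Format a string as a list of U+XXXX labels for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

LEGACY = RA + VIRAMA + ZWJ + YA
PROPOSED = to_proposed(LEGACY)
```

     Here code_points(PROPOSED) gives "U+0931 U+094D U+092F".  A plain
RA + virama + YA sequence, which requests reph, passes through unchanged.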

     2. I would also move the fourth example in figure 9-3 to near the end
of table 9-2 and show it as formed from [RA] and vowel sign vocalic R.
The text relating to figure 9-3 would need alteration too.

     3. In Unicode 1.0 the chart for Devanagari included more than one
nominal form of several letters and numerals, not just consonants.  In
draft 3.0, page 176, the last sentence of the paragraph beginning "Some
Devanagari letters ..." says: "In certain cases, however, more than
one nominal form is depicted for a single character, where a common
stylistic alternate of a nominal form exists."  Since the chart no longer
depicts alternate forms the sentence should be omitted.

     4. In the charts of review draft 3.0, page 63, the note at 0931
might be changed to show its use for transcribing the Kannada, Telugu and
Malayalam RRA in Devanagari too.

ALTERNATE SOLUTIONS

     The above solution for the eyelash RA was recommended by Michael
Everson--I initially pointed out that the present convention had problems.
Lloyd Anderson has suggested defining a separate code because the
character is not phonetically a RRA.  Either solution (using RRA or a 
new code) would require careful reading of documentation and special attention
during creation of sorting software.  Using RRA has the virtue of matching
ISCII. Mark Davis has suggested that any dead RRA be displayed as an
eyelash RA, not just those followed by YA and HA.  It is my understanding
that when the Dravidian (not just Tamil) languages are given in
Devanagari, Indian practice is to display dead RRA with its dot and virama
unless the dead RRA is followed by YA or HA.  (I have no idea how many
Southern Indians would read their language in Devanagari script.)  For
compatibility Mark Davis also wanted to retain the old ZWJ usage as a less
desirable option.  I think freeing the ZWJ to represent only RAsup (reph)
would be a preferable capability. 

     The more general topic of whether Unicode should provide for 
expressing parts of letters as in typographic and calligraphic texts, and,
if so, how to do so I leave for others on another day.   

     Thank you for your time and attention.

     Regards,
          Jim Agenbroad ( jage@LOC.gov )
     The above are purely personal opinions, not necessarily                      
the official views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.