L2/99-026
UTC/1999-001

Subject: 3.0 Devanagari--Eyelash RA (3rd version)
Wednesday, January 27, 1999

PREFACE

This is a purely personal proposal from a personal member of the Unicode Consortium. It starts with a description, for those unfamiliar with it, of how Devanagari script works and of its present encoding. It then describes certain problems and proposes solutions. It closes with discussion of some alternative solutions.

DEVANAGARI SCRIPT

Devanagari script is based on analysis of the sound system of Sanskrit done in the 4th century B.C. Western phonetic and grammatical analysis only surpassed it in the 19th century after linguists learned of Indian efforts.

1. Devanagari is basically a syllabic script. Unlike English the spoken and written forms are well synchronized.

2. A syllable consists of either a vowel alone or one or more consonants and a vowel. Syllables are generally 'open'--like Italian, they tend not to have any final consonant. The vowel of a syllable with a consonant is either implicit or explicit. The implicit vowel sounds like the 'a' or 'o' in 'another'. (A syllable can also have 'other signs' but all but one, virama, are outside the scope of this proposal.)

A syllable containing a vowel alone is called an independent vowel; there are 18 at U+0905 to U+0914 and U+0960 to U+0961. There are 17 dependent vowel signs written above, below or beside a consonant to override the implicit vowel; they are encoded at U+093E to U+094C and U+0962 to U+0963. The one additional independent vowel (18 against 17) is for the implicit vowel when it begins a syllable.

The 45 consonants with their implicit vowel are encoded at U+0915 to U+0939 and U+0958 to U+095F--note that both their names and their pronunciation are 'ka', 'kha' etc. Looking at the Devanagari chart, one also notes that these 45 can be graphically categorized:

a.
27 have a vertical bar on the right (U+0916 to U+0918, U+091A, U+091C to U+091E, U+0923 to U+0925, U+0927 to U+092A, U+092C to U+092F, U+0932, U+0935 to U+0938, U+0959 to U+095B and U+095F);

b. four have a vertical bar in the middle (U+0915, U+092B, U+0958 and U+095E);

c. 14 'round bottom consonants' do not have a vertical bar (U+0919, U+091B, U+091F to U+0922, U+0926, U+0930 to U+0931, U+0933 to U+0934, U+0939 and U+095C to U+095D).

3. When a syllable consists of more than one consonant, e.g., 'sta', the first consonant loses its implicit vowel and is called a dead consonant. These combinations are called conjunct consonants or a conjunct cluster. Consonants in a cluster can be rendered in several ways:

a. With a character beneath them called a virama or halant (U+094D) that signals removal of the implicit vowel. This is considered the least desirable typographically.

b. For consonants with a right vertical bar, the vertical bar signaling the implicit vowel is removed and the left portion, called the half form, is shifted right against the other consonant; see Unicode 2.0 table 6-10 for some half forms. These are also called fused forms.

c. For consonants with a center bar, the part to the right of the bar is removed and the remainder is shifted against the other consonant; see table 6-10 again.

d. The two or more consonants are combined into a ligature which may or may not resemble its parts; see 2.0 table 6-11 for sample ligatures. Note that in 24 of these 37 sample ligatures the first consonant has a round bottom, so methods b and c above cannot be used. (The three items in the table with nothing in the second column are consonant + vowel sign combinations where the position of the vowel sign is abnormal.)

e. One consonant is transformed to a combining character and superimposed above or below another character. This occurs mainly with the consonant RA.
When RA is the first consonant in a conjunct cluster it generally becomes a hook called reph above the last consonant. See 2.0 R2 on page 6-39. When RA is the last consonant of a conjunct it appears as a combining character below the preceding consonant. The shape of this combining character depends on whether the preceding consonant has a vertical bar or not: if there is a vertical bar the combining RA appears as a prop on the left side of the preceding consonant (see R7 on page 6-40); otherwise the combining RA appears as a flattened inverted 'v' beneath the round bottom consonant (see R6 on same page).

f. When a conjunct cluster has more than two consonants, more than one of the above may be relevant.

4. When encoded in Unicode, all but the last consonant of a conjunct cluster have a virama following them, leaving rendering of the conjunct to display software. This is unambiguous, consistent with ISCII practice and greatly reduces the number of codes needed. Better Devanagari fonts can have several hundred elements with which software builds the various conjuncts.

THE PROBLEM

The above phonetic encoding works well except:

1. When the writing system is not phonetic--orthographic exceptions occur; or

2. When one wants to represent other than plain text--for example, a treatise *about* Devanagari that shows constituent parts of certain combinations separately.

In Unicode 2.0 there are two occurrences of the first condition.

First, the fourth example in figure 6-12 claims to represent a dead RA followed by the independent (nominal) form of the vocalic R (U+090B), which together are rendered as the independent form of the vowel sign R with the superscript (reph) form of RA. It actually represents the nominal form of RA followed by the dependent vowel sign R (U+0943), which usually displays as a small 'c' beneath a consonant but in this very rare case displays as shown--the independent vowel vocalic R with the superscript RA (reph) above it.
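As a rough illustration of the encoding convention in item 4 of the script description, and of the two readings of the fourth example in figure 6-12, the following Python sketch builds the code point sequences involved. The helper names `encode_conjunct` and `show` are made up for this note; only the code points come from the Devanagari chart. The sketch deals with the stored text only and says nothing about how display software draws it.

```python
VIRAMA = "\u094D"                # DEVANAGARI SIGN VIRAMA (halant)
RA = "\u0930"                    # DEVANAGARI LETTER RA
VOCALIC_R = "\u090B"             # independent vowel VOCALIC R
VOWEL_SIGN_VOCALIC_R = "\u0943"  # dependent vowel sign VOCALIC R


def encode_conjunct(consonants):
    """Hypothetical helper: a virama follows all but the last consonant."""
    return VIRAMA.join(consonants)


def show(text):
    """Hypothetical helper: list the code points as 'U+XXXX' strings."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# 'sta' as a conjunct: dead SA (U+0938 + virama) followed by TA (U+0924).
sta = encode_conjunct(["\u0938", "\u0924"])
print(show(sta))

# Figure 6-12, fourth example, as the figure labels it:
# dead RA followed by the independent vowel vocalic R.
as_labelled = RA + VIRAMA + VOCALIC_R

# The same example as this proposal reads it:
# nominal RA followed by the dependent vowel sign vocalic R.
as_proposed = RA + VOWEL_SIGN_VOCALIC_R

print(show(as_labelled))
print(show(as_proposed))
```

The two readings differ in the stored text (three code points against two) even though the displayed result is meant to be the same, which is why the choice matters for searching and sorting.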
The second is our friend the 'eyelash RA'. In 3.e above I say that dead RA as the first character of a conjunct *generally* appears as a hook (reph) above the next consonant. There is an exception: in Marathi, when the combination of dead RA followed by YA or HA is due to grammatical inflection, the RA displays as a smile or eyelash before the YA or HA; when the dead RA and YA or HA are semantically part of the word in all contexts, the dead RA displays as the usual RAsup--reph.

I quote from 'Marathi Language Course' by H. M. Lambert (Oxford University Press, 1943, p.122):

"The letter [RA virama] immediately preceding another consonant is combined with it in two different ways. When the combined letter consisting of [RA] and [YA] occurs in all forms of the word, as in the examples in Lesson 5, [RA] is written as [reph], for example ... When however, [RA] is followed immediately by [YA] as a result of the inflection of a word which does not have the combined consonant [RA]-[YA] in all its forms, the sign for [RA] is written before, not above the consonant which follows."

Later in her 'Introduction to the Devanagari Script' (Oxford University Press, 1953, p.125):

"A special form of reph is written to represent [RA virama] preceding [YA] or [HA] in Marathi words. This form of reph is sometimes written with [HA] in Sanskrit loan words, but it is not usual to write this form in a Sanskrit text ... The writing of reph in this form with [YA] is usually restricted to Marathi words in which the combination of [RA] and [YA] arises from grammatical processes."

See R5 (2.0 page 6-39) for current Unicode procedures (use of ZWJ) and display of the eyelash RA. For rendering, some special procedure is needed, since even knowing the text was Marathi would not suffice: one would need to consult a Marathi dictionary to try to learn whether or not the word's root contained the RA + virama + (YA or HA) combination.
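The two spellings that R5 distinguishes can be written out as code point sequences. The Python sketch below is only an illustration under one reading of R5, namely that a ZWJ after the dead RA asks for the eyelash form and the YA or HA then follows. The function name `ra_cluster` and its `eyelash` flag are hypothetical. The caller has to supply the flag, because nothing in the text itself tells software which form a given Marathi word needs.

```python
RA = "\u0930"      # DEVANAGARI LETTER RA
YA = "\u092F"      # DEVANAGARI LETTER YA
HA = "\u0939"      # DEVANAGARI LETTER HA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def ra_cluster(following, eyelash):
    """Hypothetical helper: spell dead RA + consonant for reph or eyelash.

    Assumes the Unicode 2.0 R5 convention as read in this proposal:
    RA + virama gives reph, RA + virama + ZWJ gives the eyelash form.
    """
    if eyelash and following not in (YA, HA):
        # The eyelash form is described here only before YA or HA.
        raise ValueError("eyelash RA is only described before YA or HA")
    dead_ra = RA + VIRAMA
    return dead_ra + (ZWJ if eyelash else "") + following


reph_ya = ra_cluster(YA, eyelash=False)     # RA, virama, YA
eyelash_ya = ra_cluster(YA, eyelash=True)   # RA, virama, ZWJ, YA
eyelash_ha = ra_cluster(HA, eyelash=True)   # RA, virama, ZWJ, HA

for text in (reph_ya, eyelash_ya, eyelash_ha):
    print(" ".join(f"U+{ord(ch):04X}" for ch in text))
```

The only difference between the reph and eyelash spellings is the invisible ZWJ, so the distinction has to be made by whoever enters the text, not by the rendering software.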
The second problem arises when a text is *about* Devanagari characters rather than plain text in Devanagari. Figure 6-13 shows use of ZWNJ to get the dead form of a consonant rather than the customary fused form shown as the third example of figure 6-12. I believe KAn + Virama + space would be an easier way to get the same result--one that avoided embedding a control character in the text string. Figure 6-14 shows use of ZWJ to get the half form of a consonant when rendering would otherwise get the special fused form from the third example in figure 6-12--ZWJ functioning as an invisible letter to which the dead KA connects. In figure 6-15 ZWJ is used to get an independent half form of GA, but the figure doesn't show what follows the ZWJ, so one is left wondering what the next character is.

Glenn Adams has said that, like figure 6-15, rule R5 shows how to evoke the eyelash RA as an independent half form only. His logic is impeccable: an eyelash RA cannot connect to the invisible character ZWJ and the YA or HA which presumably would follow it. But this would leave Unicode in the anomalous situation of having a way to evoke the rare occurrence of an independent eyelash RA, but no means to produce the more common use of eyelash RA as part of a conjunct cluster.

This approach has two difficulties:

1. Frequently a letter can have more than one half form. Especially for round bottom consonants the half form depends on the preceding or following letter, so there is no one independent half form and getting the desired half form by use of ZWJ is unpredictable.

2. In R5 (2.0 page 6-39) ZWJ is assigned to signal that the eyelash RA is wanted, which means there is no way to force software to create the independent RAsup (reph).

PROPOSED SOLUTIONS

1. It is proposed that R5 be changed to use RRA (U+0931) and virama before YA or HA, instead of RA (U+0930), virama and ZWJ, for the eyelash RA, and that it note the practice of earlier versions of Unicode.
The text might read:

"If the dead consonant RRA(d) precedes YA or HA, then the half-consonant form RRA(h), known as eyelash RA, is used. This glyph is commonly used in writing Marathi. Use of this convention follows ISCII practice; earlier versions of Unicode used ZWJ to signal the eyelash RA."

The RRA(h) should be added to the section on notation.

2. I would also move the fourth example in figure 9-3 to near the end of table 9-2 and show it as formed from [RA] and vowel sign vocalic R. The text relating to figure 9-3 would need alteration too.

3. In Unicode 1.0 the chart for Devanagari included more than one nominal form of several letters and numerals, not just consonants. In draft 3.0, page 176, the last sentence of the paragraph beginning "Some Devanagari letters ..." says: "In certain cases, however, more than one nominal form is depicted for a single character, where a common stylistic alternate of a nominal form exists." Since the chart no longer depicts alternate forms, the sentence should be omitted.

4. In the charts of review draft 3.0, page 63, the note at 0931 might be changed to show its use for transcribing the Kannada, Telugu and Malayalam RRA in Devanagari too.

ALTERNATE SOLUTIONS

The above solution for the eyelash RA was recommended by Michael Everson--I initially pointed out that the present convention had problems. Lloyd Anderson has suggested defining a separate code because the character is not phonetically a RRA. Either solution (using RRA or a new code) would require careful reading of documentation and special attention during creation of sorting software. Using RRA has the virtue of matching ISCII.

Mark Davis has suggested that any dead RRA be displayed as an eyelash RA, not just those followed by YA and HA. It is my understanding that when the Dravidian (not just Tamil) languages are given in Devanagari, Indian practice is to display dead RRA with its dot and virama unless the dead RA is followed by RA or HA.
(I have no idea how many Southern Indians would read their language in Devanagari script.) For compatibility Mark Davis also wanted to retain the old ZWJ usage as a less desirable option. I think freeing the ZWJ to represent only RAsup (reph) would be a preferable capability.

The more general topic of whether Unicode should provide for expressing parts of letters, as in typographic and calligraphic texts, and, if so, how to do so, I leave for others on another day.

Thank you for your time and attention.

Regards,
Jim Agenbroad ( jage@LOC.gov )

The above are purely personal opinions, not necessarily the official views of any government or any agency of any.

Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.