L2/14-281

Title: Rationale for Atomic Encoding of Murmured Resonants in Newa
Author: Ken Whistler
Date: October 27, 2014
Action: For consideration by the UTC

Summary

This document provides a rationale for proposing the atomic encoding of the 6 murmured (breathy voiced) resonants in the Newa script used for the writing of the Nepal Bhasa language. The "Background Analysis" section cited below presents a longer analysis I wrote of the situation for the modern orthography preferred by the community of Nepal Bhasa speakers consulted recently by SEI and reported on in L2/14-253. In the interest of getting something on the record for the UTC I have basically just cited that analysis (from an email discussion thread on October 14, 2014) as is.

The 6 atomic characters proposed for murmured resonants include one more than cited below, for nyha, a relatively rare sound -- but the analysis does not depend on how many points of articulation are involved. The argument can be exemplified with just a single point of articulation, e.g. the bilabial murmured nasal /mʱ/.

========================================================================

Rationales: Against and For

The argument *against* an atomic encoding for the murmured resonants for Nepal Bhasa essentially makes the following points:

1. They are not necessary, because they are represented in the writing system with conjunct forms -- and the way conjuncts are represented in the Indic model is with sequences.

2. If they were encoded, there would be ambiguities of representation, with two ways to represent the same thing in the Newa script.

3. User-expressed preferences don't matter. We have analogous situations, even in Devanagari, where an entity taught and thought of as a single element (e.g. ksha) is formally displayed by a conjunct, and that conjunct, consistent with the Indic model, is represented by a sequence, instead of being atomically encoded. Doing otherwise for the Newa script would "break the model" for Indic.

-------------------------------------------------------------------------

The argument *for* an atomic encoding for the murmured resonants for Nepal Bhasa essentially makes the following points:

1. There are (at least) two orthographies involved here: one traditional orthography used for the representation of Sanskrit in Newa (~ traditionally, in English, "Newari"), and a second, somewhat reformed orthography preferred *specifically* for modern use for the representation of the Nepal Bhasa language in Nepal -- a non-Indo-European, Tibeto-Burman language.

2. Nepal Bhasa has murmured resonants which Sanskrit does not, and *all* of the written orthographies for Nepal Bhasa (Latin, Devanagari, and Newa) have some innovations to deal with them. (See the background analysis section below.)

3. Although the written forms preferred in the Newa orthography for Nepal Bhasa are quite obviously based on conjunct letters involving the ha plus the resonant letter, they have been *reanalyzed* as individual letters by the primary stakeholders in the modern orthography, as reported by L2/14-253.

4. The addition of 6 atomic characters for Nepal Bhasa does not actually destabilize the representation of historical (or even newly created) Sanskrit data in the Newa script, because that data would simply not use those 6 atomic characters. All of the characters needed for that Sanskrit text are fully provided for in the encoding, and all Sanskrit conjuncts would simply be represented by sequences, consistent with Sanskrit practice for other Indic scripts.

5. No automatic equivalences between (or other similar sequences), and the atomic characters for Nepal Bhasa are posited, proposed (or needed) for the encoding, so there is no *formal* normalization equivalence issue here at all. The only potential issue here is a potential glyphic confusability issue for the rendered text. That kind of problem is actually a common issue in many scripts, and cannot really be considered model-breaking.

6. All encodings of repertoires of characters for scripts in the Unicode Standard are ultimately engineering constructs that seek a balance between multiple criteria, and in some cases where there are multiple orthographies supported by a script, there can be cross-cutting criteria that may affect decisions about the "best" design for the script. The Newa script seems to be such a case.

7. There are very strong political linguistic factors involved in this particular case, involving community ethnic identity and other concerns. The atomic encoding of the murmured resonants has become a specifically politicized identification issue for the users of the script, and this kind of factor cannot be ignored when deciding on the encoding. Failing to take such expressed concerns into account when deciding on a character encoding has the potential for blocking or delaying the encoding -- not a desirable outcome for any of the stakeholders involved.

8. The position stating that murmured resonants *must* be represented by conjunct sequences instead of atomic characters is viewed by the Nepal Bhasa stakeholders as specifically forcing Nepal Bhasa into a "Sanskrit box". That position is a non-starter for that community, for a variety of reasons.

9. Given the depth of feeling involved in #7 and #8, in my opinion, an argument that appeals to Indic model purity for the Newa script doesn't cut it, if the adaptations required to the script to cover the additional encoding requirements do not actually cause implementation problems. I contend that in this case they do not, because they simply add 6 more atomic consonants to the encoding -- atomic consonants which can simply be excluded from further conjunct formation rules, and which thus behave essentially no differently than other simple letters in the script.

10. Encoding the 6 murmured resonants atomically in Newa provides a distinct advantage for collation for the script. Their collation is irrelevant to Sanskrit data.
But for Nepal Bhasa, they simply get their primary weight from their alphabetic order default in the charts, which means that even non-tailored collations will do a better job. And having them as atomically encoded consonants eliminates the need for adding contractions of conjunct sequences , etc., to the tailorings for Nepal Bhasa.

------------------------------------------------------------------------

Scenarios and Other Observations

Ultimately, deciding between the two approaches on the table -- to either encode a set of murmured resonants as atomic characters or to require representation of them with conjunct sequences -- should involve walking through the relevant scenarios for input, text representation, and rendering in detail, for both Sanskrit and Nepal Bhasa language data. I cannot work through all of the permutations exhaustively here. However, a few other observations come to mind.

1. The written orthography of Nepal Bhasa in the *Devanagari* script already involves significant departures from the normal sequences for syllables that one would expect for Sanskrit data. These innovations for Nepal Bhasa can be inferred by examining the syllable chart in part 3 of the document "Towards a Consensus Encoding of Newa," where the innovations for the representation of syllables involving long diphthongs are mirrored in both Devanagari and in the Newa script. Given those non-Sanskrit innovations in the writing system(s), having additional consonants encoded atomically in Newa for the murmured resonants is not actually that much of a stretch.

2. The claim that atomic encoding of murmured resonants breaks the model, and that they *must* be encoded as Sanskrit-like conjunct sequences because of their formal appearance, essentially amounts to a claim that the same Nepal Bhasa content *must* be represented with identically analogous sequences in both the Devanagari script and the Newa script.
The argument goes that because Devanagari does not have atomically encoded characters for conjuncts involving sequences of ha with other consonants, Newa (or other Indic scripts) cannot have them either. But I consider this a fallacious argument -- it ultimately devolves to an assertion that the transliteration relationship between Nepal Bhasa written in Devanagari and Nepal Bhasa written in Newa must be strictly one-to-one. But that kind of constraint on transliteration between scripts is not something we would enforce between *other* scripts -- notably between the Latin script and Newa, for example. (Again, see the Background Analysis section below.)

========================================================================

Background Analysis

[Extracted from email discussion thread, October 14, 2014.]

This is really not a whole lot different than the typical kinds of graphological problems which arise when a language of a very different type and phonology adopts a well-established writing system that has a long history of adaptation to a different kind of language. See, e.g., the Japanese adaptation of Chinese writing, or the extension of Arabic to (try to) work for languages of West Africa (with tones and prenasalized consonants, etc., etc.).

For Newar, we have the problem of how to represent the murmured nasals and other liquids. Without firsthand knowledge of Newar, but judging by reading between the lines of what Kashinath Tamot wrote, there are 4 murmured phonemes which need to be distinctly represented here. The *phonological* units are:

/m̤/, /n̤/, /ŋ̤/, /r̤ ~ l̤/

Here I won’t worry about the precise articulation of the “r”, nor do I really care about the exact conditions of the alternations for the “r” or “l”. I’m using the two dot below IPA murmur diacritic.
You could just as well indicate this with a modifier letter:

/mʱ/, /nʱ/, /ŋʱ/, /rʱ ~ lʱ/

Or, depending on your analysis and possibly the examination of detailed acoustic data, you might just as well use the modifier letter prepended:

/ʱm/, /ʱn/, /ʱŋ/, /ʱr ~ ʱl/

So phonologically, that is what people want to represent.

O.k., so *next* we deal with how this gets represented in Latin transliterations. Again, judging from Tamot’s discussion, it appears that in older materials, these are represented by *5* digraphs (or trigraphs), in Sanskrit rank order:

hng, hn, hm, hr, hl

And that the SIL linguists turned those around and used the h after the consonant:

ngh, nh, mh, rh, lh

Presumably that is part of a practical Latin orthography which just treats the “h” as parallel to the “h” for the breathy voiced gh, dh, bh, for consistency. And then you also have the Indologist convention of writing the velar nasal as ṅ, so you have alternative representations using that with an h on either side.

O.k., now the next problem is how these are represented in the orthography of a Brahmi-derived script, for which there aren’t going to be any atomic graphemic units historically available, unlike the graphemic units for gh, dh, bh. So what do people do? They innovate using the mechanisms they have, and end up doing the combinatory equivalent of the 3 different phonological transcriptions above.

1. Write with notionally atomic conjunct forms.

2. Write with explicit sequences of , which may render as a ligated conjunct.

3. Write with explicit sequences of , which may render as a ligated conjunct.

And I’m guessing that this is then also informed by what Devanagari supports for these kinds of special conjuncts, in particular: hma, hya, hla, hva, … with the ha as the C1 of the sequences.

Looking back through Anshuman’s proposal carefully, he has a complete set of conjuncts for all of these – actually for 7, since he includes ny and ṇ, as well.
Apparently, the preferred current forms use the ligated conjuncts, but *interpret* them as equivalent to mh, nh, etc., perhaps because of the inconsistent transliteration/Romanization conventions. This is part of why Anshuman is cautioning on p. 6 about fonts having to render two sequences as the same conjunct.

If you look at the 5 relevant forms in the bottom row of Figure 28 (p. 48) of L2/12-003r, they are using the ligated conjuncts in Newar, but then transliterate them as (mostly with unligated half-form renderings) in Devanagari and as ngha, nha, mha, rha, lha in the Romanization.

As a further fly in the ointment here, note that that same online chart for Newar has separate entries for ksa, tra, and gyan in the row just above the murmured resonants, but nobody is holding forth for having to represent *those* with atomic consonant letters, presumably!

I think we are in a transitional orthographic situation, where a community of modern users has decided that the “letters” for the murmured resonants really need to match their phonological status as unit phonemes. In other words, there is a very strong feeling here that the aksaras must contain a single consonant unit, even if the graphical form ultimately is derived from ligatures in the history of the script. Note that their claim on atomicity is stronger than for ksa, tra, and gyan, as presumably in those cases, the phonological analysis is going to support a treatment as a sequence of phonemes, and the aberrant aspect is the visual unanalyzability of the written forms.

It seems to me that for Prachalit, there would be no particular need to posit atomic ngha, nha, mha, rha, and lha – you could get by as Anshuman recommends just using whichever sequence appears to underlie the relevant or combination. But for application to modern writing of Newar, I tend to agree that the case for independent letterhood is pretty strong.

And I think you end up with potential problems, whichever direction you take.

1. Non-atomic. Force use of sequences in all cases. The problem here comes from the issue noted on p. 6 of Anshuman’s document. Formally, the preferred ligated conjuncts are , etc., but they may then be interpreted as equivalent to forms written in Devanagari as and Romanized as ngha. And formally the ligature in the Newar script would look different than what people expect for modern writing. That can lead to representation and interpretation conundrums. It could lead to confusion in input methods, perhaps.

2. Atomic. Code 5 atomic units. The problem here comes from having alternative, non-canonically equivalent spellings for the “same thing”. On the other hand, this solution has a lot going for it, in that it gives you a well-behaved atomic encoding for the 5 entities (4 phonemes), which should work reliably even while you end up arguing in text about how other sequences of versus should be displayed or Romanized.

I suspect that the second scenario would turn out to be somewhat more robust for users. It is also more tractable to collapse together different distinct representations for the purposes of comparison and searching than it would be to try to disentangle possibly overlapping representations that might or might not be treated as “the same”, which might be where you’d end up with the first scenario.

The second scenario also has the obvious advantage that it is the explicit preference of a vocal user community. That doesn’t always mean it is the best choice, technically. But in this case, I don’t think the non-atomic approach is a completely clean and obviously better solution, given the actual nature of the innovations in the modern script usage.
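
As an illustration of the point about collapsing distinct representations for comparison and searching, here is a minimal sketch in Python. It is not part of the proposal. It uses symbolic tokens in place of real characters, so no code points are assumed, and both the two conjunct orderings it folds (ha first, or ha last) and the name fold are hypothetical choices made only for this illustration.

```python
# Symbolic tokens stand in for characters; no real code points are assumed.
HA = "HA"
VIRAMA = "VIRAMA"

# Plain resonants mapped to hypothetical atomic murmured letters.
ATOMIC = {"NGA": "NGHA", "NA": "NHA", "MA": "MHA", "RA": "RHA", "LA": "LHA"}


def fold(tokens):
    """Fold either assumed conjunct ordering onto the atomic token.

    Intended only for comparison and searching; the stored text is
    left untouched by the caller.
    """
    out = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if len(window) == 3 and window[1] == VIRAMA:
            first, last = window[0], window[2]
            if first == HA and last in ATOMIC:
                # Assumed ordering with ha as the first consonant.
                out.append(ATOMIC[last])
                i += 3
                continue
            if last == HA and first in ATOMIC:
                # Assumed ordering with ha as the second consonant.
                out.append(ATOMIC[first])
                i += 3
                continue
        out.append(tokens[i])
        i += 1
    return out


# Three spellings of the "same thing" compare equal after folding.
a = fold(["HA", "VIRAMA", "MA", "AA"])
b = fold(["MA", "VIRAMA", "HA", "AA"])
c = fold(["MHA", "AA"])
print(a == b == c)
```

Folding in this direction is a many-to-one mapping. Going the other way, from an atomic token back to a sequence, would require choosing one of the orderings, which is roughly the disentangling problem described for the first scenario.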