L2/14-281


Title: Rationale for Atomic Encoding of Murmured Resonants in Newa
Author: Ken Whistler
Date: October 27, 2014
Action: For consideration by the UTC

Summary

This document provides a rationale for proposing the
atomic encoding of the 6 murmured (breathy voiced) resonants in
the Newa script used for the writing of the Nepal Bhasa language.

The "Background Analysis" section cited below presents a longer
analysis I wrote of the situation for the modern orthography preferred
by the community of Nepal Bhasa speakers consulted recently by SEI and
reported on in L2/14-253. In the interest of getting something on
the record for the UTC I have basically just cited that analysis
(from an email discussion thread on October 14, 2014) as is.
The 6 atomic characters proposed for murmured resonants include
one more than cited below, for nyha, a relatively rare sound -- but
the analysis does not depend on how many points of articulation
are involved. The argument can be exemplified with just a single
point of articulation, e.g. the bilabial murmured nasal /mʱ/.

========================================================================

Rationales: Against and For

The argument *against* an atomic encoding for the murmured resonants
for Nepal Bhasa essentially makes the following points:

1. They are not necessary, because they are represented in the writing
system with conjunct forms -- and the way conjuncts are represented in
the Indic model is with a <C1, virama, C2> sequence.

2. If they were encoded, there would be ambiguities of representation,
with two ways to represent the same thing in the Newa script.

3. User-expressed preferences don't matter. We have analogous situations,
even in Devanagari, where an entity taught and thought of as a single
element (e.g. ksha) is formally displayed by a conjunct, and that
conjunct, consistent with the Indic model, is represented by a
<C1, virama, C2> sequence, instead of being atomically encoded. Doing
otherwise for the Newa script would "break the model" for Indic.

-------------------------------------------------------------------------

The argument *for* an atomic encoding for the murmured resonants
for Nepal Bhasa essentially makes the following points:

1. There are (at least) two orthographies involved here: one traditional
orthography used for the representation of Sanskrit in Newa (~ traditionally,
in English, "Newari"), and a second, somewhat reformed orthography preferred
*specifically* for modern use for the representation of the Nepal Bhasa
language in Nepal -- a non-Indo-European, Tibeto-Burman language.

2. Nepal Bhasa has murmured resonants which Sanskrit does not, and
*all* of the written orthographies for Nepal Bhasa (Latin, Devanagari,
and Newa) have some innovations to deal with them. (See the background
analysis section below.)

3. Although the written forms preferred in the Newa orthography for Nepal
Bhasa are quite obviously based on conjunct letters involving the
ha plus the resonant letter, they have been *reanalyzed* as individual
letters by the primary stakeholders in the modern orthography, as reported
by L2/14-253.

4. The addition of 6 atomic characters for Nepal Bhasa does not actually
destabilize the representation of historical (or even newly created) Sanskrit 
data in the Newa script, because that data would simply not use those 6
atomic characters. All of the characters needed for that Sanskrit text
are fully provided for in the encoding, and all Sanskrit conjuncts would
simply be represented by <C1, virama, C2> sequences, consistent with
Sanskrit practice for other Indic scripts.

5. No automatic equivalences between <ha, virama, ma> (or other similar
sequences), and the atomic characters for Nepal Bhasa are posited, proposed (or
needed) for the encoding, so there is no *formal* normalization equivalence
issue here at all. The only potential issue here is a potential glyphic
confusability issue for the rendered text. That kind of problem is actually
a common issue in many scripts, and cannot really be considered model-breaking.

6. All encodings of repertoires of characters for scripts in the Unicode
Standard are ultimately engineering constructs that seek a balance between
multiple criteria, and in some cases where there are multiple orthographies
supported by a script, there can be cross-cutting criteria that may
affect decisions about the "best" design for the script. The Newa script
seems to be such a case.

7. There are very strong political linguistic factors involved in this particular
case, involving community ethnic identity and other concerns. The atomic
encoding of the murmured resonants has become a specifically politicized
identification issue for the users of the script, and this kind of factor
cannot be ignored when deciding on the encoding. Failing to take such
expressed concerns into account when deciding on a character encoding has
the potential for blocking or delaying the encoding -- not a desirable
outcome for any of the stakeholders involved.

8. The position stating that murmured resonants *must* be represented by
conjunct sequences instead of atomic characters is viewed by the Nepal
Bhasa stakeholders as specifically forcing Nepal Bhasa into a "Sanskrit box". That
position is a non-starter for that community, for a variety of reasons.

9. Given the depth of feeling involved in #7 and #8, in my opinion, an
argument that appeals to Indic model purity for the Newa script doesn't
cut it, if the adaptations required to the script to cover the
additional encoding requirements do not actually cause implementation
problems. I contend that in this case they do not, because they simply
add 6 more atomic consonants to the encoding -- atomic consonants which
can simply be excluded from further conjunct formation rules, and which
thus behave essentially no differently than other simple letters in
the script.

10. Encoding the 6 murmured resonants atomically in Newa provides a distinct
advantage for collation for the script. Their collation is irrelevant
to Sanskrit data. But for Nepal Bhasa, they simply get their primary
weight from their alphabetic order default in the charts, which means
that even non-tailored collations will do a better job. And having them
as atomically encoded consonants eliminates the need for adding
contractions of conjunct sequences <ha, virama, ma>, etc., to the
tailorings for Nepal Bhasa.
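
The collation point in item 10 can be made concrete with a small sketch.
This is only an illustration in Python, under stated assumptions: the
letter names are symbolic stand-ins rather than code points, the chart
order is an assumed excerpt in which each murmured letter directly follows
its plain counterpart, and default_key and tailored_key are hypothetical
helpers, not part of any collation library.

```python
# Assumed excerpt of a chart order; symbolic names, not real code points.
CHART_ORDER = ["NGA", "NGHA", "NA", "NHA", "MA", "MHA",
               "RA", "RHA", "LA", "LHA", "HA", "VIRAMA"]
WEIGHT = {name: i for i, name in enumerate(CHART_ORDER)}


def default_key(word):
    """Untailored key: one primary weight per encoded unit, in chart order."""
    return [WEIGHT[unit] for unit in word]


# Contraction needed only when the murmured resonant is spelled as a sequence.
CONTRACTIONS = {("HA", "VIRAMA", "MA"): WEIGHT["MHA"]}


def tailored_key(word):
    """Key with contractions: a listed 3-unit sequence gets a single weight."""
    key, i = [], 0
    while i < len(word):
        tri = tuple(word[i:i + 3])
        if tri in CONTRACTIONS:
            key.append(CONTRACTIONS[tri])
            i += 3
        else:
            key.append(WEIGHT[word[i]])
            i += 1
    return key


# The same three words, spelled with an atomic MHA and with a conjunct sequence.
atomic_words = [["RA", "LA"], ["MHA", "LA"], ["MA", "LA"]]
sequence_words = [["RA", "LA"], ["HA", "VIRAMA", "MA", "LA"], ["MA", "LA"]]

sorted_atomic = sorted(atomic_words, key=default_key)
sorted_seq_default = sorted(sequence_words, key=default_key)
sorted_seq_tailored = sorted(sequence_words, key=tailored_key)
```

With the atomic spelling, the untailored key already places the mha word
between the ma word and the ra word. With the sequence spelling, the same
word sorts under ha until a contraction is added to the tailoring.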

------------------------------------------------------------------------

Scenarios and Other Observations

Ultimately, deciding between the two approaches on the table -- to either
encode a set of murmured resonants as atomic characters or to require
representation of them with conjunct sequences -- should involve walking
through the relevant scenarios for input, text representation, and
rendering in detail, for both Sanskrit and Nepal Bhasa language data.
I cannot work through all of the permutations exhaustively here.
However, a few other observations come to mind.

1. The written orthography of Nepal Bhasa in the *Devanagari* script
already involves significant departures from the normal sequences
for syllables that one would expect for Sanskrit data. These innovations
for Nepal Bhasa can be inferred by examining the syllable chart
in part 3 of the document "Towards a Consensus Encoding of Newa,"
where the innovations for the representation of syllables involving
long diphthongs are mirrored in both Devanagari and in the Newa script.
Given those non-Sanskrit innovations in the writing system(s), having
additional consonants encoded atomically in Newa for the murmured resonants
is not actually that much of a stretch.

2. The claim that atomic encoding of murmured resonants breaks
the model, and that they *must* be encoded as Sanskrit-like
conjunct sequences because of their formal appearance, essentially
amounts to a claim that the same Nepal Bhasa content *must*
be represented with identically analogous sequences in both
the Devanagari script and the Newa script. The argument goes that because 
Devanagari does not have atomically encoded characters for conjuncts involving
sequences of ha with other consonants, therefore Newa (or other
Indic scripts) also cannot have them. But I consider this a fallacious
argument -- it ultimately devolves to an assertion that the
transliteration relationship between Nepal Bhasa written in
Devanagari and Nepal Bhasa written in Newa must be strictly one-to-one.
But that kind of constraint on transliteration between scripts is
not something we would enforce between *other* scripts -- notably
between the Latin script and Newa, for example. (Again, see the
Background Analysis section below.)

========================================================================

Background Analysis

[Extracted from email discussion thread, October 14, 2014.]

This is really not a whole lot different than the typical kinds of graphological
problems which arise when a language of a very different type
and phonology adopts a well-established writing system that
has a long history of adaptation to a different kind of language.
See, e.g., the Japanese adaptation of Chinese writing, or the
extension of Arabic to (try to) work for languages of West Africa
(with tones and prenasalized consonants, etc., etc.).

For Newar, we have the problem of how to represent the murmured
nasals and liquids. Without firsthand knowledge of Newar, but
judging by reading between the lines of what Kashinath Tamot wrote,
there are 4 murmured phonemes which need to be distinctly represented
here.

The *phonological* units are:

/m̤/, /n̤/, /ŋ̤/, /r̤ ~ l̤/

Where I won’t worry about the precise articulation of the “r”,
nor do I really care about the exact conditions of the alternations
for the “r” or “l”. I’m using the two dot below IPA murmur diacritic.
You could just as well indicate this with a modifier letter:

/mʱ/, /nʱ/, /ŋʱ/, /rʱ ~ lʱ/

Or, depending on your analysis and possibly the examination of detailed
acoustic data, you might use the modifier letter prepended, just as well:

/ʱm/, /ʱn/, /ʱŋ/, /ʱr ~ ʱl/

So phonologically, that is what people want to represent.
O.k., so *next* we deal with how this gets represented in
Latin transliterations.

Again, judging from Tamot’s discussion, it appears that in older materials,
these are represented by *5* digraphs (or trigraphs), Sanskrit rank order:

hng, hn, hm, hr, hl

And that the SIL linguists turned those around and used the h after
the consonant:

ngh, nh, mh, rh, lh

Presumably that is part of a practical Latin orthography which just treats
the “h” as parallel to the “h” for the breathy voiced gh, dh, bh, for
consistency.

And then you also have the Indologist convention of writing the velar
nasal as ṅ, so you have alternative representations using that with
an h on either side.

O.k., now the next problem is how these are represented in the orthography
of a Brahmi-derived script, for which there aren’t going to be any
atomic graphemic units historically available, unlike the graphemic units for gh, dh, bh.

So what do people do? They innovate using the mechanisms they have,
and end up doing the combinatory equivalent of the 3 different
phonological transcriptions above.

1. Write with notionally atomic conjunct forms.
2. Write with explicit sequences of <C + H>, which may render as a ligated conjunct.
3. Write with explicit sequences of <H + C>, which may render as a ligated conjunct.

And I’m guessing that this is then also informed by what Devanagari supports
for these kinds of special conjuncts, in particular:

hma, hya, hla, hva, … with the ha as the C1 of the sequences.

Looking back through Anshuman’s proposal carefully, he has a complete
set of conjuncts for all of these – actually for 7, since he includes ny and ṇ, as well.

Apparently, the preferred current forms use the <H + C> ligated conjuncts, but
*interpret* them as equivalent to mh, nh, etc., perhaps because of the inconsistent
transliteration/Romanization conventions. This is part of why Anshuman is
cautioning on p. 6 about fonts having to render two sequences as the same
conjunct.

If you look at the 5 relevant forms in the bottom row of Figure 28
(p. 48) of L2/12-003r, they are using the <H+C> ligated conjuncts in Newar,
but then transliterate them as <C+H> (mostly with unligated half-form renderings)
in Devanagari and as ngha, nha, mha, rha, lha in the Romanization.

As a further fly in the ointment here, note that that same online chart for
Newar has separate entries for ksa, tra, and gyan in the row just above
the murmured resonants, but nobody is holding forth for having to
represent *those* with atomic consonant letters, presumably!

I think we are in a transitional orthographic situation, where a community
of modern users has decided that the “letters” for the murmured resonants
really need to match their phonological status as unit phonemes. In other
words, there is a very strong feeling here that the aksaras must contain a
single consonant unit, even if the graphical form ultimately is derived from <H+C>
ligatures in the history of the script. Note that their claim on atomicity is
stronger than for ksa, tra, and gyan, as presumably in those cases, the
phonological analysis is going to support a treatment as a sequence of
phonemes, and the aberrant aspect is the visual unanalyzability of
the written forms.

It seems to me that for Prachalit, there would be no particular need to
posit atomic ngha, nha, mha, rha, and lha – you could get by as
Anshuman recommends just using whichever sequence appears to
underlie the relevant <H+C> or <C+H> combination.

But for application to modern writing of Newar, I tend to agree that
the case for independent letterhood is pretty strong.

And I think you end up with potential problems, whichever direction you
take.

1. Non-atomic.

Force use of sequences in all cases. The problem here comes from the
issue noted on p. 6 of Anshuman’s document. Formally, the preferred
ligated conjuncts are <HA, virama, NGA>, etc., but they may then
be interpreted as equivalent to forms written in Devanagari as
<NGA, virama, HA> and Romanized as ngha. And formally the
<NGA, virama, HA> ligature in the Newar script would look different
than what people expect for modern writing. That can lead to
representation and interpretation conundrums. It could lead to
confusion in input methods, perhaps.

2. Atomic.

Code 5 atomic units. The problem here comes from having
alternative, non-canonically equivalent spellings for the “same thing”.
On the other hand, this solution has a lot going for it, in that it
gives you a well-behaved atomic encoding for the 5 entities
(4 phonemes), which should work reliably even while you end
up arguing in text about how other sequences of <HA, virama, NGA>
versus <NGA, virama, HA> should be displayed or Romanized.

I suspect that the second scenario would turn out to be
somewhat more robust for users. It is also more tractable
to collapse together different distinct representations for
the purposes of comparison and searching than it would be
to try to disentangle possibly overlapping representations
that might or might not be treated as “the same”, which might
be where you’d end up with the first scenario.
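
A rough sketch of what "collapsing together" could look like for searching,
again in Python and again only under assumptions: the names are symbolic
stand-ins for characters, the FOLD table and the search_key and matches
helpers are hypothetical, and the choice to fold only the <HA, virama, C>
sequences (and not <C, virama, HA>) is an illustrative policy decision,
not something settled here.

```python
# Hypothetical folding table: conjunct sequence -> atomic murmured resonant.
# Symbolic names only; no real code points are implied.
FOLD = {
    ("HA", "VIRAMA", "NGA"): "NGHA",
    ("HA", "VIRAMA", "NA"): "NHA",
    ("HA", "VIRAMA", "MA"): "MHA",
    ("HA", "VIRAMA", "RA"): "RHA",
    ("HA", "VIRAMA", "LA"): "LHA",
}


def search_key(units):
    """Return a folded copy for comparison; the stored text is not changed."""
    out, i = [], 0
    while i < len(units):
        tri = tuple(units[i:i + 3])
        if tri in FOLD:
            out.append(FOLD[tri])
            i += 3
        else:
            out.append(units[i])
            i += 1
    return tuple(out)


def matches(a, b):
    """Loose match: two spellings are equal if their folded keys are equal."""
    return search_key(a) == search_key(b)


atomic_spelling = ["MHA", "LA"]
sequence_spelling = ["HA", "VIRAMA", "MA", "LA"]
other_order_spelling = ["MA", "VIRAMA", "HA", "LA"]

same = matches(atomic_spelling, sequence_spelling)
different = matches(atomic_spelling, other_order_spelling)
```

The fold goes in one direction, from sequences onto the atomic letters, so
the atomic spelling serves as a stable target for matching. Whether the
<C, virama, HA> order should also fold is left as a separate decision that
can be made per application.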

The second scenario also has the obvious advantage that it
is the explicit preference of a vocal user community. That
doesn’t always mean it is the best choice, technically. But
in this case, I don’t think the non-atomic approach is a
completely clean and obviously better solution, given the
actual nature of the innovations in the modern script
usage.