L2/06-092

Date: March 24, 2006
Source: Ken Whistler
Title: Discussion of AA and TALL AA Disunification for Myanmar


Reference: L2/06-077R, Proposal to encode seven additional Myanmar
  characters in the UCS (= WG2 N3043R)
  

During discussion on the unicore list of L2/06-077R, I responded
to a contribution by Michael Everson as follows (lightly edited).

In particular, I think the UTC will benefit from the explicit
analysis I provide below regarding the alternative approaches
that could be taken regarding AA and TALL AA for Myanmar, as
well as the list of advantages and disadvantages for each approach.

================= edited email of 3/24/06 follows =================

>> 
>> Scenario 1: AA and TALL AA are distinct characters. Karen and Mon 
>> users use TALL AA all the time. Burmese users select AA or TALL AA as 
>> required. No problems. WYSIWYG. Burmese spell-checkers would want to 
>> flag AA in some contexts as a spelling error.
>> 
>> Scenario 2: AA and TALL AA are distinct characters. Karen and Mon 
>> users use TALL AA all the time. Burmese users use AA all the time but 
>> sometimes it looks like TALL AA. OCR software is confused because you 
>> can't always tell what language a word is in. IDN fails for minority 
>> languages because Burmese will trump them, and TALL AA will be banned 
>> to prevent spoofing.


O.k., that is part of the way to clarifying the issues.


>>> >As others have pointed out, there are problems with this, but 
>>> >Unicode already has thousands of duplicate characters.
>
>> 
>> The actual problems are that you would NEVER know by looking at a 
>> piece of Burmese-script text how it was encoded. That is unacceptable.


That is a gross overgeneralization.

First of all, in Lee's suggested scenario, the ambiguity would
only occur in the contexts where Burmese uses a context-sensitive
rule to render the -aa as a tall-aa glyph. In contexts where
Burmese uses the short-aa glyph, there would be no ambiguity.

Second, your claim amounts to a much, much stronger claim about
character encoding than the standard actually makes. In effect,
you are claiming that under no circumstances can two distinct
character sequences render the same -- otherwise, the representation
of that text would not be unambiguous (without additional
information) based on mere visual presentation. That would be
"unacceptable", by your reckoning.

However, we *know* that there are myriad examples in the standard
in which we cannot know, from rendered appearance alone, how
the text is encoded.

Ideally, the standard has tried to create normalization forms that
eliminate such distinctions in normalized text. But even so,
there are many instances of identical rendering which do *not*
normalize to a single string. Some of these occur in Indic
scripts (independent vowels as units versus independent vowels
+ matras; alternate ordering of multiple vowels, including
issues in Myanmar). Others occur in other contexts. Thus, for example,
I have no guarantee whatsoever that U+FEA1 ARABIC LETTER HAH ISOLATED
FORM and U+062D ARABIC LETTER HAH will render distinctly, so that
I could tell from the visual form which character was in a backing
store. These are not canonically equivalent, so normalization isn't
going to save my butt here. I can only fall back on conventions
for text representation that tell me that ordinary Arabic text
should use U+062D and not U+FEA1. Worse, some Arabic letter
allographs of characters specifically encoded for minority
languages (or even majority languages -- just other than Arabic)
may have forms in certain contexts that are visually non-distinct
from *other* Arabic characters in those contexts. That is
no different from the situation that Lee has suggested as an
alternative approach for the Myanmar AA/TALL AA issue.
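
The Arabic example can be checked mechanically. The following is a
minimal sketch using Python's standard unicodedata module; the
variable names are illustrative only. It shows that canonical
normalization (NFC/NFD) keeps U+FEA1 and U+062D distinct. The
compatibility forms (NFKC/NFKD) do fold U+FEA1 onto U+062D, but that
is a compatibility mapping, not canonical equivalence.

```python
import unicodedata

isolated_hah = "\uFEA1"  # ARABIC LETTER HAH ISOLATED FORM
plain_hah = "\u062D"     # ARABIC LETTER HAH

# Canonical normalization leaves the two characters distinct, even
# though a font may well draw them identically.
for form in ("NFC", "NFD"):
    left = unicodedata.normalize(form, isolated_hah)
    right = unicodedata.normalize(form, plain_hah)
    print(form, "equal after normalization:", left == right)

# The link between them is only a compatibility decomposition.
print(unicodedata.decomposition(isolated_hah))
print("NFKC folds:", unicodedata.normalize("NFKC", isolated_hah) == plain_hah)
```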

Now rampant visual confusability is certainly a bad thing.
Normalization is in place in part to provide an algorithmic
answer to the worst instances of it. And protocols may well
put in place constraints on the use of various characters
because of their confusability. But you are way overboard here
in attempting to make a case for a particular disunification
based on a general principle which *isn't* an absolute principle
in the first place.

And contra your doomsday IDN scenarios, you should note that
nobody has advocated (to my knowledge) ruling out bunches
of Arabic letters needed for minority languages simply because
there is overlap with other Arabic letters in their allographic
forms.


>> I think we need to talk this one out. I have suggested that it is 
>> necessary to disunify TALL AA from AA in order to support all of the 
>> languages we are discussing.


And I (and I believe Lee, as well) think that is manifestly incorrect.

It might be *easier* to implement various processes using one model
or the other. But necessity has yet to be demonstrated.


>> If that's the case,


It isn't, I maintain. And if it isn't, then none of your
conclusions follow:


>> then even conformant 
>> existing Burmese text will have to be transcoded. There seems to be 
>> less conformant text than non-conformant, so it isn't clear how 
>> important that is. But if even conformant Burmese text will have to 
>> be transcoded, then we are much freer to implement the solution 
>> proposed.


O.k., alternative analysis time.

In the following analysis, I am going to adopt the following
conventions, to make things simpler to draw.

a : the character AA (currently encoded as U+102C MYANMAR VOWEL SIGN AA)

A : the character TALL-AA (not currently encoded, but proposed to be)

{a} : the nominal AA glyph, displayed in cell U+102C in TUS 4.0, p. 524

{A} : the tall AA glyph, shown in Everson's document, WG2 N3043R


The visible facts of the writing systems are:

Burmese:      x{a}  y{A}

S'gaw Karen:  x{A}  y{A}

In other words, for Burmese, in some contexts x__ we see {a}, and
in other contexts y__ we see {A}, whereas for S'gaw Karen, in
those same contexts, we always see {A}. (For now, I don't care
exactly what those contexts are, and I think we can all stipulate
that those are the visible facts about the writing systems for
these two languages.)


The phonological facts, ignoring any fine details, are:

Burmese:      ({a}, {A}) --> /a:/

S'gaw Karen:  {A}        --> /a:/

which is a jargonistic way of saying that we are basically
dealing with a single vowel, in either case -- the same
vowel structurally, in fact. And in S'gaw Karen the representation
is one-to-one, but in Burmese, the users of the writing system
are used to having two forms, {a} or {A}, representing the
same vowel unit (the same phoneme), depending on the written
context.


Moving on to character encoding considerations, I will outline
first the current situation (Scenario 0), and then move on
to Scenario 1 (as advocated by Michael Everson), and Scenario 2
(as mentioned by Lee Collins).


Scenario 0

This is the situation with the encoding as it stands.

              Encoding   Rendering rule   Reading rule

Burmese:      xa  ya     a --> {a}/x__    {a} --> a
                         a --> {A}/y__    {A} --> a

S'gaw Karen:  xa  ya     a --> {A}        {A} --> a
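
As a rough illustration, the Scenario 0 table can be modelled in a
few lines of Python. This is only a sketch under stated assumptions:
a text is a list of (context, character) pairs, "x" and "y" are the
abstract contexts used above rather than real letters, and the
function names are hypothetical. The point it makes is that the
renderer cannot pick a glyph without knowing the language.

```python
GLYPH_AA = "{a}"
GLYPH_TALL_AA = "{A}"

def render_scenario0(pairs, language):
    """Apply the Scenario 0 rendering rules; the language must be known."""
    glyphs = []
    for context, char in pairs:
        if char != "a":
            raise ValueError("Scenario 0 encodes only the character 'a'")
        if language == "Burmese":
            # a --> {a}/x__ and a --> {A}/y__
            glyph = GLYPH_AA if context == "x" else GLYPH_TALL_AA
        elif language == "S'gaw Karen":
            # a --> {A} in every context
            glyph = GLYPH_TALL_AA
        else:
            raise ValueError(f"no rendering rule for {language!r}")
        glyphs.append(context + glyph)
    return glyphs

def read_scenario0(glyphs):
    """Reading rule: either glyph maps back to the single character 'a'."""
    return [(g[0], "a") for g in glyphs]

text = [("x", "a"), ("y", "a")]
print(render_scenario0(text, "Burmese"))      # ['x{a}', 'y{A}']
print(render_scenario0(text, "S'gaw Karen"))  # ['x{A}', 'y{A}']
```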


Scenario 1

This results from Everson's advocated solution, disunifying
based on glyph shape.

              Encoding   Rendering rule   Reading rule

Burmese:      xa  yA     a --> {a}        {a} --> a
                         A --> {A}        {A} --> A

S'gaw Karen:  xA  yA     A --> {A}        {A} --> A
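
Under the same toy conventions (hypothetical names, "x" and "y" as
abstract contexts, texts as lists of (context, character) pairs),
Scenario 1 reduces to a one-to-one table. A sketch, not a real
rendering engine: no language tag is consulted, and reading the
glyphs back recovers the encoding exactly.

```python
# One-to-one rendering table: a --> {a}, A --> {A}
RENDER = {"a": "{a}", "A": "{A}"}
# The reading rule is simply the inverse table.
READ = {glyph: char for char, glyph in RENDER.items()}

def render_scenario1(pairs):
    """Render without any language or context information."""
    return [context + RENDER[char] for context, char in pairs]

def read_scenario1(glyphs):
    """Recover the encoded characters from the glyphs alone."""
    return [(g[0], READ[g[1:]]) for g in glyphs]

burmese = [("x", "a"), ("y", "A")]
karen = [("x", "A"), ("y", "A")]
for text in (burmese, karen):
    glyphs = render_scenario1(text)
    print(glyphs, read_scenario1(glyphs) == text)
```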


Scenario 2

This results from Collins's suggested alternative, introducing
a separate character for TALL-AA for S'gaw Karen.

              Encoding   Rendering rule   Reading rule

Burmese:      xa  ya     a --> {a}/x__    {a} --> a
                         a --> {A}/y__    {A} --> a/Burmese

S'gaw Karen:  xA  yA     A --> {A}        {A} --> A/S'gaw Karen
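
A comparable sketch for Scenario 2, again with hypothetical names
and the abstract contexts "x" and "y". It assumes, as in the table,
that Burmese text uses only 'a' and S'gaw Karen text uses only 'A'.
It illustrates where the graphic ambiguity arises: two different
encodings produce the same glyph in context y__, so the reading rule
needs the language as an extra input.

```python
def render_scenario2(pairs, language):
    """Scenario 2 rendering: 'a' is contextual in Burmese, 'A' is fixed."""
    glyphs = []
    for context, char in pairs:
        if char == "A" and language == "S'gaw Karen":
            glyph = "{A}"                               # A --> {A}
        elif char == "a" and language == "Burmese":
            glyph = "{a}" if context == "x" else "{A}"  # contextual shaping
        else:
            raise ValueError(f"{char!r} is not modelled for {language}")
        glyphs.append(context + glyph)
    return glyphs

def read_scenario2(glyphs, language):
    """Reading rule: the tall glyph maps to a different character per language."""
    chars = []
    for g in glyphs:
        context, glyph = g[0], g[1:]
        if glyph == "{a}":
            chars.append((context, "a"))
        else:
            chars.append((context, "a" if language == "Burmese" else "A"))
    return chars

burmese = [("y", "a")]
karen = [("y", "A")]
same_glyphs = (render_scenario2(burmese, "Burmese")
               == render_scenario2(karen, "S'gaw Karen"))
print(same_glyphs)  # True: both display as y{A}
```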


O.k., now if you are with me to this point, I think this
*finally* manages to be an explicit statement of what
the current situation is and what two possible alternative
approaches would entail.

Now let me try to reinterpret and extend the claims that Michael
has been making about the drawbacks of Scenario 0 and the
"necessity" of Scenario 1.

Scenario 0

Advantages

   It is the current situation, and requires no change to
   the standard.
   
   It *can* represent both Burmese and S'gaw Karen text
   correctly.
   
   It is completely unambiguous -- the reading rules always
   result in identifying the correct character.
   
   The single structural vowel /a:/ is represented consistently
   with a single character, making searching, sorting, and
   similar processing marginally easier than in the other
   alternatives.
   
   S'gaw Karen can be represented with the "Burmese" character
   and vice versa -- in other words, there is no need to worry
   about which character has been used for which language.
   This is likely to result in fewer spelling errors in data,
   for example.
   
Disadvantages

   The rendering rules involve a language-specific difference
   for the single character encoded. This makes it very difficult
   to implement single font support for Burmese *and* minority
   languages, without positing smart fonts and language tagged
   text.

   Burmese typists are already used to the concept of AA and
   TALL-AA being separate keys and "letters" from the point of
   view of input, and may find it confusing to adapt to a
   system that assumes both represent the same "character" in
   the text.
   
   The Irish national body, Myanmar IT experts, and some numbers
   of others are deeply unhappy with the current encoding,
   and express strong opinions that it is inadequate.
   

Scenario 1

Advantages

   It is completely WYSIWYG, because it encodes the glyph forms
   as characters.
   
   It *can* represent both Burmese and S'gaw Karen text
   correctly.
   
   It is completely unambiguous -- the reading rules always
   result in identifying the correct character.
   
   It enables simple support of a monofont solution for Burmese
   and minority languages of Myanmar, without positing smart
   fonts and language tagged text. (for this issue, at least)
   
   It accords with already established proclivities of Myanmar
   keyboard typists.
   
   Choosing it would make the Irish national body, Myanmar IT
   experts, and some numbers of others happy.
   
Disadvantages

   It isn't the current standard, which means we all have to fight
   about it.
   
   It invalidates the representation of any Burmese text which
   is currently conformant to the standard, requiring the transcoding
   of an indefinite (and probably indeterminable) amount of existing
   data if it is to be conformant after the change proposed.
   
   It disunifies the representation of the /a:/ vowel in Burmese,
   requiring adjustment of searching and sorting algorithms, etc.,
   to handle the two characters as equivalent.
   
   It disunifies the representation of the /a:/ vowel between
   Burmese and some minority languages, spelling text glyphically
   instead of by logical (phonological) units. 
   
   It changes the rendering rule for Burmese, in particular, which
   will require revising any existing fonts or engines. This
   disadvantage is partially offset by the fact that the resulting
   fonts and/or engines required are *simpler* than what
   Scenario 0 requires.
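
Two of the Scenario 1 costs listed above, transcoding existing
Burmese data and folding the two characters together for searching,
can be sketched as follows. This is a toy model only: the strings
use the placeholder symbols "x", "y", "a" and "A" from the
conventions above rather than real code points, the function names
are hypothetical, and a real transcoder would need the actual
Burmese contextual rule, which is not specified here.

```python
def transcode_burmese_0_to_1(text):
    """Re-encode Scenario 0 Burmese text as Scenario 1 text.

    Every 'a' that follows the placeholder context 'y' becomes 'A'.
    """
    out = []
    previous = ""
    for ch in text:
        if ch == "a" and previous == "y":
            out.append("A")
        else:
            out.append(ch)
        previous = ch
    return "".join(out)

def search_key(text):
    """Fold TALL-AA onto AA so both spellings of the vowel match."""
    return text.replace("A", "a")

old = "xa ya"
new = transcode_burmese_0_to_1(old)
print(new)                                 # xa yA
print(search_key(new) == search_key(old))  # True
```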
   
   
Scenario 2

Advantages

   For Burmese, it is identical to the current standard, which means
   we don't have to fight about it.
   
   It *can* represent both Burmese and S'gaw Karen text
   correctly.
   
   Because it is identical to the current standard for Burmese, it
   means that any existing conformant text need not be transcoded,
   nor do any existing conformant fonts or rendering engines need
   to be reworked *for Burmese*.
   
   The addition for S'gaw Karen requires no fancy rendering. It
   is simply one-to-one, as for Scenario 1.
   
   It enables simple *extension* of any existing solution for
   Burmese to also support S'gaw Karen (for the TALL-AA, at least, which
   is all I'm evaluating here). This is offset of course by the
   drawback that the overall rendering is no simpler than that
   already required for Burmese in the current encoding, so you
   still need to support contextual shaping for AA.
   
Disadvantages

   It introduces a new character, which we have to fight about.
   
   It disunifies the /a:/ vowel in S'gaw Karen from that
   in Burmese, introducing possible opportunities for spelling
   errors and other processing issues.
   
   It means typing conventions for Burmese and for minority
   languages will end up being distinct.
   
   It results in graphic ambiguity, because a reading rule needs
   to distinguish between a TALL-AA glyph in a Burmese context
   (=U+102C, contextually shaped) and a TALL-AA glyph in a
   S'gaw Karen context (= some newly encoded character).
   
   The visual ambiguity raises the specter of security concerns,
   whether justified or not, which if focussed upon could be
   used to disqualify the scenario as an option. 
   
   
O.k., I have discussed it, and I think made a start towards a much,
much more explicit list of advantages and disadvantages to the
various credible alternatives.

Now I would like the list to focus on trying to weigh the
advantages and disadvantages (and to qualify them or discover
and specify others) rationally.

And I would like the discussion to veer away from the manifestly
unproductive pattern it has been in of repetitive pooh-poohing
of unclearly specified problems and exaggerated and unsubstantiated
claims being made in the absence of analysis.

Also, I would appreciate it if the *more* difficult issue of
encoding the four medials, and the issue of encoding an explicit
asat were to be laid out in comparable, *explicit* detail for
the UTC to evaluate.

