L2/06-092
Date: March 24, 2006
Source: Ken Whistler
Title: Discussion of AA and TALL AA Disunification for Myanmar
Reference: L2/06-077R, Proposal to encode seven additional Myanmar characters in the UCS (= WG2 N3043R)

During discussion on the unicore list of L2/06-077R, I responded to a contribution by Michael Everson as follows. (lightly edited)

In particular, I think the UTC will benefit from the explicit analysis I provide below regarding the alternative approaches that could be taken regarding AA and TALL AA for Myanmar, as well as the list of advantages and disadvantages for each approach.

================= edited email of 3/24/06 follows =================

>>
>> Scenario 1: AA and TALL AA are distinct characters. Karen and Mon
>> users use TALL AA all the time. Burmese users select AA or TALL AA as
>> required. No problems. WYSIWYG. Burmese spell-checkers would want to
>> flag AA in some contexts as a spelling error.
>>
>> Scenario 2: AA and TALL AA are distinct characters. Karen and Mon
>> users use TALL AA all the time. Burmese users use AA all the time but
>> sometimes it looks like TALL AA. OCR software is confused because you
>> can't always tell what language a word is in. IDN fails for minority
>> languages because Burmese will trump them, and TALL AA will be banned
>> to prevent spoofing.

O.k., that is part of the way to clarifying the issues.

>>> >As others have pointed out, there are problems with this, but
>>> >Unicode already has thousands of duplicate characters.
>
>>
>> The actual problems are that you would NEVER know by looking at a
>> piece of Burmese-script text how it was encoded. That is unacceptable.

That is a gross overgeneralization.

First of all, in Lee's suggested scenario, the ambiguity would only occur in the contexts where Burmese uses a context-sensitive rule to render the -aa as a tall-aa glyph. In contexts where Burmese uses the short-aa glyph, there would be no ambiguity.

Second, your claim amounts to a much, much stronger claim about character encoding than the standard actually makes. In effect, you are claiming that under no circumstances can two distinct character sequences render the same -- otherwise, the representation of that text would not be unambiguous (without additional information) based on mere visual presentation. That would be "unacceptable", by your reckoning.

However, we *know* that there are myriad examples in the standard in which we cannot know, from rendered appearance alone, how the text is encoded. Ideally, the standard has tried to create normalization forms that eliminate such distinctions in normalized text. But even so, there are many instances of identical rendering which do *not* normalize to a single string. Some of these occur in Indic scripts (independent vowels as units versus independent vowels + matras; alternate ordering of multiple vowels, including issues in Myanmar). Others occur in other contexts.

Thus, for example, I have no guarantee whatsoever that U+FEA1 ARABIC LETTER HAH ISOLATED FORM and U+062D ARABIC LETTER HAH will render distinctly, so that I could tell from the visual form which character was in a backing store. These are not canonically equivalent, so normalization isn't going to save my butt here. I can only fall back on conventions for text representation that tell me that ordinary Arabic text should use U+062D and not U+FEA1.

Worse, some Arabic letter allographs of characters specifically encoded for minority languages (or even majority languages -- just other than Arabic) may have forms in certain contexts that are visually non-distinct from *other* Arabic characters in those contexts. That is no different from the situation that Lee has suggested as an alternative approach for the Myanmar AA/TALL AA issue.

Now rampant visual confusability is certainly a bad thing. Normalization is in place in part to provide an algorithmic answer to the worst instances of it.
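
The Arabic example above can be checked mechanically. The following is a minimal sketch, assuming Python's standard unicodedata module; the helper name same_after is illustrative only. It shows that the canonical normalization forms (NFC and NFD) leave U+062D and U+FEA1 distinct. The compatibility form NFKC, which the discussion above does not rely on, does fold the isolated form onto U+062D.

```python
import unicodedata

HAH = "\u062D"           # ARABIC LETTER HAH
HAH_ISOLATED = "\uFEA1"  # ARABIC LETTER HAH ISOLATED FORM


def same_after(form, s1, s2):
    """Return True if s1 and s2 are identical after normalizing to `form`.

    `same_after` is a hypothetical helper name used only for this sketch.
    """
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)


# Canonical normalization does not unify the two characters.
canonical_equal_nfc = same_after("NFC", HAH, HAH_ISOLATED)
canonical_equal_nfd = same_after("NFD", HAH, HAH_ISOLATED)

# Compatibility normalization does, because U+FEA1 carries an
# <isolated> compatibility decomposition to U+062D.
compat_equal = same_after("NFKC", HAH, HAH_ISOLATED)

for label, value in [
    ("NFC equal", canonical_equal_nfc),
    ("NFD equal", canonical_equal_nfd),
    ("NFKC equal", compat_equal),
]:
    print(f"{label}: {value}")
```

Whether a given protocol applies NFKC at all is a separate question; text that is only canonically normalized keeps the two code points apart even where a font renders them identically.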
And protocols may well put in place constraints on the use of various characters because of their confusability. But you are way overboard here in attempting to make a case for a particular disunification based on a general principle which *isn't* an absolute principle in the first place. And contra your doomsday IDN scenarios, you should note that nobody has advocated (to my knowledge) ruling out bunches of Arabic letters needed for minority languages simply because there is overlap with other Arabic letters in their allographic forms.

>> I think we need to talk this one out. I have suggested that it is
>> necessary to disunify TALL AA from AA in order to support all of the
>> languages we are discussing.

And I (and I believe Lee, as well) think that is manifestly incorrect. It might be *easier* to implement various processes using one model or the other. But necessity has yet to be demonstrated.

>> If that's the case,

It isn't, I maintain. And if it isn't, then none of your conclusions follow:

>> then even conformant
>> existing Burmese text will have to be transcoded. There seems to be
>> less conformant text than non-conformant, so it isn't clear how
>> important that is. But if even conformant Burmese text will have to
>> be transcoded, then we are much freer to implement the solution
>> proposed.

O.k., alternative analysis time.

In the following analysis, I am going to adopt the following conventions, to make things simpler to draw.

  a   : the character AA (currently encoded as U+102C MYANMAR VOWEL SIGN AA)
  A   : the character TALL-AA (not currently encoded, but proposed to be)
  {a} : the nominal AA glyph, displayed in cell U+102C in TUS 4.0, p. 524
  {A} : the tall AA glyph, shown in Everson's document, WG2 N3043R

The visible facts of the writing systems are:

  Burmese:      x{a}  y{A}
  S'gaw Karen:  x{A}  y{A}

In other words, for Burmese, in some contexts x__ we see {a}, and in other contexts y__ we see {A}, whereas for S'gaw Karen, in those same contexts, we always see {A}.
(For now, I don't care exactly what those contexts are, and I think we can all stipulate that those are the visible facts about the writing systems for these two languages.)

The phonological facts, ignoring any fine details, are:

  Burmese:      ({a}, {A}) --> /a:/
  S'gaw Karen:  {A} --> /a:/

which is a jargonistic way of saying that we are basically dealing with a single vowel, in either case -- the same vowel structurally, in fact. And in S'gaw Karen the representation is one-to-one, but in Burmese, the users of the writing system are used to having two forms, {a} or {A}, representing the same vowel unit (the same phoneme), depending on the written context.

Moving on to character encoding considerations, I will outline first the current situation (Scenario 0), and then move on to Scenario 1 (as advocated by Michael Everson), and Scenario 2 (as mentioned by Lee Collins).

Scenario 0

This is the situation with the encoding as it stands.

                Encoding   Rendering rule   Reading rule
  Burmese:      xa ya      a --> {a}/x__    {a} --> a
                           a --> {A}/y__    {A} --> a
  S'gaw Karen:  xa ya      a --> {A}        {A} --> a

Scenario 1

This results from Everson's advocated solution, disunifying based on glyph shape.

                Encoding   Rendering rule   Reading rule
  Burmese:      xa yA      a --> {a}        {a} --> a
                           A --> {A}        {A} --> A
  S'gaw Karen:  xA yA      A --> {A}        {A} --> A

Scenario 2

This results from Collins's suggested alternative, introducing a separate character for TALL-AA for S'gaw Karen.

                Encoding   Rendering rule   Reading rule
  Burmese:      xa ya      a --> {a}/x__    {a} --> a
                           a --> {A}/y__    {A} --> a/Burmese
  S'gaw Karen:  xA yA      A --> {A}        {A} --> A/S'gaw Karen

O.k., now if you are with me to this point, I think this *finally* manages to be an explicit statement of what the current situation is and what two possible alternative approaches would entail.

Now let me try to reinterpret and extend the claims that Michael has been making about the drawbacks of Scenario 0 and the "necessity" of Scenario 1.
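
The three scenario tables can also be restated as a toy model. What follows is a minimal sketch under stated assumptions, not an implementation of any real rendering engine: the contexts are reduced to the abstract labels "x" and "y", glyphs are written as the strings "{a}" and "{A}", and the names ENCODING, render, readings, and depends_on_language are all hypothetical. One further assumption is flagged in the code: the tables do not say what happens if the character a turns up in S'gaw Karen text under Scenario 2, so the sketch simply applies the Burmese contextual rule there.

```python
LANGS = ("burmese", "karen")
CONTEXTS = ("x", "y")

# Spelling of the vowel in each context, per (scenario, language),
# copied from the "Encoding" column of the three tables.
ENCODING = {
    (0, "burmese"): {"x": "a", "y": "a"},
    (0, "karen"):   {"x": "a", "y": "a"},
    (1, "burmese"): {"x": "a", "y": "A"},
    (1, "karen"):   {"x": "A", "y": "A"},
    (2, "burmese"): {"x": "a", "y": "a"},
    (2, "karen"):   {"x": "A", "y": "A"},
}


def render(scenario, language, context, char):
    """Rendering rule: map an encoded character to a glyph."""
    if char == "A":
        return "{A}"                      # TALL-AA is always one-to-one
    if scenario == 1:
        return "{a}"                      # Scenario 1: a is always short
    if scenario == 0 and language == "karen":
        return "{A}"                      # Scenario 0: language-specific rule
    # Burmese contextual shaping. ASSUMPTION: also applied to a stray "a"
    # in Karen text under Scenario 2, which the tables leave unspecified.
    return "{A}" if context == "y" else "{a}"


def readings(scenario, glyph):
    """Reading rule: which characters could lie behind a glyph when the
    language of the text is not known?"""
    found = set()
    for (s, lang), spelling in ENCODING.items():
        if s == scenario:
            for ctx, ch in spelling.items():
                if render(s, lang, ctx, ch) == glyph:
                    found.add(ch)
    return found


def depends_on_language(scenario):
    """True if the same character in the same context renders differently
    depending on the language."""
    chars = {ch for (s, _), sp in ENCODING.items() if s == scenario
             for ch in sp.values()}
    return any(
        len({render(scenario, lang, ctx, ch) for lang in LANGS}) > 1
        for ch in chars for ctx in CONTEXTS
    )


for s in (0, 1, 2):
    print(s, sorted(readings(s, "{A}")), depends_on_language(s))
```

In this model, Scenario 0 is the only one whose rendering rule needs to know the language, and Scenario 2 is the only one in which the {A} glyph has two possible readings.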

Scenario 0

Advantages

It is the current situation, and requires no change to the standard.

It *can* represent both Burmese and S'gaw Karen text correctly.

It is completely unambiguous -- the reading rules always result in identifying the correct character.

The single structural vowel /a:/ is represented consistently with a single character, making searching, sorting, and similar processing marginally easier than in the other alternatives.

S'gaw Karen can be represented with the "Burmese" character and vice versa -- in other words, there is no need to worry about which character has been used for which language. This is likely to result in fewer spelling errors in data, for example.

Disadvantages

The rendering rules involve a language-specific difference for the single character encoded. This makes it very difficult to implement single font support for Burmese *and* minority languages, without positing smart fonts and language-tagged text.

Burmese typists are already used to the concept of AA and TALL-AA being separate keys and "letters" from the point of view of input, and may find it confusing to adapt to a system that assumes both represent the same "character" in the text.

The Irish national body, Myanmar IT experts, and some number of others are deeply unhappy with the current encoding, and express strong opinions that it is inadequate.

Scenario 1

Advantages

It is completely WYSIWYG, because it encodes the glyph forms as characters.

It *can* represent both Burmese and S'gaw Karen text correctly.

It is completely unambiguous -- the reading rules always result in identifying the correct character.

It enables simple support of a monofont solution for Burmese and minority languages of Myanmar, without positing smart fonts and language-tagged text (for this issue, at least).

It accords with already established proclivities of Myanmar keyboard typists.

Choosing it would make the Irish national body, Myanmar IT experts, and some number of others happy.

Disadvantages

It isn't the current standard, which means we all have to fight about it.

It invalidates the representation of any Burmese text which is currently conformant to the standard, requiring the transcoding of an indefinite (and probably indeterminable) amount of existing data if it is to be conformant after the change proposed.

It disunifies the representation of the /a:/ vowel in Burmese, requiring adjustment of searching and sorting algorithms, etc., to handle the two characters as equivalent.

It disunifies the representation of the /a:/ vowel between Burmese and some minority languages, spelling text glyphically instead of by logical (phonological) units.

It changes the rendering rule for Burmese, in particular, which will require revising any existing fonts or engines. This disadvantage is partially offset by the fact that the resulting fonts and/or engines required are *simpler* than what Scenario 0 requires.

Scenario 2

Advantages

For Burmese, it is identical to the current standard, which means we don't have to fight about it.

It *can* represent both Burmese and S'gaw Karen text correctly.

Because it is identical to the current standard for Burmese, it means that any existing conformant text need not be transcoded, nor do any existing conformant fonts or rendering engines need to be reworked *for Burmese*.

The addition for S'gaw Karen requires no fancy rendering. It is simply one-to-one, as for Scenario 1.

It enables simple *extension* of any existing solution for Burmese to also support S'gaw Karen (for the TALL-AA, at least, which is all I'm evaluating here). This is offset of course by the drawback that the overall rendering is no simpler than that already required for Burmese in the current encoding, so you still need to support contextual shaping for AA.

Disadvantages

It introduces a new character, which we have to fight about.

It disunifies the /a:/ vowel in S'gaw Karen from that in Burmese, introducing possible opportunities for spelling errors and other processing issues.

It means typing conventions for Burmese and for minority languages will end up being distinct.

It results in graphic ambiguity, because a reading rule needs to distinguish between a TALL-AA glyph in a Burmese context (= U+102C, contextually shaped) and a TALL-AA glyph in a S'gaw Karen context (= some newly encoded character).

The visual ambiguity raises the specter of security concerns, whether justified or not, which if focussed upon could be used to disqualify the scenario as an option.

O.k., I have discussed it, and I think made a start towards a much, much more explicit list of advantages and disadvantages to the various credible alternatives.

Now I would like the list to focus on trying to weigh the advantages and disadvantages (and to qualify them or discover and specify others) rationally. And I would like the discussion to veer away from the manifestly unproductive pattern it has been in, of repetitive pooh-poohing of unclearly specified problems and of exaggerated and unsubstantiated claims being made in the absence of analysis.

Also, I would appreciate it if the *more* difficult issue of encoding the four medials, and the issue of encoding an explicit asat, were to be laid out in comparable, *explicit* detail for the UTC to evaluate.