L2/03-027

Title: Response re UTC Agenda Item: Scope of Enclosing Marks
Date: January 30, 2003
Source: Ken Whistler

> Michael Everson wrote:
> > At 18:00 -0800 2003-01-29, Mark Davis wrote:
> >> 1. Given the sequence "1" + grapheme_joiner + "2" + enclosing_circle,
> >> should the circle enclose the three previous characters or only the "2"?
> >
> > All three.
> >
> >> 2. Given the sequence KA + VIRAMA + DDHA, should the circle enclose the
> >> three previous characters or only the DDHA?
> >
> > Definitely all three.

...

And Markus asked:

> Does this suggest that a grapheme cluster should be defined as
>
> gc1 := ( | ) (Mn-)*
> grapheme cluster := gc1 [( | VIRAMA) [gc1]]
>
> and then Me encircles a grapheme cluster?
>
> Or
>
> grapheme cluster := ( | ) Mn*
>
> and then Me encircles
>
> encirclable := grapheme cluster [( | VIRAMA) [grapheme cluster]]

To which I would respond, definitely the latter.

The UTC already has been round and round on this, and deliberately
backed off the attempt at the more complex definition of grapheme
cluster, precisely because it runs afoul of so many nasty
script-specific and orthography-specific problems, and was getting
waaay more complicated than the value it was supposed to provide
could justify. In a way, it was becoming *anti*elucidative, and would
just have engendered more confusion if we were to come out with such
a definition.

The UTC has already decided on the definition:

  grapheme cluster := ( | ) Mn*

and that is what is in the current draft of UAX #29 (with a short
exception list added to Mn, to deal with canonical equivalence
issues). I *don't* think we should reopen that question and start
revisiting all the attendant arguments about it.

What the editorial committee is seeking clarification about is the
behavior of enclosing combining marks.

Interpretation A

This interpretation is that the behavior of enclosing combining marks
was tied to the discussion of grapheme clusters.
We knew that the intention of enclosing combining marks was to enclose
such constructs as stacks of base characters plus nonspacing marks, or
to enclose Korean syllables. During the discussion of grapheme
clusters, it was assumed that enclosing combining marks would apply to
grapheme clusters, since those obvious cases were defined as grapheme
clusters.

But under this interpretation, when the UTC backed off the complex
definition of grapheme clusters to the simpler, core definition, then
the behavior of enclosing combining marks, tied to grapheme clusters,
should be correspondingly narrowed as well.

Interpretation B

This interpretation basically assumes that enclosing combining marks
should apply to orthographic units (however defined). In this case,
changing the definition of grapheme cluster would be irrelevant, and
should not affect what we claim an enclosing combining mark should
surround. This, I believe, is the interpretation that Michael Everson
has advocated above.

Under this interpretation, then, the UTC would need to define
something like Markus' "encirclable" above, in order for us to make a
coherent recommendation in the text of the standard as to what an
enclosing combining mark actually applies to.

The two interpretations have different implications for the immediate
problem of what the editors include in the Unicode Standard, Version
4.0 text.

Under Interpretation A we would remove two Devanagari examples
inherited from the Unicode 3.2 documentation (and possibly substitute
a less problematical example in their stead).

Under Interpretation B we would leave the examples as is (and possibly
try to find a place to document whatever rule for "encirclable" the
UTC decided upon).

But clearly making this decision on our own would have exceeded the
mandate of the editorial committee; it is a decision with non-trivial
technical consequences which the UTC should decide.

My *personal* opinion here is that the UTC should go with
Interpretation A.
This is, I believe, the far less problematical option. It would not
open the difficult issue of how, exactly, to define the "target of
encircling", if it is to be something more than the default grapheme
cluster. It follows the past precedent set by the committee in
deciding to just encode a bunch more encircled numbers for
compatibility with JIS X 0213 and the DPRK standard and get on with
our lives, rather than insist that such characters had to be mapped to
some as-yet-ill-defined mechanism using the combining enclosing
circle. And it would not put the committee on record as requiring the
more complex analysis which Interpretation B would require for "target
of encircling" -- analysis that would be likely to be ignored or be
haphazardly applied in real implementations, as it is running over the
edge of what we can reasonably require of conformant rendering of the
Unicode Standard.

If we mandate

  encirclable := grapheme cluster [( | VIRAMA) [grapheme cluster]]

that would immediately raise the question as to whether it should not
actually be:

  encirclable := grapheme cluster (( | VIRAMA) [grapheme cluster])?

that is, a grapheme cluster followed by zero *or more* instances of
linked grapheme clusters. This would follow directly from Michael's
argument that an Indic conjunct can't be broken apart into pieces and
encircled one element at a time -- but that means that encirclable
would have to apply to 3-part conjuncts as well as to 2-part
conjuncts, and so on.

And if <1, CGJ, 2, combining-circle> results in a circled '12', then
why would not <1, CGJ, 2, CGJ, 3, combining-circle> result in a
circled '123'? For that matter, then, what about:

  <1, CGJ, 2, CGJ, 3, CGJ, 4, CGJ, 5, CGJ, 6, CGJ, 7, CGJ, 8, combining-circle>

And you can see where this is heading. Any reasonable implementation
is going to blow chunks at some point.
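
To make the difference concrete, here is a minimal Python sketch of
what an enclosing mark would surround under each reading. It is an
illustration under stated assumptions, not the UTC's or UAX #29's
algorithm: the function names are hypothetical, a "base" is assumed to
be any character whose general category is not a mark, the joiner is
taken to be CGJ (U+034F), only the Devanagari virama (U+094D) is
recognized, a cluster is required after every linker, and the
"zero or more" reading of encirclable is used.

```python
import unicodedata

CGJ = "\u034F"               # COMBINING GRAPHEME JOINER
VIRAMA = "\u094D"            # DEVANAGARI SIGN VIRAMA (assumption: only this one)
LINKERS = {CGJ, VIRAMA}
ENCLOSING_CIRCLE = "\u20DD"  # COMBINING ENCLOSING CIRCLE


def is_mn(ch):
    # Nonspacing mark. CGJ and virama are themselves Mn, so they are
    # excluded here in order to act as linkers between clusters.
    return unicodedata.category(ch) == "Mn" and ch not in LINKERS


def is_base(ch):
    # Assumption: any non-mark character can serve as a base.
    return not unicodedata.category(ch).startswith("M")


def cluster_start(text, end):
    """Start index of a `base Mn*` cluster ending just before `end`, or None."""
    i = end
    while i > 0 and is_mn(text[i - 1]):
        i -= 1
    if i > 0 and is_base(text[i - 1]):
        return i - 1
    return None


def enclosed_target(text, linked):
    """Return the substring that the final enclosing mark (Me) would surround.

    linked=False: only the immediately preceding cluster (Interpretation A).
    linked=True:  that cluster plus any clusters chained to it by a
                  linker (the "encirclable" of Interpretation B).
    """
    end = len(text) - 1
    if end < 0 or unicodedata.category(text[end]) != "Me":
        raise ValueError("text must end with an enclosing mark")
    start = cluster_start(text, end)
    if start is None:
        return ""
    if linked:
        # Walk leftwards over linker + cluster pairs, with no upper bound.
        while start > 0 and text[start - 1] in LINKERS:
            prev = cluster_start(text, start - 1)
            if prev is None:
                break
            start = prev
    return text[start:end]


chain = CGJ.join("12345678") + ENCLOSING_CIRCLE
print(len(enclosed_target(chain, linked=False)))  # 1 code point: "8"
print(len(enclosed_target(chain, linked=True)))   # 15 code points
```

With linked=True nothing in the rule bounds the size of the target;
any limit would have to come from the renderer, not from the
definition.
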
The logical productivity of the concept in general quickly runs into
the realistic limits of complexity that any generic implementation can
be expected to handle in terms of surrounding arbitrarily defined
chunks of text with an appropriately scaled glyph of what might be an
enclosing mark of somewhat arbitrary shape (cf. U+20E0 combining
enclosing circle backslash and U+20E3 combining enclosing keycap).

I think that direction is quickly running off the tracks of what a
*plain text* standard should be mandating in terms of rendering
requirements. And that is why any such recommendation is likely to be
honored in the breach, rather than being widely implemented. The
Unicode Standard already mandates a dizzying array of complexity -- I
don't think adding this is an appropriate addition to be making.

Instead, "encirclable" is precisely the kind of thing that is better
handled via markup systems, which have much better means for marking
exact scope of applicability of some effect like encircling of text
than we have in Unicode plain text. Leaving such effects to
higher-level protocols also gives implementations better choices
regarding what they will and will not have to support for this kind of
rendering, and may result in more consistent behavior for the
then-limited scope of applicability of enclosing marks which would
still remain defined in the standard.

I urge the UTC to go with Interpretation A and not to end up chasing
after another will o' the wisp of rules for textual presentation that
should be left up to other protocols.

--Ken