L2/03-027

Title: Response re UTC Agenda Item: Scope of Enclosing Marks
Date:  January 30, 2003
Source: Ken Whistler

> Michael Everson wrote:
> > At 18:00 -0800 2003-01-29, Mark Davis wrote:
> >> 1. Given the sequence "1" + grapheme_joiner + "2" + enclosing_circle, 
> >> should the circle enclose the three previous characters or only the "2"?
> > 
> > All three.
> > 
> >> 2. Given the sequence KA + VIRAMA + DDHA, should the circle enclose the
> >> three previous characters or only the DDHA?
> > 
> > Definitely all three. ...

And Markus asked:

> 
> Does this suggest that a grapheme cluster should be defined as
> 
> gc1 := (<hangul syllable> | <base>) (Mn-<link>)*
> grapheme cluster := gc1 [(<link> | VIRAMA) [gc1]]
> 
> and then Me encircles a grapheme cluster?
> 
> Or
> 
> grapheme cluster := (<hangul syllable> | <base>) Mn*
> 
> and then Me encircles
> 
> encirclable := grapheme cluster [(<link> | VIRAMA) [grapheme cluster]]

To which I would respond, definitely the latter.

The UTC already has been round and round on this, and deliberately
backed off the attempt at the more complex definition of
grapheme cluster, precisely because it runs afoul of so many
nasty script-specific and orthography-specific problems, and
was getting waaay more complicated than the value it was supposed
to provide could justify. In a way, it was becoming *anti*elucidative,
and would just have engendered more confusion if we were to come
out with such a definition.

The UTC has already decided on the definition:

grapheme cluster := (<hangul syllable> | <base>) Mn*

and that is what is in the current draft of UAX #29 (with a
short exception list added to Mn, to deal with canonical
equivalence issues).
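
As a rough illustration of that core definition (a sketch only, not
the UAX #29 algorithm): the hypothetical helper simple_clusters below
treats every code point that is not General_Category Mn as a base,
which covers a precomposed Hangul syllable but not conjoining jamo
sequences, and it ignores the exception list mentioned above.

```python
import unicodedata


def simple_clusters(text):
    """Split text into <base> Mn* chunks (simplified; see assumptions above)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) == "Mn":
            # A nonspacing mark attaches to the cluster already in progress.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters


# KA + VIRAMA + DDHA: the virama is Mn, so it stays with KA,
# and DDHA starts a second cluster.
print(simple_clusters("\u0915\u094D\u0922"))
```

Under this definition the Devanagari conjunct in question is two
grapheme clusters, not one.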

I *don't* think we should reopen that question and start
revisiting all the attendant arguments about it.

What the editorial committee is seeking clarification about
is the behavior of enclosing combining marks.

Interpretation A 

This interpretation is that the behavior of enclosing combining
marks was tied to the discussion of grapheme clusters. We
knew that the intention of enclosing combining marks was to
enclose such constructs as stacks of base characters plus
nonspacing marks, or to enclose Korean syllables. During the
discussion of grapheme clusters, it was assumed that enclosing
combining marks would apply to grapheme clusters, since those
obvious cases were defined as grapheme clusters. But under this
interpretation, when the UTC backed off the complex definition
of grapheme clusters to the simpler, core definition, then
the behavior of enclosing combining marks, tied to grapheme
clusters, should be correspondingly narrowed as well.

Interpretation B

This interpretation basically assumes that enclosing combining
marks should apply to orthographic units (however defined).
In this case, changing the definition of grapheme cluster would
be irrelevant, and should not affect what we claim an enclosing
combining mark should surround. This, I believe, is the
interpretation that Michael Everson has advocated above.
Under this interpretation, then, the UTC would need to define
something like Markus' "encirclable" above, in order for us to
make a coherent recommendation in the text of the standard
as to what an enclosing combining mark actually applies to.

The two interpretations have different implications for the
immediate problem of what the editors include in the Unicode
Standard, Version 4.0 text. Under Interpretation A we would
remove two Devanagari examples inherited from the Unicode 3.2
documentation (and possibly substitute a less problematical
example in their stead). Under Interpretation B we would
leave the examples as is (and possibly try to find a place
to document whatever rule for "encirclable" the UTC decided
upon). But clearly making this decision on our own would have
exceeded the mandate of the editorial committee; it is a decision
with non-trivial technical consequences which the UTC should
decide.

My *personal* opinion here is that the UTC should go with
Interpretation A. This is, I believe, the far less problematical
option. It would not open the difficult issue of how, exactly,
to define the "target of encircling", if it is to be something
more than the default grapheme cluster. It follows the past
precedent set by the committee in deciding to just encode
a bunch more encircled numbers for compatibility with JIS X 0213
and the DPRK standard and get on with our lives, rather than
insist that such characters had to be mapped to some as-yet-ill-
defined mechanism using the combining enclosing circle. And
it would not put the committee on record as requiring the
more complex analysis which Interpretation B would require for
"target of encircling" -- analysis that would be likely to
be ignored or be haphazardly applied in real implementations,
as it is running over the edge of what we can reasonably require
of conformant rendering of the Unicode Standard.

If we mandate 

encirclable := grapheme cluster [(<link> | VIRAMA) [grapheme cluster]]

that would immediately raise the question as to whether it should
not actually be:

encirclable := grapheme cluster ((<link> | VIRAMA) [grapheme cluster])*

that is, a grapheme cluster followed by zero *or more* instances
of linked grapheme clusters. This would follow directly from
Michael's argument that an Indic conjunct can't be broken apart
into pieces and encircled one element at a time -- but that
means that encirclable would have to apply to 3-part conjuncts
as well as to 2-part conjuncts, and so on.
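
The difference between the two forms can be sketched with ordinary
regular expressions over a one-letter-per-code-point classification.
Everything here is illustrative: the tokens helper and the B/M/L/E
letters are hypothetical, "link" is assumed to mean U+034F or any
character with canonical combining class 9 (Virama), and links are
kept out of the cluster's Mn* so that the grammar is unambiguous.

```python
import re
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def tokens(text):
    """Map each code point to B (base), M (other Mn), L (link), or E (Me)."""
    out = []
    for ch in text:
        if ch == CGJ or unicodedata.combining(ch) == 9:
            out.append("L")  # CGJ or a virama
        elif unicodedata.category(ch) == "Mn":
            out.append("M")
        elif unicodedata.category(ch) == "Me":
            out.append("E")
        else:
            out.append("B")
    return "".join(out)


CLUSTER = r"BM*"
# grapheme cluster [(<link> | VIRAMA) [grapheme cluster]]
AT_MOST_ONE_LINK = re.compile(rf"{CLUSTER}(?:L(?:{CLUSTER})?)?")
# grapheme cluster followed by zero or more linked grapheme clusters
ANY_NUMBER_OF_LINKS = re.compile(rf"{CLUSTER}(?:L(?:{CLUSTER})?)*")

two_part = "\u0915\u094D\u0922"              # KA + VIRAMA + DDHA
three_part = "\u0915\u094D\u0937\u094D\u092E"  # KA + VIRAMA + SSA + VIRAMA + MA

print(bool(AT_MOST_ONE_LINK.fullmatch(tokens(two_part))))       # True
print(bool(AT_MOST_ONE_LINK.fullmatch(tokens(three_part))))     # False
print(bool(ANY_NUMBER_OF_LINKS.fullmatch(tokens(three_part))))  # True
```

Under these assumptions the single-link form cannot cover a 3-part
conjunct as one unit, while the repeated form accepts conjuncts of
any length.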

And if

<1, CGJ, 2, combining-circle>

results in a circled '12', then why would not

<1, CGJ, 2, CGJ, 3, combining-circle>

result in a circled '123'?

For that matter, then, what about:

<1, CGJ, 2, CGJ, 3, CGJ, 4, CGJ, 5, CGJ, 6, CGJ, 7, CGJ, 8, combining-circle>

And you can see where this is heading. Any reasonable implementation is
going to blow chunks at some point. The logical productivity of the
concept in general quickly runs into the realistic limits of
complexity that any generic implementation can be expected to handle
in terms of surrounding arbitrarily defined chunks of text with
an appropriately scaled glyph of what might be an enclosing mark of
somewhat arbitrary shape (cf. U+20E0 combining enclosing circle backslash
and U+20E3 combining enclosing keycap).
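
To make the growth concrete, here is a small sketch of what each
interpretation would hand to a renderer. The function names and the
"A"/"B" switch are hypothetical, "B" is taken in its unbounded reading
(any number of links), and clusters are the simplified <base> Mn*
chunks with no Hangul jamo handling or exception list.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def is_link(ch):
    """Assumed meaning of <link> | VIRAMA: CGJ or combining class 9."""
    return ch == CGJ or unicodedata.combining(ch) == 9


def clusters(text):
    """Split text into simplified <base> Mn* chunks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) == "Mn":
            out[-1] += ch
        else:
            out.append(ch)
    return out


def enclosing_target(text, interpretation):
    """Return the text that a trailing enclosing mark (Me) would surround."""
    if not text or unicodedata.category(text[-1]) != "Me":
        raise ValueError("text must end with an enclosing mark")
    cl = clusters(text[:-1])
    if not cl:
        return ""
    start = len(cl) - 1  # Interpretation A: only the last cluster
    if interpretation == "B":
        # Walk backwards while the previous cluster ends in a link.
        while start > 0 and is_link(cl[start - 1][-1]):
            start -= 1
    return "".join(cl[start:])


chain = CGJ.join("12345678") + "\u20DD"  # <1, CGJ, 2, ..., 8, circle>
print(len(enclosing_target(chain, "A")))  # 1
print(len(enclosing_target(chain, "B")))  # 15
```

Under Interpretation A the target never grows beyond one cluster;
under the unbounded reading of B it grows with the input, and the
enclosing glyph has to be scaled to whatever the author chains
together.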

I think that direction is quickly running off the tracks of what
a *plain text* standard should be mandating in terms of rendering
requirements. And that is why any such recommendation is likely to
be honored in the breach, rather than being widely implemented.
The Unicode Standard already mandates a dizzying array of
complexity -- I don't think this is an appropriate addition
to be making.

Instead, "encirclable" is precisely the kind of thing that is
better handled via markup systems, which have much better
means for marking exact scope of applicability of some effect
like encircling of text than we have in Unicode plain text.
Leaving such effects to higher-level protocols also gives
implementations better choices regarding what they will and will
not have to support for this kind of rendering, and may result
in more consistent behavior for the then-limited scope of
applicability of enclosing marks which would still remain
defined in the standard.

I urge the UTC to go with Interpretation A and not to end up
chasing after another will o' the wisp of rules for textual
presentation that should be left up to other protocols.

--Ken