RE: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

Date: Wed Sep 18 2002 - 09:24:51 EDT

On 09/18/2002 03:48:41 AM Marco Cimarosti wrote:

>Using the COMBINING DOUBLE INVERTED BREVE doesn't make things much better:
> t U+0361 s U+0307
>Still, <U+0361> only applies to <t>,

I'm not sure why you say that. < t, 0361, s > should render with the
inverted breve spanning both the t and the s. And 0361 is definitely
preferable (except perhaps certain FE legacy contexts) to FE20 + FE21.

>and <U+0307> only applies to <s>.

Here's where the ambiguity lies: what is the interaction between a
double-wide combining mark and a single-wide combining mark? Given the
sequence < t, 0361, s, 0307 >, should the dot appear over the center of the
inverted breve, over the right side of the inverted breve, or should the
dot be over the s with the inverted breve above that? While section 2.6
doesn't shed any light on this, section 7.9 does: "double diacritics always
bind more loosely than all other non-spacing marks except U+0345 iota
subscript... In rendering, the double diacritic will float above other
diacritics (excluding surrounding diacritics)..." Thus, given the sequence
< t, 0361, s, 0307 >, the dot will appear over the s, and the inverted
breve will span the t and the s-dot. (Figure 7-6 illustrates the

>Perhaps, a viable approach could be using the COMBINING GRAPHEME JOINER
>(to turn <ts> into a single 'grapheme'), and then use regular combining marks
>(as opposed to the "double" clones):
> t U+034F s U+0311 U+0307

No, that is not an appropriate solution. From Unicode 3.2 (UTR#28):

<quote clause=3.9>
Formally speaking, combining marks apply to the preceding grapheme cluster.
In most cases, this is the same as applying to the preceding base
character. However, in two circumstances there is a difference:

      Hangul syllables
      Enclosing combining marks

...where elements are linked by a Grapheme_Link or combining grapheme
joiner, non-enclosing combining marks only apply to the last base

Thus, given the sequence < t, 034F, s, 0311, 0307 >, the last two combining
marks apply only to the s. Note that if this were otherwise, there would be
an issue (yet another) in relation to equivalence and normalisation: the
sequences < t, 034F, s, 0311 > and < t, 0361, s > would be identical in
appearance (and used to mean the same thing), but they would not be
canonically equivalent and would be distinguished after normalisation.
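The lack of canonical equivalence is easy to check mechanically. What follows is only a minimal sketch, assuming Python's standard unicodedata module (which tracks a more recent version of the Unicode data than 3.2); the helper names code_points and canonically_equivalent are my own, not anything defined by the standard.

```python
import unicodedata

# The two sequences discussed above (variable names are illustrative only).
cgj_seq = "t\u034Fs\u0311"   # < t, 034F, s, 0311 >
double_seq = "t\u0361s"      # < t, 0361, s >


def code_points(text):
    """Return the code points of text as 4-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # Show what each normalisation form does to both sequences.
    for form in ("NFC", "NFD"):
        print(form,
              code_points(unicodedata.normalize(form, cgj_seq)),
              code_points(unicodedata.normalize(form, double_seq)))
    # Canonical combining classes of the marks involved.
    for mark in ("\u034F", "\u0311", "\u0361", "\u0307"):
        print(f"{ord(mark):04X}", unicodedata.name(mark),
              unicodedata.combining(mark))
    print(canonically_equivalent(cgj_seq, double_seq))
```

Whatever a renderer chose to draw, normalisation would keep the two sequences distinct, so a search or comparison would treat them as different strings.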

Thus, I don't think there is any current solution for the ts + inverted
breve + dot above used in LOC's non-Slavic Romanisations. I don't know if
LOC needs a solution for this. If so, the only thing I can think of
would be to propose a new character, either COMBINING DOUBLE INVERTED BREVE
ambiguous name). In the latter case, we'd have to define new behaviour,
i.e. for combinations of double-width diacritics, e.g. < t, 0361,


>William Overington wrote:

>> appear possible that the way that the ts ligature with a dot above for

This is not what typographers would call a ligature.

- Peter

Peter Constable

