RE: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

From: Peter_Constable@sil.org
Date: Wed Sep 18 2002 - 09:24:51 EDT


On 09/18/2002 03:48:41 AM Marco Cimarosti wrote:

>Using the COMBINING DOUBLE INVERTED BREVE doesn't make things much better:
>
> t U+0361 s U+0307
>
>Still, <U+0361> only applies to <t>,

I'm not sure why you say that. < t, 0361, s > should render with the
inverted breve spanning both the t and the s. And 0361 is definitely
preferable (except perhaps certain FE legacy contexts) to FE20 + FE21.

>and <U+0307> only applies to <s>.

Here's where the ambiguity lies: what is the interaction between a
double-wide combining mark and a single-wide combining mark? Given the
sequence < t, 0361, s, 0307 >, should the dot appear over the center of the
inverted breve, over the right side of the inverted breve, or should the
dot be over the s with the inverted breve above that? While section 2.6
doesn't shed any light on this, section 7.9 does: "double diacritics always
bind more loosely than all other non-spacing marks except U+0345 iota
subscript... In rendering, the double diacritic will float above other
diacritics (excluding surrounding diacritics)..." Thus, given the sequence
< t, 0361, s, 0307 >, the dot will appear over the s, and the inverted
breve will span the t and the s-dot. (Figure 7-6 illustrates the
principle.)
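For anyone who wants to inspect the sequence, here is a minimal Python sketch
using the standard unicodedata module. The helper name describe is my own,
purely for illustration. It only lists character properties and runs the
normalisation forms; it says nothing about what a given font or renderer will
actually draw.

```python
import unicodedata

# The sequence under discussion: < t, U+0361, s, U+0307 >
seq = "t\u0361s\u0307"

def describe(text):
    """Return (code point, character name, canonical combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

rows = describe(seq)
for row in rows:
    print(row)

# The s is a starter (combining class 0) sitting between the two marks,
# so canonical reordering never moves the dot next to the double breve.
nfd = unicodedata.normalize("NFD", seq)

# Under NFC the dot composes with the s (giving U+1E61), which is consistent
# with the dot belonging to the s rather than to the double diacritic.
nfc = unicodedata.normalize("NFC", seq)
print([f"U+{ord(ch):04X}" for ch in nfc])
```

The combining classes (234 for the double inverted breve, 230 for the dot
above) are data for canonical ordering only; the "floats above other
diacritics" behaviour comes from the text of section 7.9, not from these
numbers.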

>Perhaps, a viable approach could be using the COMBINING GRAPHEME JOINER (to
>turn <ts> into a single 'grapheme'), and then use regular combining marks
>(as opposed to the "double" clones):
>
> t U+034F s U+0311 U+0307

No, that is not an appropriate solution. From Unicode 3.2 (UTR#28):

<quote clause=3.9>
Formally speaking, combining marks apply to the preceding grapheme cluster.
In most cases, this is the same as applying to the preceding base
character. However, in two circumstances there is a difference:

      Hangul syllables
      Enclosing combining marks

...where elements are linked by a Grapheme_Link or combining grapheme
joiner, non-enclosing combining marks only apply to the last base
character.
</quote>

Thus, given the sequence < t, 034F, s, 0311, 0307 >, the last two combining
marks apply only to the s. Note that if this were otherwise, there would be
an issue (yet another) in relation to equivalence and normalisation: the
sequences < t, 034F, s, 0311 > and < t, 0361, s > would be identical in
appearance (and used to mean the same thing), but they would not be
canonically equivalent and would be distinguished after normalisation.
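The non-equivalence is easy to check. A small sketch, again assuming Python's
unicodedata module (the function name canonically_equivalent is mine, not
anything from the standard):

```python
import unicodedata

# < t, U+034F, s, U+0311 >: grapheme joiner plus ordinary inverted breve
cgj_form = "t\u034Fs\u0311"
# < t, U+0361, s >: double inverted breve
double_form = "t\u0361s"

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

equivalent = canonically_equivalent(cgj_form, double_form)
print(equivalent)

# The two sequences also stay distinct under NFC.
nfc_cgj = unicodedata.normalize("NFC", cgj_form)
nfc_double = unicodedata.normalize("NFC", double_form)
print(nfc_cgj == nfc_double)
```

Both comparisons come out false: neither U+034F nor U+0361 has a canonical
decomposition, so no normalisation form maps one spelling onto the other.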

Thus, I don't think there is any current solution for the ts + inverted
breve + dot above used in LOC's non-Slavic Romanisations. I don't know if
LOC is needing a solution for this. If so, the only thing I can think of
would be to propose a new character, either COMBINING DOUBLE INVERTED BREVE
WITH DOT ABOVE or COMBINING DOUBLE DOT ABOVE (with an unfortunately
ambiguous name). In the latter case, we'd have to define new behaviour,
i.e. for combinations of double-width diacritics, e.g. < t, 0361,
COMBINING DOUBLE DOT ABOVE, s >.

BTW,

>William Overington wrote:

>> appear possible that the way that the ts ligature with a dot above for

This is not what typographers would call a ligature.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>



This archive was generated by hypermail 2.1.2 : Wed Sep 18 2002 - 10:36:23 EDT