Re: GRAPHEME JOINER vs. double diacritics

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jan 03 2002 - 18:48:33 EST


Eric Muller asked:

> Is it correct that the sequences U+x U+0360 U+y and U+x U+034F U+y
> U+0303 should display the same? Would it be worth putting some words
> about those situations in section 13.2 of PDUTR #28?

I think that that should be the case, given the current definitions.

In particular, if U+x = U+006E "n" and U+y = U+0067 "g", you would
get the following three possibilities for writing the Tagalog ng-tilde:

1. <U+006E, U+0360, U+0067>
2. <U+006E, U+FE22, U+0067, U+FE23>
3. <U+006E, U+034F, U+0067, U+0303>

1. uses the double-diacritic tilde, which nominally applies merely to
   the U+006E, but would be designed to lay over the top of a following
   base character on display.

2. uses the compatibility combining double-tilde halves. These occur
   in legacy bibliographic data records. In principle, 2 should display
   in the same way as 1, but would be recommended only for interoperating
   with the legacy data.

3. uses the grapheme joiner to create a "grapheme cluster", which in
   this case would be the digraph "ng". A rendering engine savvy to
   grapheme cluster status should then attempt to apply a following
   combining mark, in this case a regular combining tilde, to the entire
   grapheme cluster, rather than simply to the preceding base character.
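To make the three spellings concrete, here is a minimal sketch in Python that writes them out as strings and lists each code point with its character name and canonical combining class. It uses only the standard unicodedata module. The names SEQUENCES and describe are made up for this illustration, and nothing here says how a rendering engine will draw the result.

```python
import unicodedata

# The three candidate spellings of ng-tilde listed above (illustrative name).
SEQUENCES = {
    1: "\u006E\u0360\u0067",          # n, double tilde, g
    2: "\u006E\uFE22\u0067\uFE23",    # n, left half, g, right half
    3: "\u006E\u034F\u0067\u0303",    # n, grapheme joiner, g, tilde
}


def describe(text):
    """Return (U+XXXX, character name, combining class) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]


for number, text in SEQUENCES.items():
    print(f"Sequence {number}:")
    for code_point, name, ccc in describe(text):
        print(f"  {code_point}  {name}  (ccc={ccc})")
```

The combining classes are worth a look: the double tilde has its own class, distinct from the ordinary tilde and from the two halves, while the grapheme joiner has class 0.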

While these are three alternative ways of representing the "same thing",
we aren't talking about canonical equivalences here. 3 creates a
grapheme cluster (which could have implications for other processing),
while 1 and 2 do not. For example, if I added U+0301 (combining acute)
after each of the above sequences, 1 would put the acute on the "g"
(and might result in overlap with the right half of the double tilde);
2 would put the acute over the right-half tilde on the "g"; 3 should
put the acute midships over the stretched tilde applying to the digraph.
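The point about canonical equivalence can be checked mechanically. The sketch below, again using only Python's standard unicodedata module, normalizes the three sequences under each normalization form and confirms they stay distinct. It then appends U+0301 to each and applies NFC. The helper names are hypothetical, and this only shows what happens to the code points in the data, not what a font or rendering engine would do with them.

```python
import unicodedata

# Same three spellings as in the numbered list above (illustrative names).
SEQUENCES = {
    1: "\u006E\u0360\u0067",
    2: "\u006E\uFE22\u0067\uFE23",
    3: "\u006E\u034F\u0067\u0303",
}
FORMS = ("NFC", "NFD", "NFKC", "NFKD")
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def all_distinct_under(form):
    """True if no two of the sequences normalize to the same string."""
    normalized = {unicodedata.normalize(form, s) for s in SEQUENCES.values()}
    return len(normalized) == len(SEQUENCES)


def code_points(text):
    """Format a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for form in FORMS:
    print(form, "keeps the sequences distinct:", all_distinct_under(form))

# Append the acute to each sequence and see what NFC does with it.
with_acute = {
    number: unicodedata.normalize("NFC", text + ACUTE)
    for number, text in SEQUENCES.items()
}
for number, text in with_acute.items():
    print(number, code_points(text))
```

In sequence 1 the acute immediately follows the "g", so NFC composes the pair into U+01F5 (g with acute), which matches the remark that the acute lands on the "g". In sequences 2 and 3 another mark of the same combining class sits between the "g" and the acute, so no composition takes place and the strings come back unchanged.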

2 is used for interoperating with legacy
bibliographic data, while 1 and 3 are not. And there are quite likely
to be other small formatting differences between the three options. In the
real world it is unlikely that you will run into a "perfect" rendering
engine that would produce exactly the same image from each of the
sequences.

The combining grapheme joiner is the best answer that Unicode currently
has for the extensibility problem for unusual accent placements over
(or under) groups of letters, where the existing compatibility answers
(U+0360..U+0362 for double diacritics; U+FE20..U+FE23 for diacritic halves)
aren't sufficient. For example, it makes it possible to represent a
double breve or a double macron over a group of letters (as seen in
some American dictionary orthographies) or a double (or triple)
underline under one (as seen in some transliterations).

--Ken


