Re: GRAPHEME JOINER vs. double diacritics

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jan 08 2002 - 17:26:45 EST


O.k., o.k., as Kent and Mark have pointed out, I've
already managed to make my first significant error of the new year.

The intent and wording of the PDUTR #28 text on the CGJ is best
stated in the Article II.3.9 Application of Combining Marks --
a section I overlooked in responding previously to Eric Muller's
query.

The problem, of course, is that if you start to apply ordinary
combining marks to entire grapheme clusters comprised of sequences
with the CGJ, you run afoul of canonical equivalences involving
those combining marks. The same thing does not apply for the
enclosing combining marks, since there are no canonical equivalences
involving those combining marks.

So, taking that into consideration, here is my restatement of
what I think ought to happen for the three possible cases for
the ng-tilde:

1. <U+006E, U+0360, U+0067>
2. <U+006E, U+FE22, U+0067, U+FE23>
3. <U+006E, U+034F, U+0067, U+0303>

1. uses the double-diacritic tilde, which nominally applies merely to
   the U+006E, but would be designed to lay over the top of a following
   base character on display.

2. uses the compatibility combining double-tilde halves. These occur
   in legacy bibliographic data records. In principle, 2 should display
   in the same way as 1, but would be recommended only for interoperating
   with the legacy data.

3. uses the grapheme joiner to create a "grapheme cluster", which in
   this case would be the digraph "ng". Unlike 1 and 2, the tilde would
   apply only to the "g", so that 3 would not display the same as 1 or 2.

To illustrate the canonical equivalence question, consider:

1a. <U+0061, U+0061, U+0301> ==> aá
1b. <U+0061, U+00E1> ==> aá

1a and 1b are canonically equivalent sequences, and should display
the same.
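
The equivalence of 1a and 1b can be checked mechanically. The following
is a small Python sketch, assuming the standard unicodedata module; the
helper name canonically_equivalent is hypothetical and simply compares
NFD forms.

```python
import unicodedata

def canonically_equivalent(x, y):
    # Two strings are canonically equivalent when their NFD forms match.
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)

seq_1a = "\u0061\u0061\u0301"  # a, a, COMBINING ACUTE ACCENT
seq_1b = "\u0061\u00E1"        # a, LATIN SMALL LETTER A WITH ACUTE

print(canonically_equivalent(seq_1a, seq_1b))
# NFC composes the second a with the acute, giving exactly 1b.
print(unicodedata.normalize("NFC", seq_1a) == seq_1b)
```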

2a. <U+0061, U+034F, U+0061, U+0301>
2b. <U+0061, U+034F, U+00E1>

Now if we insert a CGJ between the two a's, the
sequences are still canonically equivalent, and should display the
same. If, however, we say that creating an "aa" grapheme cluster
changes the context over which the following acute accent will display,
then we have a situation where canonically equivalent sequences have
different display (and possibly interpretation). That
wouldn't be a good thing -- hence the wording in PDUTR #28 to preclude
the application of combining marks to other than the base character
they follow (except for enclosing combining marks or other
specified exceptions).
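
The same kind of check works for 2a and 2b. This is a sketch only,
again assuming Python's standard unicodedata module, with variable
names invented for the illustration. It shows that normalization
converts one sequence into the other, so a process that normalizes text
cannot preserve any distinction a renderer might want to draw between
them.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

seq_2a = "\u0061" + CGJ + "\u0061\u0301"  # a, CGJ, a, COMBINING ACUTE ACCENT
seq_2b = "\u0061" + CGJ + "\u00E1"        # a, CGJ, a-acute

# Composing 2a yields 2b, and decomposing 2b yields 2a.
nfc_2a = unicodedata.normalize("NFC", seq_2a)
nfd_2b = unicodedata.normalize("NFD", seq_2b)

print(nfc_2a == seq_2b)
print(nfd_2b == seq_2a)
```

After NFC the acute is folded into U+00E1, which is why treating it as
a mark over the whole "aa" cluster would make the two forms diverge.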

--Ken


