Re: Decomposing Diacritics Allowed?

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Dec 16 1999 - 17:30:37 EST


Jim Agenbroad asked:

> Thursday, December 16, 1999
> Unicoders,
> Do "dynamic composition" and "equivalent sequences" (2.0, p.2-9)
> include composition of combining characters from their parts?

The answer to this kind of question should always be sought in
UnicodeData.txt, which contains the normative statement of
the Unicode Standard regarding decomposition of characters.

> Some
> examples:
> Is U+0306 (breve) followed by U+0307 (dot above) the same as U+0310
> (candrabindu)?

No.

0310;COMBINING CANDRABINDU;Mn;230;NSM;;;;;N;NON-SPACING CANDRABINDU;;;;
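As a minimal sketch of how such a record can be read, the following splits the line above on semicolons and pulls out the decomposition field (the sixth field in the records quoted here). The helper name `parse_record` and the dictionary keys are illustrative assumptions, not part of any standard API, and decomposition tags such as `<compat>` are not handled.

```python
def parse_record(line):
    """Split one UnicodeData.txt record into a few named fields.

    The keys are illustrative; the positions follow the semicolon-separated
    layout of the records quoted in this message.
    """
    fields = line.strip().split(";")
    return {
        "code": fields[0],
        "name": fields[1],
        "category": fields[2],
        "combining_class": int(fields[3]),
        "bidi": fields[4],
        # An empty list means the character has no decomposition.
        "decomposition": fields[5].split(),
    }


record = parse_record(
    "0310;COMBINING CANDRABINDU;Mn;230;NSM;;;;;N;NON-SPACING CANDRABINDU;;;;"
)
print(record["code"], record["decomposition"])  # prints: 0310 []
```

An empty decomposition field is what makes the answer "No": nothing in the record equates U+0310 with any other sequence.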

Note that the status of this particular character's decomposition changed
between Unicode update version 2.1.5 and Unicode update version 2.1.8:

UnicodeData-2.1.5.txt:0310;COMBINING CANDRABINDU;Mn;230;ON;0306 0307;;;;N;NON-SPACING CANDRABINDU;;;;

UnicodeData-2.1.8.txt:0310;COMBINING CANDRABINDU;Mn;230;ON;;;;;N;NON-SPACING CANDRABINDU;;;;

The UTC decided to change the status of this character's decomposition because
of complications that arose in normalization involving this combining mark.
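The change can be seen mechanically by comparing the two records field by field. This is only a sketch: the function name `changed_fields` is hypothetical, and it assumes the `filename:` prefix shown in the two lines above.

```python
OLD = ("UnicodeData-2.1.5.txt:0310;COMBINING CANDRABINDU;Mn;230;ON;"
       "0306 0307;;;;N;NON-SPACING CANDRABINDU;;;;")
NEW = ("UnicodeData-2.1.8.txt:0310;COMBINING CANDRABINDU;Mn;230;ON;"
       ";;;;N;NON-SPACING CANDRABINDU;;;;")


def changed_fields(old_line, new_line):
    """Return (index, old, new) for every field that differs."""
    # Drop the "filename:" prefix, then split the record on semicolons.
    old_fields = old_line.split(":", 1)[1].split(";")
    new_fields = new_line.split(":", 1)[1].split(";")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(old_fields, new_fields))
        if a != b
    ]


print(changed_fields(OLD, NEW))  # prints: [(5, '0306 0307', '')]
```

Only field 5, the decomposition, differs between the two versions.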

> Is U+0301 (acute) twice the same as U+030B (double acute)?
> Is U+0307 (dot above) twice the same as U+0308 (diaeresis)?

No to both.

030B;COMBINING DOUBLE ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING DOUBLE ACUTE;;;;
0308;COMBINING DIAERESIS;Mn;230;NSM;;;;;N;NON-SPACING DIAERESIS;Dialytika;;;
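The same three answers can be checked with Python's standard `unicodedata` module. Note the assumption: that module carries whatever Unicode version shipped with the interpreter, which is later than the versions discussed here, so this is a sanity check rather than a reading of the 2.1.x data. The helper name `canonically_equivalent` is illustrative.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Breve + dot above versus candrabindu.
print(canonically_equivalent("\u0306\u0307", "\u0310"))
# Acute twice versus double acute.
print(canonically_equivalent("\u0301\u0301", "\u030B"))
# Dot above twice versus diaeresis.
print(canonically_equivalent("\u0307\u0307", "\u0308"))
# None of the three precomposed marks has a decomposition mapping.
print([unicodedata.decomposition(c) for c in "\u0310\u030B\u0308"])
```

All three comparisons report that the sequences are not equivalent, and the decomposition lookups come back empty.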

> Examples 2 and 3 would need careful lateral placement to avoid
> unwanted overlap. The "from the center out" rule would solve the
> order of codes in the first example and it doesn't matter in the other
> two.
> If this has been decided and recorded could someone point me to it?
> If it has not, should it be? I view all three examples as undesirable, but
> at the same time I must admit that I favor decomposition of multiple
> combining characters for Vietnamese (U+1EA4 and following).

Normatively accounted for:

1EA4;LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE;Lu;0;L;00C2 0301;;;;N;;;;1EA5;
00C2;LATIN CAPITAL LETTER A WITH CIRCUMFLEX;Lu;0;L;0041 0302;;;;N;LATIN CAPITAL LETTER A CIRCUMFLEX;;;00E2;

implies: 1EA4 ==> 00C2 + 0301 ==> 0041 + 0302 + 0301
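The chain above is a recursive expansion of the decomposition field. A minimal sketch follows, using a table built by hand from only the two records quoted here; the names `DECOMP` and `expand` are illustrative. A full canonical decomposition would also reorder combining marks by combining class, which this example does not need and does not do.

```python
# Decomposition mappings copied from the two records quoted above.
DECOMP = {
    "1EA4": ["00C2", "0301"],
    "00C2": ["0041", "0302"],
}


def expand(code):
    """Recursively replace a code point by its decomposition, if it has one."""
    parts = DECOMP.get(code)
    if parts is None:
        return [code]  # no mapping: the code point stands for itself
    result = []
    for part in parts:
        result.extend(expand(part))
    return result


print(" + ".join(expand("1EA4")))  # prints: 0041 + 0302 + 0301
```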

--Ken


