From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Aug 13 2003 - 18:10:15 EDT
John Cowan asked:
> I realize that existing compatibility decompositions are a rag-bag,
> especially those marked with the generic <compat> tag rather than one
> of the specific tags such as <font>, <initial>, or <super>. I wonder
> what principles, if any, can be enunciated for giving a newly introduced
> character a compatibility decomposition at the present time?
Fortunately, I have just the material to hand to answer such a question --
a file listing all the additions to Unicode 3.2 and Unicode 4.0. We
can look in those tea leaves and divine the probable intentions of
the UTC, based on a pretty good sampling of 2000+ recent character
additions.
03F9;GREEK CAPITAL LUNATE SIGMA SYMBOL;Lu;<compat> 03A3;;;;03F2;
Reason: uppercase of U+03F2, which has a compatibility mapping
1D2C;MODIFIER LETTER CAPITAL A;Lm;<super> 0041;;;;;
...
1D61;MODIFIER LETTER SMALL CHI;Lm;<super> 03C7;;;;;
Reason: analogy to existing superscript modifier letters
1D62;LATIN SUBSCRIPT SMALL LETTER I;Ll;<sub> 0069;;;;;
...
1D6A;GREEK SUBSCRIPT SMALL LETTER CHI;Ll;<sub> 03C7;;;;;
Reason: analogy to existing superscript modifier letters
(but these are *sub*script)
2047;DOUBLE QUESTION MARK;Po;<compat> 003F 003F;;;;;
Reason: analogy to existing U+2048..U+2049
2057;QUADRUPLE PRIME;Po;<compat> 2032 2032 2032 2032;;;;;
Reason: analogy to existing U+2033..U+2034
205F;MEDIUM MATHEMATICAL SPACE;Zs;<compat> 0020;;;;;
Reason: analogy to existing fixed-width spaces
2071;SUPERSCRIPT LATIN SMALL LETTER I;Ll;<super> 0069;;;;;
Reason: analogy to existing U+207F superscript n
213D;DOUBLE-STRUCK SMALL GAMMA;Ll;<font> 03B3;;;;;
...
2149;DOUBLE-STRUCK ITALIC SMALL J;Ll;<font> 006A;;;;;
Reason: analogy to existing font variant letterlike symbols
2A0C;QUADRUPLE INTEGRAL OPERATOR;Sm;<compat> 222B 222B 222B 222B;;;;;
Reason: analogy to existing U+222C..U+222D
2A74;DOUBLE COLON EQUAL;Sm;<compat> 003A 003A 003D;;;;;
2A75;TWO CONSECUTIVE EQUALS SIGNS;Sm;<compat> 003D 003D;;;;;
2A76;THREE CONSECUTIVE EQUALS SIGNS;Sm;<compat> 003D 003D 003D;;;;;
Reason: symbols were explicitly representing sequences of
elements, but were single entities in the math entity set
309F;HIRAGANA DIGRAPH YORI;Lo;<vertical> 3088 308A;;;;;
30FF;KATAKANA DIGRAPH KOTO;Lo;<vertical> 30B3 30C8;;;;;
Reason: vertical ligated variants of Japanese syllable sequences
321D;PARENTHESIZED KOREAN CHARACTER OJEON;So;<compat> 0028 110B 1169 110C 1165 11AB 0029;;;;;
321E;PARENTHESIZED KOREAN CHARACTER O HU;So;<compat> 0028 110B 1169 1112 116E 0029;;;;;
3250;PARTNERSHIP SIGN;So;<square> 0050 0054 0045;;;;;
Reason: analogy with all the rest of the existing squared
compatibility characters originating in Korean standards
3251;CIRCLED NUMBER TWENTY ONE;No;<circle> 0032 0031;21;;;;
...
32BF;CIRCLED NUMBER FIFTY;No;<circle> 0035 0030;50;;;;
Reason: analogy with existing circled number characters
32CC;SQUARE HG;So;<square> 0048 0067;;;;;
...
33FF;SQUARE GAL;So;<square> 0067 0061 006C;;;;;
Reason: analogy with the rest of the existing squared
compatibility characters originating in Korean standards
FDFC;RIAL SIGN;Sc;<isolated> 0631 06CC 0627 0644;;;;;
Reason: explicit request in the proposal to provide decomposition,
approved by the committees
FE47;PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET;Ps;<vertical> 005B;;;;;
FE48;PRESENTATION FORM FOR VERTICAL RIGHT SQUARE BRACKET;Pe;<vertical> 005D;;;;;
Reason: analogy with existing vertical form variants
FF5F;FULLWIDTH LEFT WHITE PARENTHESIS;Ps;<wide> 2985;;*;;;
FF60;FULLWIDTH RIGHT WHITE PARENTHESIS;Pe;<wide> 2986;;*;;;
Reason: analogy with existing fullwidth characters
1D4C1;MATHEMATICAL SCRIPT SMALL L;Ll;<font> 006C;;;;;
Reason: analogy with the rest of the math alphanumerics
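For anyone who wants to check entries like these without the data file to hand, here is a minimal sketch using Python's standard unicodedata module. The helper name split_decomposition is hypothetical, introduced only for illustration, and the values returned come from whatever version of the Unicode Character Database ships with the Python build in use.

```python
import unicodedata

def split_decomposition(ch):
    """Return (tag, [code points]) for ch.

    tag is None for a canonical mapping (no <...> tag in the field).
    Returns (None, []) when the character has no decomposition at all.
    """
    field = unicodedata.decomposition(ch)
    if not field:
        return None, []
    parts = field.split()
    tag = None
    if parts[0].startswith("<"):
        tag = parts[0].strip("<>")  # e.g. "compat", "super", "font"
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]

# A few of the characters listed above.
for cp in (0x03F9, 0x2047, 0x2A74, 0xFDFC, 0x1D4C1):
    tag, mapping = split_decomposition(chr(cp))
    print(f"U+{cp:04X} <{tag}> " + " ".join(f"{m:04X}" for m in mapping))
```

The tag and the mapping are kept separate because the tag is what distinguishes a compatibility decomposition from a canonical one; the mapping alone does not tell you which kind it is.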
And then there are canonical equivalences added:
2ADC;FORKING;Sm;2ADD 0338;;not independent;;;
Reason: analogy with the other negated math symbols (and
allowable under Unicode stability policies because the
base character U+2ADD was encoded at the same time)
FA30;CJK COMPATIBILITY IDEOGRAPH-FA30;Lo;4FAE;;;;;
...
FA6A;CJK COMPATIBILITY IDEOGRAPH-FA6A;Lo;983B;;;;;
Reason: analogy with the treatment of all the other Han
compatibility characters.
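The practical difference between the two kinds of mapping shows up under normalization: canonical mappings are applied by NFD, while tagged compatibility mappings are applied only by the K forms. A small sketch, again assuming Python's standard unicodedata module (the variable names are just for illustration):

```python
import unicodedata

forking = "\u2ADC"          # canonical mapping: 2ADD 0338
han_compat = "\uFA30"       # canonical mapping: 4FAE
double_question = "\u2047"  # <compat> mapping: 003F 003F

def codepoints(s):
    """Render a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

for ch in (forking, han_compat, double_question):
    nfd = unicodedata.normalize("NFD", ch)
    nfkd = unicodedata.normalize("NFKD", ch)
    print(f"{codepoints(ch)}: NFD = {codepoints(nfd)}, NFKD = {codepoints(nfkd)}")
```

U+2047 passes through NFD unchanged and is only expanded by NFKD, whereas the two canonically decomposable characters are replaced under both forms.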
So you can see from this that the overwhelming reason for providing
a compatibility (or canonical) decomposition for a newly encoded
character is analogy with the treatment of existing characters
which are arguably "just like" the character newly encoded.
The reason for that is *consistency* in the standard. It would
be less useful to have some characters treated one way for
decompositions and others (inexplicably, from the point of
view of implementers) treated another.
> In particular, is it sufficient that the character strongly resembles an
> existing character or combination of characters, but for one or another
> reason needs to be distinct from it?
I don't think strong resemblance to an existing character is enough.
There were plenty of examples among the math symbols of symbols
that strongly resembled already encoded characters (or each
other), and which might even ultimately have been derived from
each other as variants of some sort. But in the absence of
class analogies with existing sets of characters already
having compatibility decompositions, new compatibility decompositions
were not provided in such cases.
Strong resemblance to a currently representable character sequence
seems a stronger reason for introducing a compatibility decomposition.
This is consistently applied for the square CJK abbreviation
characters, for example, but also played a role in the cases
of U+2A74..U+2A76 and U+FDFC RIAL SIGN.
Incidentally, the Unicode stability policy does not preclude
the possible introduction of new categories of compatibility
decomposition tags, but I consider it unlikely that any will
be introduced, because of the potential problems of inconsistency
with existing decompositions (or lack thereof) which it could
produce, and because the existing set has settled in and
is implicated in implementations of collation, for example.
There would definitely be opposition in the UTC to disturbing
the tags arbitrarily.
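To see how settled the existing set of tags is, one can simply tally them. This is a rough sketch, assuming Python's standard unicodedata module; the function name tag_counts is hypothetical, and the counts themselves will vary with the Unicode version bundled with the interpreter, so none are quoted here.

```python
import sys
import unicodedata
from collections import Counter

def tag_counts():
    """Count characters per compatibility decomposition tag."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        field = unicodedata.decomposition(chr(cp))
        if field.startswith("<"):
            # The field looks like "<tag> XXXX XXXX"; keep only the tag name.
            counts[field.split(">", 1)[0][1:]] += 1
    return counts

counts = tag_counts()
for tag, n in counts.most_common():
    print(f"<{tag}> {n}")
```

Canonical mappings carry no tag and are skipped here; only the tagged compatibility decompositions are counted.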
--Ken