Re: Compatibility decompositions

From: Kenneth Whistler
Date: Wed Aug 13 2003 - 18:10:15 EDT

Next message: Mark Davis: "Re: [hebrew] Re: Consensus, draft 2"

Previous message: Jony Rosenne: "RE: Questions on ZWNBS - for line initial holam plus alef"
Maybe in reply to: John Cowan: "Compatibility decompositions"

John Cowan asked:

> I realize that existing compatibility decompositions are a rag-bag,
> especially those marked with the generic <compat> tag rather than one
> of the specific tags such as <font>, <initial>, or <super>. I wonder
> what principles, if any, can be enunciated for giving a newly introduced
> character a compatibility decomposition at the present time?

Fortunately, I have just the material to hand to answer such a question --
a file listing all the additions to Unicode 3.2 and Unicode 4.0. We
can look in those tea leaves and divine the probable intentions of
the UTC, based on a pretty good sampling of 2000+ recent character
additions.

03F9;GREEK CAPITAL LUNATE SIGMA SYMBOL;Lu;<compat> 03A3;;;;03F2;

Reason: uppercase of U+03F2, which has a compatibility mapping

1D2C;MODIFIER LETTER CAPITAL A;Lm;<super> 0041;;;;;
...
1D61;MODIFIER LETTER SMALL CHI;Lm;<super> 03C7;;;;;

Reason: analogy to existing superscript modifier letters

1D62;LATIN SUBSCRIPT SMALL LETTER I;Ll;<sub> 0069;;;;;
...
1D6A;GREEK SUBSCRIPT SMALL LETTER CHI;Ll;<sub> 03C7;;;;;

Reason: analogy to existing superscript modifier letters
(but these are *sub*script)

2047;DOUBLE QUESTION MARK;Po;<compat> 003F 003F;;;;;

Reason: analogy to existing U+2048..U+2049

2057;QUADRUPLE PRIME;Po;<compat> 2032 2032 2032 2032;;;;;

Reason: analogy to existing U+2033..U+2034

205F;MEDIUM MATHEMATICAL SPACE;Zs;<compat> 0020;;;;;

Reason: analogy to existing fixed-width spaces

2071;SUPERSCRIPT LATIN SMALL LETTER I;Ll;<super> 0069;;;;;

Reason: analogy to existing U+207F superscript n

213D;DOUBLE-STRUCK SMALL GAMMA;Ll;<font> 03B3;;;;;
...
2149;DOUBLE-STRUCK ITALIC SMALL J;Ll;<font> 006A;;;;;

Reason: analogy to existing font variant letterlike symbols

2A0C;QUADRUPLE INTEGRAL OPERATOR;Sm;<compat> 222B 222B 222B 222B;;;;;

Reason: analogy to existing U+222C..U+222D

2A74;DOUBLE COLON EQUAL;Sm;<compat> 003A 003A 003D;;;;;
2A75;TWO CONSECUTIVE EQUALS SIGNS;Sm;<compat> 003D 003D;;;;;
2A76;THREE CONSECUTIVE EQUALS SIGNS;Sm;<compat> 003D 003D 003D;;;;;

Reason: symbols were explicitly representing sequences of
elements, but were single entities in the math entity set

309F;HIRAGANA DIGRAPH YORI;Lo;<vertical> 3088 308A;;;;;
30FF;KATAKANA DIGRAPH KOTO;Lo;<vertical> 30B3 30C8;;;;;

Reason: vertical ligated variants of Japanese syllable sequences

321D;PARENTHESIZED KOREAN CHARACTER OJEON;So;<compat> 0028 110B 1169 110C 1165 11AB 0029;;;;;
321E;PARENTHESIZED KOREAN CHARACTER O HU;So;<compat> 0028 110B 1169 1112 116E 0029;;;;;
3250;PARTNERSHIP SIGN;So;<square> 0050 0054 0045;;;;;

Reason: analogy with all the rest of the existing squared
compatibility characters originating in Korean standards

3251;CIRCLED NUMBER TWENTY ONE;No;<circle> 0032 0031;21;;;;
...
32BF;CIRCLED NUMBER FIFTY;No;<circle> 0035 0030;50;;;;

Reason: analogy with existing circled number characters

32CC;SQUARE HG;So;<square> 0048 0067;;;;;
...
33FF;SQUARE GAL;So;<square> 0067 0061 006C;;;;;

Reason: analogy with the rest of the existing squared
compatibility characters originating in Korean standards

FDFC;RIAL SIGN;Sc;<isolated> 0631 06CC 0627 0644;;;;;

Reason: explicit request in the proposal to provide decomposition,
approved by the committees

FE47;PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET;Ps;<vertical> 005B;;;;;
FE48;PRESENTATION FORM FOR VERTICAL RIGHT SQUARE BRACKET;Pe;<vertical> 005D;;;;;

Reason: analogy with existing vertical form variants

FF5F;FULLWIDTH LEFT WHITE PARENTHESIS;Ps;<wide> 2985;;*;;;
FF60;FULLWIDTH RIGHT WHITE PARENTHESIS;Pe;<wide> 2986;;*;;;

Reason: analogy with existing fullwidth characters

1D4C1;MATHEMATICAL SCRIPT SMALL L;Ll;<font> 006C;;;;;

Reason: analogy with the rest of the math alphanumerics
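
For anyone who wants to sort records like these by tag, here is a
minimal sketch in Python. It assumes the abbreviated field layout used
in this message (code point; name; general category; decomposition;
...), which is not the full UnicodeData.txt layout. The names
parse_record and group_by_tag are hypothetical helpers, not part of
any Unicode tooling.

```python
from collections import defaultdict

# A few records copied from this message (abbreviated layout, assumed:
# field 0 = code point, 1 = name, 2 = general category, 3 = decomposition).
RECORDS = [
    "2047;DOUBLE QUESTION MARK;Po;<compat> 003F 003F;;;;;",
    "2057;QUADRUPLE PRIME;Po;<compat> 2032 2032 2032 2032;;;;;",
    "309F;HIRAGANA DIGRAPH YORI;Lo;<vertical> 3088 308A;;;;;",
    "3251;CIRCLED NUMBER TWENTY ONE;No;<circle> 0032 0031;21;;;;",
    "FDFC;RIAL SIGN;Sc;<isolated> 0631 06CC 0627 0644;;;;;",
    "FA30;CJK COMPATIBILITY IDEOGRAPH-FA30;Lo;4FAE;;;;;",
]


def parse_record(line):
    """Return (code point, name, tag, mapping) for one record.

    tag is None when the decomposition field carries no <tag>,
    which marks the mapping as canonical.
    """
    fields = line.split(";")
    code, name, decomp = int(fields[0], 16), fields[1], fields[3]
    parts = decomp.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts.pop(0).strip("<>")
    mapping = [int(p, 16) for p in parts]
    return code, name, tag, mapping


def group_by_tag(lines):
    """Map each tag (or "canonical") to the code points that use it."""
    groups = defaultdict(list)
    for line in lines:
        code, _name, tag, _mapping = parse_record(line)
        groups[tag or "canonical"].append(code)
    return dict(groups)


if __name__ == "__main__":
    for tag, codes in group_by_tag(RECORDS).items():
        print(tag, [f"U+{c:04X}" for c in codes])
```

Grouping by tag this way makes the class analogies easy to see: each
bucket lines up with one of the "Reason:" notes above.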

And then there are canonical equivalences added:

2ADC;FORKING;Sm;2ADD 0338;;not independent;;;

  Reason: analogy with the other negated math symbols (and
     allowable under Unicode stability policies because the
     base character U+2ADD was encoded at the same time)

FA30;CJK COMPATIBILITY IDEOGRAPH-FA30;Lo;4FAE;;;;;
...
FA6A;CJK COMPATIBILITY IDEOGRAPH-FA6A;Lo;983B;;;;;

  Reason: analogy with the treatment of all the other Han
     compatibility characters.
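
The difference between these canonical mappings and the tagged
compatibility mappings above can be checked with Python's standard
unicodedata module. This is a sketch only: it assumes the
interpreter's bundled character database includes these Unicode 3.2
and 4.0 additions, and is_canonical is a hypothetical helper.

```python
import unicodedata


def is_canonical(ch):
    """True if ch has a decomposition mapping with no <tag> prefix."""
    d = unicodedata.decomposition(ch)
    return bool(d) and not d.startswith("<")


forking = "\u2ADC"      # FORKING
cjk_compat = "\uFA30"   # CJK COMPATIBILITY IDEOGRAPH-FA30
double_q = "\u2047"     # DOUBLE QUESTION MARK (compatibility, for contrast)

if __name__ == "__main__":
    for ch in (forking, cjk_compat, double_q):
        nfd = unicodedata.normalize("NFD", ch)
        print(
            f"U+{ord(ch):04X}",
            "canonical" if is_canonical(ch) else "compatibility",
            "NFD ->", " ".join(f"U+{ord(c):04X}" for c in nfd),
        )
```

Canonical mappings are applied by NFD as well as NFKD, whereas a
tagged mapping such as the one for U+2047 is applied only by the
compatibility forms.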

So you can see from this that the overwhelming reason for providing
a compatibility (or canonical) decomposition for a newly encoded
character is analogy with the treatment of existing characters
which are arguably "just like" the character newly encoded.
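
One way to see the analogy principle at work is to compare the tag on
a newly added character with the tag on an older character of the
same class. The sketch below uses Python's unicodedata module. The
pairings are my own choice of "existing analog" for illustration
(for instance, EM SPACE standing in for the fixed-width spaces), and
tag_of is a hypothetical helper.

```python
import unicodedata


def tag_of(ch):
    """Return the decomposition tag, "canonical" if untagged, or None."""
    d = unicodedata.decomposition(ch)
    if not d:
        return None
    first = d.split()[0]
    return first if first.startswith("<") else "canonical"


# (newer character, older character chosen here as its analog)
PAIRS = [
    ("\u2047", "\u2048"),  # DOUBLE QUESTION MARK / QUESTION EXCLAMATION MARK
    ("\u2057", "\u2034"),  # QUADRUPLE PRIME / TRIPLE PRIME
    ("\u205F", "\u2003"),  # MEDIUM MATHEMATICAL SPACE / EM SPACE
    ("\u2071", "\u207F"),  # SUPERSCRIPT LATIN SMALL LETTER I / ... N
    ("\u2A0C", "\u222D"),  # QUADRUPLE INTEGRAL OPERATOR / TRIPLE INTEGRAL
    ("\uFE47", "\uFE35"),  # vertical left square bracket / left parenthesis
    ("\uFF5F", "\uFF08"),  # fullwidth left white parenthesis / left parenthesis
]

if __name__ == "__main__":
    for new, old in PAIRS:
        print(f"U+{ord(new):04X} {tag_of(new)}  vs  U+{ord(old):04X} {tag_of(old)}")
```

Each pair should report the same tag on both sides, which is the
consistency being described here.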

The reason for that is *consistency* in the standard. It would
be less useful to have some characters treated one way for
decompositions and others (inexplicably, from the point of
view of implementers) treated another.

> In particular, is it sufficient that the character strongly resembles an
> existing character or combination of characters, but for one or another
> reason needs to be distinct from it?

I don't think strong resemblance to an existing character is enough.
There were plenty of examples among the math symbols of symbols
that strongly resembled already encoded characters (or each
other), and which might even ultimately have been derived from
each other as variants of some sort. But in the absence of
class analogies with existing sets of characters already
having compatibility decompositions, new compatibility decompositions
were not provided in such cases.

Strong resemblance to a currently representable character sequence
seems a stronger reason for introducing a compatibility decomposition.
This is consistently applied for the square CJK abbreviation
characters, for example, but also played a role in the cases
of U+2A74..U+2A76 and U+FDFC RIAL SIGN.
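
To make the "representable character sequence" point concrete, the
following sketch expands a few of these characters with NFKD using
Python's unicodedata module. It assumes the interpreter's character
database covers these code points; compat_expansion is a hypothetical
helper name.

```python
import unicodedata


def compat_expansion(ch):
    """Return the NFKD form of ch, i.e. its full compatibility expansion."""
    return unicodedata.normalize("NFKD", ch)


SAMPLES = [
    "\u2A74",  # DOUBLE COLON EQUAL
    "\u2A75",  # TWO CONSECUTIVE EQUALS SIGNS
    "\u2A76",  # THREE CONSECUTIVE EQUALS SIGNS
    "\u33FF",  # SQUARE GAL
    "\u3250",  # PARTNERSHIP SIGN
    "\uFDFC",  # RIAL SIGN
]

if __name__ == "__main__":
    for ch in SAMPLES:
        expansion = compat_expansion(ch)
        codes = " ".join(f"{ord(c):04X}" for c in expansion)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {codes}")
```

In each case the expansion is an ordinary sequence of already encoded
characters, which is what distinguishes these from mere look-alikes.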

Incidentally, the Unicode stability policy does not preclude
the possible introduction of new categories of compatibility
decomposition tags, but I consider it unlikely that any will
be introduced, because of the potential problems of inconsistency
with existing decompositions (or lack thereof) which it could
produce, and because the existing set has settled in and
is implicated in implementations of collation, for example.
There would definitely be opposition in the UTC to disturbing
the tags arbitrarily.
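
For reference, the set of tags actually in use can be listed from
whatever character database an implementation ships. Here is a rough
sketch that scans every code point with Python's unicodedata module.
The result reflects the Unicode version bundled with the interpreter,
and decomposition_tags is a hypothetical helper.

```python
import sys
import unicodedata


def decomposition_tags():
    """Collect every <tag> that appears in a decomposition mapping."""
    tags = set()
    for cp in range(sys.maxunicode + 1):
        d = unicodedata.decomposition(chr(cp))
        if d.startswith("<"):
            tags.add(d.split()[0])
    return tags


if __name__ == "__main__":
    for tag in sorted(decomposition_tags()):
        print(tag)
```

The scan visits every code point once, so it takes a moment, but it
needs no data files beyond what the interpreter already carries.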

--Ken

This archive was generated by hypermail 2.1.5 : Wed Aug 13 2003 - 18:58:38 EDT