Re: Compatibility decompositions

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Aug 13 2003 - 18:10:15 EDT

    John Cowan asked:

    > I realize that existing compatibility decompositions are a rag-bag,
    > especially those marked with the generic <compat> tag rather than one
    > of the specific tags such as <font>, <initial>, or <super>. I wonder
    > what principles, if any, can be enunciated for giving a newly introduced
    > character a compatibility decomposition at the present time?

    Fortunately, I have just the material to hand to answer such a question --
    a file listing all the additions to Unicode 3.2 and Unicode 4.0. We
    can look in those tea leaves and divine the probable intentions of
    the UTC, based on a pretty good sampling of 2000+ recent character
    additions.

    03F9;GREEK CAPITAL LUNATE SIGMA SYMBOL;Lu;<compat> 03A3;;;;03F2;

      Reason: uppercase of U+03F2, which has a compatibility mapping

    1D2C;MODIFIER LETTER CAPITAL A;Lm;<super> 0041;;;;;
    ...
    1D61;MODIFIER LETTER SMALL CHI;Lm;<super> 03C7;;;;;

      Reason: analogy to existing superscript modifier letters

    1D62;LATIN SUBSCRIPT SMALL LETTER I;Ll;<sub> 0069;;;;;
    ...
    1D6A;GREEK SUBSCRIPT SMALL LETTER CHI;Ll;<sub> 03C7;;;;;

      Reason: analogy to existing superscript modifier letters
          (but these are *sub*script)

    2047;DOUBLE QUESTION MARK;Po;<compat> 003F 003F;;;;;

      Reason: analogy to existing U+2048..U+2049

    2057;QUADRUPLE PRIME;Po;<compat> 2032 2032 2032 2032;;;;;

      Reason: analogy to existing U+2033..U+2034

    205F;MEDIUM MATHEMATICAL SPACE;Zs;<compat> 0020;;;;;

      Reason: analogy to existing fixed-width spaces

    2071;SUPERSCRIPT LATIN SMALL LETTER I;Ll;<super> 0069;;;;;

      Reason: analogy to existing U+207F superscript n

    213D;DOUBLE-STRUCK SMALL GAMMA;Ll;<font> 03B3;;;;;
    ...
    2149;DOUBLE-STRUCK ITALIC SMALL J;Ll;<font> 006A;;;;;

      Reason: analogy to existing font variant letterlike symbols

    2A0C;QUADRUPLE INTEGRAL OPERATOR;Sm;<compat> 222B 222B 222B 222B;;;;;

      Reason: analogy to existing U+222C..U+222D

    2A74;DOUBLE COLON EQUAL;Sm;<compat> 003A 003A 003D;;;;;
    2A75;TWO CONSECUTIVE EQUALS SIGNS;Sm;<compat> 003D 003D;;;;;
    2A76;THREE CONSECUTIVE EQUALS SIGNS;Sm;<compat> 003D 003D 003D;;;;;

      Reason: symbols were explicitly representing sequences of
         elements, but were single entities in the math entity set

    309F;HIRAGANA DIGRAPH YORI;Lo;<vertical> 3088 308A;;;;;
    30FF;KATAKANA DIGRAPH KOTO;Lo;<vertical> 30B3 30C8;;;;;

      Reason: vertical ligated variants of Japanese syllable sequences

    321D;PARENTHESIZED KOREAN CHARACTER OJEON;So;<compat> 0028 110B 1169 110C 1165 11AB 0029;;;;;
    321E;PARENTHESIZED KOREAN CHARACTER O HU;So;<compat> 0028 110B 1169 1112 116E 0029;;;;;
    3250;PARTNERSHIP SIGN;So;<square> 0050 0054 0045;;;;;

      Reason: analogy with all the rest of the existing squared
         compatibility characters originating in Korean standards

    3251;CIRCLED NUMBER TWENTY ONE;No;<circle> 0032 0031;21;;;;
    ...
    32BF;CIRCLED NUMBER FIFTY;No;<circle> 0035 0030;50;;;;

      Reason: analogy with existing circled number characters

    32CC;SQUARE HG;So;<square> 0048 0067;;;;;
    ...
    33FF;SQUARE GAL;So;<square> 0067 0061 006C;;;;;

      Reason: analogy with the rest of the existing squared
         compatibility characters originating in Korean standards

    FDFC;RIAL SIGN;Sc;<isolated> 0631 06CC 0627 0644;;;;;

      Reason: explicit request in the proposal to provide decomposition,
         approved by the committees

    FE47;PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET;Ps;<vertical> 005B;;;;;
    FE48;PRESENTATION FORM FOR VERTICAL RIGHT SQUARE BRACKET;Pe;<vertical> 005D;;;;;

      Reason: analogy with existing vertical form variants
      
    FF5F;FULLWIDTH LEFT WHITE PARENTHESIS;Ps;<wide> 2985;;*;;;
    FF60;FULLWIDTH RIGHT WHITE PARENTHESIS;Pe;<wide> 2986;;*;;;

      Reason: analogy with existing fullwidth characters

    1D4C1;MATHEMATICAL SCRIPT SMALL L;Ll;<font> 006C;;;;;

      Reason: analogy with the rest of the math alphanumerics
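
    The tag-plus-sequence shape of the entries above can be inspected
    programmatically. What follows is a minimal sketch, assuming Python's
    standard unicodedata module (whose data comes from a later Unicode
    version than the 3.2/4.0 listings here); the helper name
    split_decomposition is hypothetical and used only for illustration.

```python
import unicodedata


def split_decomposition(ch):
    """Return (tag, code_points) for the decomposition mapping of ch.

    tag is None when the mapping is canonical (untagged) or absent.
    split_decomposition is a hypothetical helper, not a library function.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith("<"):
        # A leading <...> field marks a compatibility decomposition.
        tag = fields.pop(0).strip("<>")
    return tag, [int(f, 16) for f in fields]


# A few of the characters listed above.
for cp in (0x03F9, 0x2047, 0x2071, 0x309F, 0x3250, 0xFDFC, 0xFF5F, 0x1D4C1):
    tag, mapping = split_decomposition(chr(cp))
    print(f"U+{cp:04X} <{tag}> " + " ".join(f"{m:04X}" for m in mapping))
```

    Each printed line should show the same tag and code point sequence
    as the corresponding entry in the listing, provided the mappings have
    not changed since, which the stability policy is meant to ensure.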

    And then there are canonical equivalences added:

    2ADC;FORKING;Sm;2ADD 0338;;not independent;;;

      Reason: analogy with the other negated math symbols (and
         allowable under Unicode stability policies because the
         base character U+2ADD was encoded at the same time)

    FA30;CJK COMPATIBILITY IDEOGRAPH-FA30;Lo;4FAE;;;;;
    ...
    FA6A;CJK COMPATIBILITY IDEOGRAPH-FA6A;Lo;983B;;;;;

      Reason: analogy with the treatment of all the other Han
         compatibility characters.
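
    The practical difference between the two kinds of mapping shows up
    under normalization: canonical decompositions are applied by NFD (and
    NFKD), while tagged compatibility decompositions are applied only by
    the "K" forms. A small sketch, again assuming Python's unicodedata
    module; the sample set and the code_points helper are illustrative
    choices, not anything from the listings themselves.

```python
import unicodedata

# Two characters with canonical mappings and one with a <compat> mapping.
samples = {
    "U+2ADC FORKING": "\u2adc",
    "U+FA30 CJK COMPATIBILITY IDEOGRAPH-FA30": "\ufa30",
    "U+2047 DOUBLE QUESTION MARK": "\u2047",
}


def code_points(s):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


for name, ch in samples.items():
    nfd = unicodedata.normalize("NFD", ch)
    nfkd = unicodedata.normalize("NFKD", ch)
    print(f"{name}: NFD={code_points(nfd)} NFKD={code_points(nfkd)}")
```

    U+2ADC and U+FA30 are rewritten by NFD already, whereas U+2047 is
    left alone by NFD and only expands under NFKD.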
         
    So you can see from this that the overwhelming reason for providing
    a compatibility (or canonical) decomposition for a newly encoded
    character is analogy with the treatment of existing characters
    which are arguably "just like" the character newly encoded.

    The reason for that is *consistency* in the standard. It would
    be less useful to have some characters treated one way for
    decompositions and others (inexplicably, from the point of
    view of implementers) treated another.
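
    That kind of consistency is something an implementer can test for.
    The sketch below checks that a class of characters is treated
    uniformly, using the circled numbers as the example. It assumes
    Python's unicodedata module, and it assumes the two contiguous
    ranges U+3251..U+325F and U+32B1..U+32BF (the listing above elides
    the middle of the range); check_circled_numbers is a hypothetical
    helper name.

```python
import unicodedata


def check_circled_numbers(first, last):
    """Return code points in [first, last] that break the expected pattern.

    Expected pattern: a <circle> decomposition whose digits spell out the
    character's numeric value. Hypothetical helper for illustration only.
    """
    odd = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        fields = unicodedata.decomposition(ch).split()
        # Everything after the tag is a hexadecimal code point.
        digits = "".join(chr(int(f, 16)) for f in fields[1:])
        value = unicodedata.numeric(ch, None)
        if (not fields or fields[0] != "<circle>"
                or value is None or digits != str(int(value))):
            odd.append(cp)
    return odd


for first, last in ((0x3251, 0x325F), (0x32B1, 0x32BF)):
    print(f"U+{first:04X}..U+{last:04X}:", check_circled_numbers(first, last))
```

    An empty list for a range means every character in it follows the
    same convention, which is the property the analogy-based approach is
    meant to preserve.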

    > In particular, is it sufficient that the character strongly resembles an
    > existing character or combination of characters, but for one or another
    > reason needs to be distinct from it?

    I don't think strong resemblance to an existing character is enough.
    There were plenty of examples among the math symbols of symbols
    that strongly resembled already encoded characters (or each
    other), and which might even ultimately have been derived from
    each other as variants of some sort. But in the absence of
    class analogies with existing sets of characters already
    having compatibility decompositions, new compatibility decompositions
    were not provided in such cases.

    Strong resemblance to a currently representable character sequence
    seems a stronger reason for introducing a compatibility decomposition.
    This is consistently applied for the square CJK abbreviation
    characters, for example, but also played a role in the cases
    of U+2A74..U+2A76 and U+FDFC RIAL SIGN.

    Incidentally, the Unicode stability policy does not preclude
    the possible introduction of new categories of compatibility
    decomposition tags, but I consider it unlikely that any will
    be introduced, because of the potential problems of inconsistency
    with existing decompositions (or lack thereof) which it could
    produce, and because the existing set has settled in and
    is implicated in implementations of collation, for example.
    There would definitely be opposition in the UTC to disturbing
    the tags arbitrarily.

    --Ken


