L2/01-168

From: Kenneth Whistler [kenw@sybase.com]
Sent: Tuesday, April 10, 2001 3:02 PM
Subject: Bracket Disunification & Normalization Hell

O.k., bracket disunification advocates, I have some questions for you.

WG2 N2345R advocates the disunification of 6 existing CJK brackets, to
provide explicit math forms. It also renames two math brackets from
PDAM1, disunifies them, and provides 2 new CJK brackets for that pair.

However, WG2 N2345R says *nothing* about the Unicode properties,
including compatibility decompositions, if any, for the proposed new
brackets. Before the UTC can sign off on these new characters, we are
going to need a coherent story from the advocates regarding the complete
set of properties for them. (I'm not planning to assign them myself and
be left holding the bag when the nitpickers start pointing out
inconsistencies.)

Existing characters and their properties.

All of the characters are Bidi ON, so I will omit that as predictable.
Also, the Linebreak property is OP if the General Category is Ps and CL
if the General Category is Pe, so that is also predictable. The issues
revolve around the East Asian width property, the Other_Math property,
and decompositions.
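As a rough illustration of the properties at issue, the sketch below reads the General Category, Bidi class, East Asian width, and decomposition of a few existing brackets. It assumes Python's standard `unicodedata` module, whose data reflects a later Unicode version than the one under discussion in this message, so East Asian width values for some brackets may differ from those quoted here. The helper name `bracket_properties` is made up for this example, and Other_Math is not exposed by that module.

```python
import unicodedata


def bracket_properties(ch):
    """Return (GCat, Bidi class, EAW, decomposition) for one character.

    Hypothetical helper for illustration only. An empty decomposition
    string means the character has no decomposition mapping.
    """
    return (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.east_asian_width(ch),
        unicodedata.decomposition(ch),
    )


# ASCII, fullwidth, and the canonically decomposable angle bracket.
for cp in (0x0028, 0xFF08, 0x2329):
    ch = chr(cp)
    print(f"{cp:04X} {unicodedata.name(ch)}: {bracket_properties(ch)}")
```

A decomposition string that begins with a tag such as `<wide>` is a compatibility mapping; one with no tag, as for 2329, is a canonical mapping.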

GCat = Ps, EAW = Na, Other_Math = Y

  0028 LEFT PARENTHESIS
  005B LEFT SQUARE BRACKET
  007B LEFT CURLY BRACKET

GCat = Ps, EAW = F, Other_Math = Y

  FF08 FULLWIDTH LEFT PARENTHESIS ==> 0028
  FF3B FULLWIDTH LEFT SQUARE BRACKET ==> 005B
  FF5B FULLWIDTH LEFT CURLY BRACKET ==> 007B

GCat = Ps, EAW = A, Other_Math = Y

  2329 LEFT-POINTING ANGLE BRACKET ==> 3008
  3008 LEFT ANGLE BRACKET
  301A LEFT WHITE SQUARE BRACKET

GCat = Ps, EAW = A, Other_Math = N

  300A LEFT DOUBLE ANGLE BRACKET
  3014 LEFT TORTOISE SHELL BRACKET
  3018 LEFT WHITE TORTOISE SHELL BRACKET

GCat = Pe, EAW = Na, Other_Math = Y

  0029 RIGHT PARENTHESIS
  005D RIGHT SQUARE BRACKET
  007D RIGHT CURLY BRACKET

GCat = Pe, EAW = F, Other_Math = Y

  FF09 FULLWIDTH RIGHT PARENTHESIS ==> 0029
  FF3D FULLWIDTH RIGHT SQUARE BRACKET ==> 005D
  FF5D FULLWIDTH RIGHT CURLY BRACKET ==> 007D

GCat = Pe, EAW = A, Other_Math = Y

  232A RIGHT-POINTING ANGLE BRACKET ==> 3009
  3009 RIGHT ANGLE BRACKET
  301B RIGHT WHITE SQUARE BRACKET

GCat = Pe, EAW = A, Other_Math = N

  300B RIGHT DOUBLE ANGLE BRACKET
  3015 RIGHT TORTOISE SHELL BRACKET
  3019 RIGHT WHITE TORTOISE SHELL BRACKET

The proposed new characters are:

  2B00 MATHEMATICAL LEFT WHITE SQUARE BRACKET
  2B01 MATHEMATICAL RIGHT WHITE SQUARE BRACKET
  2B02 MATHEMATICAL LEFT ANGLE BRACKET
  2B03 MATHEMATICAL RIGHT ANGLE BRACKET
  2B04 MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
  2B05 MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET

  2985 MATHEMATICAL WHITE LEFT PARENTHESIS
  2986 MATHEMATICAL WHITE RIGHT PARENTHESIS
  33DE WHITE LEFT PARENTHESIS
  33DF WHITE RIGHT PARENTHESIS

The first 6 are explicitly cloned narrow versions of existing brackets
in the CJK punctuation block. The last 4 are a cloned-at-birth pair for
newly encoded white parentheses.

Let's take the first 6 first.

Presumably these are all intended as EAW = Na and Other_Math = Y. But
that raises the question of what to do about the properties of the
characters they are cloned from.

Presumably, 3008, 3009, 300A, 300B, 301A, 301B switch from EAW = A to
EAW = W, since the whole point of the cloning is to remove the width
ambiguity on the CJK characters. Because of the canonical equivalence
defined for 2329 and 232A, they would presumably also switch to EAW = W.

Regarding the math property, the 6 new characters are explicitly
intended for math, so would get Other_Math = Y. But that raises the
question whether the now explicitly contrasting characters 2329, 232A,
3008, 3009, 301A, 301B should have their Other_Math property switched to
N, as they would no longer be the suggested versions of the brackets to
use in math itself.

And then there is the stickiest question: compatibility decompositions.
What is going on here is a disunification based on a compatibility
issue--character width and glyph positioning in CJK typographical
contexts as contrasted with mathematical contexts. In the ordinary
course of affairs, one would expect one of each pair to be designated
the "real" character, and the other to be given a compatibility mapping
to that character. But we have a problem here.

The prototype for these CJK clones is established by the fullwidth
ASCII:

  FF08 FULLWIDTH LEFT PARENTHESIS ==> 0028

But this pattern fails for the newly suggested disunification clones
because of the legacy status of the CJK punctuation in the standard. We
cannot now add compatibility decompositions for any of them, since that
would break normalization. That leaves the alternative:

  2B00 MATHEMATICAL LEFT WHITE SQUARE BRACKET ==> 301A

and so on. Or we could claim no compatibility decompositions should be
provided at all for the new characters, despite the fact that they are
proposed for encoding explicitly as compatibility disunification clones.

Whichever route we take, however, gets us into normalization hell.

1. Using the decompositions, normalization forms KD and KC would
   normalize some of the pairs to ASCII (narrow) and some of the pairs
   to CJK punctuation (wide).
   That is an inconsistency that belies the nature of the intended
   contrasts here.

2. Using no decompositions, normalization forms KD and KC would
   normalize the existing pairs to ASCII, but would claim that the new
   disunifications are distinct and don't normalize to the same
   characters. That is *also* inconsistent with the intent of these
   characters.

So which is it, guys? Which inconsistency are you advocating here for
these 6 characters?

There is also another potential problem lurking here. To date, all
characters given a <wide> compatibility decomposition are "FULLWIDTH"
and EAW = F, and all characters given a <narrow> compatibility
decomposition are "HALFWIDTH" and EAW = H. If any <narrow>
decompositions are given for the new characters, that will break the
existing invariant by introducing new characters that are neither
"HALFWIDTH" nor EAW = H. (This is because their cloned status is not
derived from an East Asian legacy character set single-byte/double-byte
encoding distinction.)

Now for the second set of four new characters.

These differ from the first 6 in not being clones of existing
characters. That means that the option of designating the new CJK
characters as variants of the math version is available. That would be
more consistent with the treatment of existing fullwidth ASCII
parentheses and brackets, but would be inconsistent with the solutions
available for the first 6.

So which is it, guys? Which properties and decompositions are you
advocating for the 4 new characters?

--Ken
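
The two options for the first 6 characters can be made concrete with a toy model. This is a minimal sketch, not real Unicode data: the tables below contain only a few entries, the mapping for 2B00 is the hypothetical one from the proposal, and `toy_nfkd`, `OPTION_1`, and `OPTION_2` are names invented for this example. It models only single-character compatibility mappings, not full normalization.

```python
# Existing compatibility mappings quoted in the message (fullwidth to ASCII).
EXISTING = {0xFF08: 0x0028, 0xFF3B: 0x005B}

# Option 1 (hypothetical): the new math bracket decomposes to the CJK bracket.
OPTION_1 = {**EXISTING, 0x2B00: 0x301A}

# Option 2 (hypothetical): the new math bracket gets no decomposition at all.
OPTION_2 = dict(EXISTING)


def toy_nfkd(cp, table):
    """Follow single-character mappings in `table` until none applies."""
    while cp in table:
        cp = table[cp]
    return cp


# Option 1: one pair folds to the narrow form, the other to the wide form.
print(f"{toy_nfkd(0xFF08, OPTION_1):04X}")  # fullwidth paren folds to ASCII
print(f"{toy_nfkd(0x2B00, OPTION_1):04X}")  # math bracket folds to CJK bracket

# Option 2: the math bracket and its CJK source never normalize together.
print(toy_nfkd(0x2B00, OPTION_2) == toy_nfkd(0x301A, OPTION_2))
```

Under the first table the fold direction differs between pairs (wide to narrow for the fullwidth parenthesis, narrow to wide for the math bracket); under the second the new clone and its source stay distinct, which is the pair of inconsistencies described in points 1 and 2.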