L2/01-168
From: Kenneth Whistler [kenw@sybase.com]
Sent: Tuesday, April 10, 2001 3:02 PM
Subject: Bracket Disunification & Normalization Hell
O.k., bracket disunification advocates, I have some questions
for you.
WG2 N2345R advocates the disunification of 6 existing CJK brackets,
to provide explicit math forms. It also renames two math brackets
from PDAM1, disunifies them, and provides 2 new CJK brackets for
that pair.
However, WG2 N2345R says *nothing* about the Unicode properties,
including compatibility decompositions, if any, for the proposed
new brackets. Before the UTC can sign off on these new characters,
we are going to need a coherent story from the advocates regarding
the complete set of properties for them. (I'm not planning to assign
them myself and be left holding the bag when the nitpickers start
pointing out inconsistencies.)
Existing characters and their properties. All of the characters
are Bidi ON, so I will omit that as predictable. Also, the Linebreak
property is OP if the General Category is Ps and CL if the
General Category is Pe, so that is also predictable. The issues
revolve around the East Asian width property, the Other_Math
property, and decompositions.

GCat = Ps, EAW = Na, Other_Math = Y
  0028 LEFT PARENTHESIS
  005B LEFT SQUARE BRACKET
  007B LEFT CURLY BRACKET

GCat = Ps, EAW = F, Other_Math = Y
  FF08 FULLWIDTH LEFT PARENTHESIS ==> 0028
  FF3B FULLWIDTH LEFT SQUARE BRACKET ==> 005B
  FF5B FULLWIDTH LEFT CURLY BRACKET ==> 007B

GCat = Ps, EAW = A, Other_Math = Y
  2329 LEFT-POINTING ANGLE BRACKET ==> 3008
  3008 LEFT ANGLE BRACKET
  301A LEFT WHITE SQUARE BRACKET

GCat = Ps, EAW = A, Other_Math = N
  300A LEFT DOUBLE ANGLE BRACKET
  3014 LEFT TORTOISE SHELL BRACKET
  3018 LEFT WHITE TORTOISE SHELL BRACKET

GCat = Pe, EAW = Na, Other_Math = Y
  0029 RIGHT PARENTHESIS
  005D RIGHT SQUARE BRACKET
  007D RIGHT CURLY BRACKET

GCat = Pe, EAW = F, Other_Math = Y
  FF09 FULLWIDTH RIGHT PARENTHESIS ==> 0029
  FF3D FULLWIDTH RIGHT SQUARE BRACKET ==> 005D
  FF5D FULLWIDTH RIGHT CURLY BRACKET ==> 007D

GCat = Pe, EAW = A, Other_Math = Y
  232A RIGHT-POINTING ANGLE BRACKET ==> 3009
  3009 RIGHT ANGLE BRACKET
  301B RIGHT WHITE SQUARE BRACKET

GCat = Pe, EAW = A, Other_Math = N
  300B RIGHT DOUBLE ANGLE BRACKET
  3015 RIGHT TORTOISE SHELL BRACKET
  3019 RIGHT WHITE TORTOISE SHELL BRACKET

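
The listing above can be spot-checked with a minimal sketch using Python's standard unicodedata module. It reports whatever version of the Unicode Character Database the running Python ships, so the East Asian width values it prints for some of the CJK brackets may differ from the values listed in this message. The Linebreak rule is hard-coded from the statement above (the module does not expose that property), Other_Math is not available there either, and the helper name bracket_properties is only illustrative.

```python
import unicodedata

# Linebreak class as stated in the text: OP for Ps, CL for Pe.
# Hard-coded assumption; unicodedata does not expose the Linebreak property.
LINEBREAK_FROM_GCAT = {"Ps": "OP", "Pe": "CL"}


def bracket_properties(cp):
    """Collect the properties discussed in the text for one code point."""
    ch = chr(cp)
    gcat = unicodedata.category(ch)
    return {
        "code": f"{cp:04X}",
        "name": unicodedata.name(ch),
        "gcat": gcat,
        "bidi": unicodedata.bidirectional(ch),
        "eaw": unicodedata.east_asian_width(ch),
        "linebreak": LINEBREAK_FROM_GCAT.get(gcat),
        # Empty string means the character has no decomposition.
        "decomposition": unicodedata.decomposition(ch),
    }


if __name__ == "__main__":
    for cp in (0x0028, 0xFF08, 0x2329, 0x3008, 0x301A, 0x300A, 0x3014):
        print(bracket_properties(cp))
```

The decomposition field distinguishes the two kinds of "==>" in the listing: a tagged value such as "<wide> 0028" is a compatibility decomposition, while an untagged value such as "3008" is a canonical one.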
The proposed new characters are:
2B00 MATHEMATICAL LEFT WHITE SQUARE BRACKET
2B01 MATHEMATICAL RIGHT WHITE SQUARE BRACKET
2B02 MATHEMATICAL LEFT ANGLE BRACKET
2B03 MATHEMATICAL RIGHT ANGLE BRACKET
2B04 MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
2B05 MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET
2985 MATHEMATICAL WHITE LEFT PARENTHESIS
2986 MATHEMATICAL WHITE RIGHT PARENTHESIS
33DE WHITE LEFT PARENTHESIS
33DF WHITE RIGHT PARENTHESIS
The first 6 are explicitly cloned narrow versions of existing
brackets in the CJK punctuation block. The last 4 are a
cloned-at-birth pair for newly encoded white parentheses.
Let's take the first 6 first. Presumably these are all intended
as EAW = Na and Other_Math = Y. But that raises the question of
what to do about the properties of the characters they are cloned
from. Presumably, 3008, 3009, 300A, 300B, 301A, 301B switch from
EAW = A to EAW = W, since the whole point of the cloning is to
remove the width ambiguity on the CJK characters. Because of
the canonical equivalence defined for 2329 and 232A, they would
presumably also switch to EAW = W.
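
The canonical equivalence mentioned here can be observed directly. The following is a small sketch, using only Python's standard unicodedata module, that compares canonical decompositions (NFD); the function name canonically_equivalent is just an illustrative helper.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    for a, b in (("\u2329", "\u3008"), ("\u232A", "\u3009")):
        print(f"{ord(a):04X} ~ {ord(b):04X}:", canonically_equivalent(a, b))
```

Because every normalization form replaces 2329 with 3008 (and 232A with 3009), the two characters of each pair cannot sensibly carry different width properties.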
Regarding the math property, the 6 new characters are explicitly
intended for math, so would get Other_Math = Y. But that raises
the question whether the now explicitly contrasting characters
2329, 232A, 3008, 3009, 301A, 301B should have their Other_Math
property switched to N, as they would no longer be the suggested
versions of the brackets to use in math itself.
And then there is the stickiest question: compatibility
decompositions. What is going on here is a disunification based
on a compatibility issue--character width and glyph positioning
in CJK typographical contexts as contrasted with mathematical
contexts. In the ordinary course of affairs, one would expect
one of each pair to be designated the "real" character, and the
other to be given a compatibility mapping to that character.
But we have a problem here. The prototype for these CJK clones
is established by the fullwidth ASCII:
FF08 FULLWIDTH LEFT PARENTHESIS ==> 0028
But this pattern fails for the newly suggested disunification
clones because of the legacy status of the CJK punctuation in
the standard. We cannot now add compatibility decompositions for
any of them, since that would break normalization. That leaves
the alternative:
2B00 MATHEMATICAL LEFT WHITE SQUARE BRACKET ==> 301A
and so on. Or we could claim no compatibility decompositions should
be provided at all for the new characters, despite the fact that
they are proposed for encoding explicitly as compatibility
disunification clones. Whichever route we take, however, gets us
into normalization hell.
1. Using the decompositions, normalization forms KD and KC
would normalize some of the pairs to ASCII (narrow) and some of the
pairs to CJK punctuation (wide). That is an inconsistency that
belies the nature of the intended contrasts here.
2. Using no decompositions, normalization forms KD and KC would
normalize the existing pairs to ASCII, but would claim that the
new disunifications are distinct and don't normalize to the same
characters. That is *also* inconsistent with the intent of these
characters.
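
The two alternatives can be illustrated with a toy model. This is only a sketch under stated assumptions: the 2B00, 2B02 and 2B04 code points are the proposed assignments from this message, not real Unicode data; only the 2B00 mapping is given above and the other two are inferred from the character names; and compat_fold is a hypothetical one-step stand-in for the compatibility part of normalization forms KD and KC.

```python
# Toy model of compatibility folding. EXISTING reflects the fullwidth mappings
# listed in the text; everything involving 2B00/2B02/2B04 is hypothetical.
EXISTING = {0xFF08: 0x0028, 0xFF3B: 0x005B, 0xFF5B: 0x007B}

# Alternative 1: each new math bracket decomposes to its CJK source.
# Only 2B00 -> 301A is given in the text; the other two are inferred by name.
WITH_DECOMPOSITIONS = {0x2B00: 0x301A, 0x2B02: 0x3008, 0x2B04: 0x300A}

# Alternative 2: the new math brackets get no decomposition at all.
NO_DECOMPOSITIONS = {}


def compat_fold(text, new_mappings):
    """Apply one step of compatibility mapping to every character."""
    table = {**EXISTING, **new_mappings}
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


def show(label, new_mappings):
    """Print what each narrow/wide pair folds to under one alternative."""
    print(label)
    for narrow, wide in ((0x0028, 0xFF08), (0x2B00, 0x301A)):
        n = ord(compat_fold(chr(narrow), new_mappings))
        w = ord(compat_fold(chr(wide), new_mappings))
        print(f"  {narrow:04X} / {wide:04X} fold to {n:04X} / {w:04X}")


if __name__ == "__main__":
    show("1. with decompositions", WITH_DECOMPOSITIONS)
    show("2. no decompositions", NO_DECOMPOSITIONS)
```

Under the first alternative the parenthesis pair folds to the narrow form while the white square bracket pair folds to the wide form; under the second, the white square bracket pair does not fold together at all.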
So which is it, guys? Which inconsistency are you advocating here
for these 6 characters?
There is also another potential problem lurking here. To date,
all characters given a <wide> compatibility decomposition are
"FULLWIDTH" and EAW = F, and all characters given a <narrow>
compatibility decomposition are "HALFWIDTH" and EAW = H. If any
<narrow> decompositions are given for the new characters, that
will break the existing invariant by introducing new characters
that are neither "HALFWIDTH" nor EAW = H. (This is because their
cloned status is not derived from an East Asian legacy character
set single-byte/double-byte encoding distinction.)
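
The invariant can be checked mechanically. The sketch below assumes it refers to decompositions tagged <wide> and <narrow> in the character database, and scans whatever database version the running Python ships; the helper names violates and width_invariant_violations are illustrative only.

```python
import sys
import unicodedata

# Assumed invariant: decomposition tag -> required East Asian width value.
EXPECTED_EAW = {"<wide>": "F", "<narrow>": "H"}


def violates(tag, eaw):
    """True if a character with this decomposition tag and EAW breaks the rule."""
    return tag in EXPECTED_EAW and eaw != EXPECTED_EAW[tag]


def width_invariant_violations():
    """Return every code point whose tag and EAW value disagree."""
    bad = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp.startswith("<"):
            continue  # no decomposition, or a canonical (untagged) one
        tag = decomp.split(" ", 1)[0]
        if violates(tag, unicodedata.east_asian_width(ch)):
            bad.append(cp)
    return bad


if __name__ == "__main__":
    print("violations in the installed database:", width_invariant_violations())
    # Hypothetical new math bracket: <narrow> decomposition but EAW = Na.
    print("proposed character would violate:", violates("<narrow>", "Na"))
```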
Now for the second set of four new characters. These differ from
the first 6 in not being clones of existing characters. That
means that the option of designating the new CJK characters as
variants of the math version is available. That would be
more consistent with the treatment of existing fullwidth ASCII
parentheses and brackets, but would be inconsistent with the
solutions available for the first 6.
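
A small sketch of the direction problem: each scenario records which character folds to which, and the check compares the width of the fold target with the fullwidth ASCII precedent. The 2B00, 2985 and 33DE entries use the proposed code points from this message and are hypothetical, as are the scenario labels and function names.

```python
# Width of each fold target, as described in the text.
# The 301A and 2985 rows belong to hypothetical mappings for proposed characters.
TARGET_WIDTH = {0x0028: "narrow", 0x301A: "wide", 0x2985: "narrow"}

# scenario label -> (character that decomposes, character it folds to)
SCENARIOS = {
    "existing fullwidth ASCII": (0xFF08, 0x0028),
    "first 6, math form decomposed": (0x2B00, 0x301A),
    "last 4, CJK form as variant": (0x33DE, 0x2985),
}


def fold_direction(scenario):
    """Return the width of the character the scenario folds to."""
    _source, target = SCENARIOS[scenario]
    return TARGET_WIDTH[target]


def consistent_with_fullwidth(scenario):
    """True if the scenario folds in the same direction as fullwidth ASCII."""
    return fold_direction(scenario) == fold_direction("existing fullwidth ASCII")


if __name__ == "__main__":
    for name in SCENARIOS:
        print(f"{name}: folds to {fold_direction(name)},",
              "consistent" if consistent_with_fullwidth(name) else "inconsistent")
```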
So which is it, guys? Which properties and decompositions are you
advocating for the 4 new characters?
--Ken