L2/01-168

From: Kenneth Whistler [kenw@sybase.com]
Sent: Tuesday, April 10, 2001 3:02 PM

Subject: Bracket Disunification & Normalization Hell

O.k., bracket disunification advocates, I have some questions
for you.

WG2 N2345R advocates the disunification of 6 existing CJK brackets,
to provide explicit math forms. It also renames two math brackets
from PDAM1, disunifies them, and provides 2 new CJK brackets for
that pair.

However, WG2 N2345R says *nothing* about the Unicode properties,
including compatibility decompositions, if any, for the proposed
new brackets. Before the UTC can sign off on these new characters,
we are going to need a coherent story from the advocates regarding
the complete set of properties for them. (I'm not planning to assign
them myself and be left holding the bag when the nitpickers start
pointing out inconsistencies.)

Existing characters and their properties. All of the characters
are Bidi ON, so I will omit that as predictable. Also, the Linebreak
property is OP if the General Category is Ps and CL if the
General Category is Pe, so that is also predictable. The issues
revolve around the East Asian width property, the Other_Math
property, and decompositions.

GCat = Ps, EAW = Na, Other_Math = Y

0028	LEFT PARENTHESIS
005B	LEFT SQUARE BRACKET
007B	LEFT CURLY BRACKET

GCat = Ps, EAW = F,  Other_Math = Y

FF08	FULLWIDTH LEFT PARENTHESIS	==> <wide> 0028
FF3B	FULLWIDTH LEFT SQUARE BRACKET	==> <wide> 005B
FF5B	FULLWIDTH LEFT CURLY BRACKET	==> <wide> 007B

GCat = Ps, EAW = A,  Other_Math = Y

2329	LEFT-POINTING ANGLE BRACKET	==> 3008
3008	LEFT ANGLE BRACKET
301A	LEFT WHITE SQUARE BRACKET

GCat = Ps, EAW = A,  Other_Math = N

300A	LEFT DOUBLE ANGLE BRACKET
3014	LEFT TORTOISE SHELL BRACKET
3018	LEFT WHITE TORTOISE SHELL BRACKET

GCat = Pe, EAW = Na, Other_Math = Y

0029	RIGHT PARENTHESIS
005D	RIGHT SQUARE BRACKET
007D	RIGHT CURLY BRACKET

GCat = Pe, EAW = F,  Other_Math = Y

FF09	FULLWIDTH RIGHT PARENTHESIS	==> <wide> 0029
FF3D	FULLWIDTH RIGHT SQUARE BRACKET	==> <wide> 005D
FF5D	FULLWIDTH RIGHT CURLY BRACKET	==> <wide> 007D

GCat = Pe, EAW = A,  Other_Math = Y

232A	RIGHT-POINTING ANGLE BRACKET	==> 3009
3009	RIGHT ANGLE BRACKET
301B	RIGHT WHITE SQUARE BRACKET

GCat = Pe, EAW = A,  Other_Math = N

300B	RIGHT DOUBLE ANGLE BRACKET
3015	RIGHT TORTOISE SHELL BRACKET
3019	RIGHT WHITE TORTOISE SHELL BRACKET
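
For anyone who wants to check the values listed above, here is a minimal Python sketch using the standard `unicodedata` module. It reports whatever Unicode version ships with the interpreter, so the EAW values it prints for the CJK brackets may differ from the ones in this message, and Other_Math is not exposed by that module at all. The helper name `bracket_props` is made up for illustration.

```python
import unicodedata

def bracket_props(cp):
    """Return (GCat, EAW, decomposition) for one code point."""
    ch = chr(cp)
    return (
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
        unicodedata.decomposition(ch),  # empty string when there is none
    )

if __name__ == "__main__":
    # The bundled data version matters: properties can change between versions.
    print("Unicode data version:", unicodedata.unidata_version)
    for cp in (0x0028, 0xFF08, 0x2329, 0x3008, 0x301A, 0x300A, 0x3014, 0x3018):
        gcat, eaw, decomp = bracket_props(cp)
        print(f"{cp:04X}\t{unicodedata.name(chr(cp))}\t{gcat}\t{eaw}\t{decomp}")
```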


The proposed new characters are:

2B00	MATHEMATICAL LEFT WHITE SQUARE BRACKET
2B01	MATHEMATICAL RIGHT WHITE SQUARE BRACKET
2B02	MATHEMATICAL LEFT ANGLE BRACKET
2B03	MATHEMATICAL RIGHT ANGLE BRACKET
2B04	MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
2B05	MATHEMATICAL RIGHT DOUBLE ANGLE BRACKET

2985	MATHEMATICAL WHITE LEFT PARENTHESIS
2986	MATHEMATICAL WHITE RIGHT PARENTHESIS
33DE	WHITE LEFT PARENTHESIS
33DF	WHITE RIGHT PARENTHESIS

The first 6 are explicitly cloned narrow versions of existing
brackets in the CJK punctuation block. The last 4 are a
cloned-at-birth pair for newly encoded white parentheses.

Let's take the first 6 first. Presumably these are all intended
as EAW = Na and Other_Math = Y. But that raises the question of
what to do about the properties of the characters they are cloned
from. Presumably, 3008, 3009, 300A, 300B, 301A, 301B switch from
EAW = A to EAW = W, since the whole point of the cloning is to
remove the width ambiguity on the CJK characters. Because of
the canonical equivalence defined for 2329 and 232A, they would
presumably also switch to EAW = W.
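
The reason 2329 and 232A get dragged along is that a canonical decomposition makes them disappear under every normalization form. A quick Python check of that, again run against whatever Unicode data the interpreter bundles (the function name `all_forms` is illustrative only):

```python
import unicodedata

def all_forms(ch):
    """Map each normalization form name to the normalized result for ch."""
    return {form: unicodedata.normalize(form, ch)
            for form in ("NFD", "NFC", "NFKD", "NFKC")}

if __name__ == "__main__":
    # 2329 and 232A have singleton canonical decompositions, so no form
    # preserves them; each is replaced by its CJK counterpart.
    for src in ("\u2329", "\u232A"):
        for form, out in all_forms(src).items():
            print(f"{form}: {ord(src):04X} -> {ord(out):04X}")
```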

Regarding the math property, the 6 new characters are explicitly
intended for math, so would get Other_Math = Y. But that raises
the question whether the now explicitly contrasting characters
2329, 232A, 3008, 3009, 301A, 301B should have their Other_Math
property switched to N, as they would no longer be the suggested
versions of the brackets to use in math itself.

And then there is the stickiest question: compatibility
decompositions. What is going on here is a disunification based
on a compatibility issue--character width and glyph positioning
in CJK typographical contexts as contrasted with mathematical
contexts. In the ordinary course of affairs, one would expect
one of each pair to be designated the "real" character, and the
other to be given a compatibility mapping to that character.
But we have a problem here. The prototype for these CJK clones
is established by the fullwidth ASCII:

FF08	FULLWIDTH LEFT PARENTHESIS	==> <wide> 0028

But this pattern fails for the newly suggested disunification
clones because of the legacy status of the CJK punctuation in
the standard. We cannot now add compatibility decompositions for
any of them, since that would break normalization. That leaves
the alternative:

2B00	MATHEMATICAL LEFT WHITE SQUARE BRACKET ==> <narrow> 301A

and so on. Or we could claim no compatibility decompositions should
be provided at all for the new characters, despite the fact that
they are proposed for encoding explicitly as compatibility 
disunification clones. Whichever route we take, however, gets us
into normalization hell.

1. Using the <narrow> decompositions, normalization forms KD and KC
would normalize some of the pairs to ASCII (narrow) and some of the
pairs to CJK punctuation (wide). That is an inconsistency that
belies the nature of the intended contrasts here.

2. Using no decompositions, normalization forms KD and KC would
normalize the existing pairs to ASCII, but would claim that the
new disunifications are distinct and don't normalize to the same
characters. That is *also* inconsistent with the intent of these
characters.
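
The two outcomes can be sketched with a toy model in Python. The tables below are not real Unicode data: 2B00..2B05 are the proposed code points from this message, the <narrow> mappings in `OPTION_1` are the hypothetical ones under discussion, and `toy_nfkd` is a stand-in that only follows single-character compatibility mappings.

```python
# Toy compatibility tables; NOT real Unicode data.
EXISTING = {0xFF08: 0x0028, 0xFF09: 0x0029}   # <wide> mappings already in the standard

OPTION_1 = {**EXISTING,                        # hypothetical <narrow> mappings
            0x2B00: 0x301A, 0x2B01: 0x301B,
            0x2B02: 0x3008, 0x2B03: 0x3009,
            0x2B04: 0x300A, 0x2B05: 0x300B}

OPTION_2 = dict(EXISTING)                      # no decompositions for the new characters

def toy_nfkd(cp, table):
    """Follow compatibility mappings until no further mapping applies."""
    while cp in table:
        cp = table[cp]
    return cp

def same_after_folding(a, b, table):
    """True if a and b end up as the same code point under the table."""
    return toy_nfkd(a, table) == toy_nfkd(b, table)

if __name__ == "__main__":
    for name, table in (("option 1", OPTION_1), ("option 2", OPTION_2)):
        paren = toy_nfkd(0xFF08, table)        # existing pair FF08 / 0028
        white = toy_nfkd(0x2B00, table)        # proposed pair 2B00 / 301A
        print(f"{name}: FF08 -> {paren:04X}, 2B00 -> {white:04X}, "
              f"2B00 ~ 301A: {same_after_folding(0x2B00, 0x301A, table)}")
```

Under option 1 the parenthesis pair folds to the narrow ASCII character while the white square bracket pair folds to the wide CJK character; under option 2 the parenthesis pair still folds together but the new pair does not.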

So which is it, guys? Which inconsistency are you advocating here
for these 6 characters?

There is also another potential problem lurking here. To date,
all characters given a <wide> compatibility decomposition are
"FULLWIDTH" and EAW = F, and all characters given a <narrow>
compatibility decomposition are "HALFWIDTH" and EAW = H. If any
<narrow> decompositions are given for the new characters, that
will break the existing invariant by introducing new characters
that are neither "HALFWIDTH" nor EAW = H. (This is because their
cloned status does not derive from an East Asian legacy character
set single-byte/double-byte encoding distinction.)
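
The EAW half of that invariant can be checked mechanically. The sketch below is only a rough check, assuming Python's `unicodedata` module: it scans the Unicode data bundled with the interpreter (not the data as of this message), and it does not look at character names. The function name `width_tag_violations` is made up for illustration.

```python
import unicodedata

def width_tag_violations():
    """List code points whose <wide>/<narrow> tag disagrees with their EAW."""
    expected = {"<wide>": "F", "<narrow>": "H"}
    bad = []
    for cp in range(0x110000):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        # The decomposition string starts with the tag, if there is one.
        tag = decomp.split(" ", 1)[0] if decomp else ""
        if tag in expected:
            eaw = unicodedata.east_asian_width(ch)
            if eaw != expected[tag]:
                bad.append((cp, tag, eaw))
    return bad

if __name__ == "__main__":
    violations = width_tag_violations()
    print("violations:", [(f"{cp:04X}", tag, eaw) for cp, tag, eaw in violations])
```

A character such as the proposed 2B00 with a <narrow> mapping and EAW = Na would show up in that list.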

Now for the second set of four new characters. These differ from
the first 6 in not being clones of existing characters. That
means that the option of designating the new CJK characters as
<wide> variants of the math version is available. That would be
more consistent with the treatment of existing fullwidth ASCII
parentheses and brackets, but would be inconsistent with the
solutions available for the first 6.

So which is it, guys? Which properties and decompositions are you
advocating for the 4 new characters?

--Ken