L2/08-226

Properties for unassigned code points in CJK ideographs blocks
Eric Muller
Adobe Systems
May 16, 2008

Some areas of the Unicode code point space have been earmarked for RTL scripts, and the value of the Bidi_Class for unassigned code points in those areas has been set accordingly. The idea is to anticipate the value that a property will have once the code point is assigned to an abstract character, and “to maximize compatibility with expected future assignments” (TUS 5.0 p156). This anticipation is not binding in any way: the property value can be changed at the time of assignment (or possibly later). It is nevertheless useful, as it increases the likelihood of a smooth transition when the code point is assigned; at that point, data using the new code points will meet “older” implementations.

In the same vein, we strongly expect that future assignments in the various blocks for CJK ideographs as well as in the rest of plane 2 (SIP) will be for CJK ideographs, and this proposal is to set the value of some properties for the unassigned code points accordingly.

The areas concerned, and their unassigned code points as of Unicode 5.1, are:

Block name	Block range	Unassigned code points
CJK Unified Ideographs Extension A	3400-4DBF	4DB6-4DBF
CJK Unified Ideographs	4E00-9FFF	9FC4-9FFF
CJK Compatibility Ideographs	F900-FAFF	FA2E-FA2F
		FA6B-FA6F
		FADA-FAFF
CJK Unified Ideographs Extension B	20000-2A6DF	2A6D7-2A6DF
(SIP outside blocks)		2A6E0-2F7FF
CJK Compatibility Ideographs Supplement	2F800-2FA1F	2FA1E-2FA1F
(SIP outside blocks)		2FA20-2FFFD
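
As an illustration only, the ranges in the table above can be written down as data and queried. This is a minimal sketch: the names `UNASSIGNED_RANGES`, `is_covered` and `covered_count` are hypothetical, not part of the proposal or of any Unicode data file, and the ranges are simply transcribed from the table as of Unicode 5.1.

```python
# Inclusive (first, last) ranges transcribed from the table above (Unicode 5.1).
# The names used here are hypothetical and for illustration only.
UNASSIGNED_RANGES = [
    (0x4DB6, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x9FC4, 0x9FFF),    # CJK Unified Ideographs
    (0xFA2E, 0xFA2F),    # CJK Compatibility Ideographs
    (0xFA6B, 0xFA6F),    # CJK Compatibility Ideographs
    (0xFADA, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2A6D7, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2A6E0, 0x2F7FF),  # SIP, outside blocks
    (0x2FA1E, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
    (0x2FA20, 0x2FFFD),  # SIP, outside blocks
]


def is_covered(cp: int) -> bool:
    """Return True if cp falls in one of the ranges listed in the table."""
    return any(first <= cp <= last for first, last in UNASSIGNED_RANGES)


def covered_count() -> int:
    """Total number of code points listed in the table."""
    return sum(last - first + 1 for first, last in UNASSIGNED_RANGES)
```

For example, `is_covered(0x9FC4)` is true, while `is_covered(0x9FC3)` is false because U+9FC3 is already assigned in Unicode 5.1.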

Should this proposal be adopted for some version of Unicode, code points that would become assigned by that version would be excluded from this proposal (and just get property values as part of the normal process of encoding).

The potential properties of interest are those which can be predicted accurately, and where the current assignment is different from the expected value:

short name	long name	assigned	unassigned	proposed
gc	General_Category	Lo	Cn	-
ea	East_Asian_Width	W	BMP: N; SIP, in block: W; SIP, outside block: N	W
lb	Line_Break	ID	XX	ID
sc	Script	Hani	Zzzz	-
Ideo	Ideographic	Y	N	-
UIdeo	Unified_Ideograph	Y/N	N	-
Alpha	Alphabetic	Y	N	-
GrBase	Grapheme_Base	Y	N	-
IDS	ID_Start	Y	N	-
XIDS	XID_Start	Y	N	-
IDC	ID_Continue	Y	N	-
XIDC	XID_Continue	Y	N	-
SB	Sentence_Break	LE	XX	-

The properties East_Asian_Width and Line_Break describe the behavior of characters in rendering; predicting those two properties in particular would significantly improve the rendering of text once characters are assigned. The other properties are more about the identity of the characters; while the prediction could be accurate, assigning predicted values to unassigned code points may be misleading and could adversely affect invariants.

The proposal is to assign East_Asian_Width = W and Line_Break = ID to the unassigned code points in the first table.
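
The effect of the proposal on a single code point can be sketched as follows. This is a hedged illustration, not an implementation of any Unicode data file: the function names are hypothetical, the "current" values are those given in the "unassigned" column of the properties table, and the caller is assumed to pass a code point from the first table.

```python
# SIP blocks named in the first table (Unicode 5.1), as inclusive ranges.
# All names below are hypothetical and for illustration only.
SIP_BLOCKS = [
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]


def current_defaults(cp: int) -> dict:
    """Values for an unassigned code point, per the 'unassigned' column.

    Assumes cp is one of the code points listed in the first table.
    """
    if cp <= 0xFFFF:
        ea = "N"   # BMP
    elif any(first <= cp <= last for first, last in SIP_BLOCKS):
        ea = "W"   # SIP, in block
    else:
        ea = "N"   # SIP, outside block
    return {"ea": ea, "lb": "XX"}


def proposed_defaults(cp: int) -> dict:
    """Values under the proposal: East_Asian_Width = W, Line_Break = ID."""
    return {"ea": "W", "lb": "ID"}


def changed_properties(cp: int) -> list:
    """Short names of the properties whose value the proposal would change."""
    before, after = current_defaults(cp), proposed_defaults(cp)
    return sorted(name for name in before if before[name] != after[name])
```

Under these assumptions, `changed_properties(0x9FC4)` lists both `ea` and `lb`, whereas `changed_properties(0x2A6D7)` lists only `lb`, since unassigned code points inside SIP blocks already have East_Asian_Width = W.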
