L2/08-226

 

Properties for unassigned code points in CJK ideographs blocks
Eric Muller
Adobe Systems
May 16, 2008

Some areas of the Unicode code point space have been earmarked for RTL scripts, and the value of the Bidi_Class for unassigned code points in those areas has been set accordingly. The idea is to anticipate the value that a property will have once the code point is assigned to an abstract character, and “to maximize compatibility with expected future assignments” (TUS 5.0 p156). This anticipation is not binding in any way: the property value can be changed at the time of assignment (or possibly later). It is nevertheless useful, as it increases the likelihood of a smooth transition when the code point is assigned; at that point, data using the new code points will meet “older” implementations.

In the same vein, we strongly expect that future assignments in the various blocks for CJK ideographs as well as in the rest of plane 2 (SIP) will be for CJK ideographs, and this proposal is to set the value of some properties for the unassigned code points accordingly.

The concerned areas and unassigned code points as of Unicode 5.1 are:

Block name                                Block range   Unassigned code points
CJK Unified Ideographs Extension A        3400-4DBF     4DB6-4DBF
CJK Unified Ideographs                    4E00-9FFF     9FC4-9FFF
CJK Compatibility Ideographs              F900-FAFF     FA2E-FA2F
                                                        FA6B-FA6F
                                                        FADA-FAFF
CJK Unified Ideographs Extension B        20000-2A6DF   2A6D7-2A6DF
(SIP outside blocks)                                    2A6E0-2F7FF
CJK Compatibility Ideographs Supplement   2F800-2FA1F   2FA1E-2FA1F
(SIP outside blocks)                                    2FA20-2FFFD
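
For illustration, the ranges in the table above can be written down as data and used for a simple membership check. This is a minimal Python sketch, not part of the proposal: the names PROPOSED_RANGES and in_proposed_range are hypothetical, and the ranges are valid for Unicode 5.1 only.

```python
# Unassigned ranges from the table above (Unicode 5.1), inclusive on both ends.
# PROPOSED_RANGES and in_proposed_range are illustrative names only.
PROPOSED_RANGES = [
    (0x4DB6, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x9FC4, 0x9FFF),    # CJK Unified Ideographs
    (0xFA2E, 0xFA2F),    # CJK Compatibility Ideographs
    (0xFA6B, 0xFA6F),    # CJK Compatibility Ideographs
    (0xFADA, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2A6D7, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2A6E0, 0x2F7FF),  # SIP outside blocks
    (0x2FA1E, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
    (0x2FA20, 0x2FFFD),  # SIP outside blocks
]


def in_proposed_range(cp: int) -> bool:
    """Return True if code point cp falls in one of the listed ranges."""
    return any(lo <= cp <= hi for lo, hi in PROPOSED_RANGES)
```

A version of Unicode that assigns some of these code points would need a correspondingly shorter list.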

Should this proposal be adopted for some version of Unicode, code points that would become assigned by that version would be excluded from this proposal (and just get property values as part of the normal process of encoding).

The potential properties of interest are those which can be predicted accurately, and where the current assignment is different from the expected value:

short name  long name          assigned  unassigned              proposed
gc          General_Category   Lo        Cn                      -
ea          East_Asian_Width   W         BMP: N                  W
                                         SIP, in block: W
                                         SIP, outside block: N
lb          Line_Break         ID        XX                      ID
sc          Script             Hani      Zzzz                    -
Ideo        Ideographic        Y         N                       -
UIdeo       Unified_Ideograph  Y/N       N                       -
Alpha       Alphabetic         Y         N                       -
GrBase      Grapheme_Base      Y         N                       -
IDS         ID_Start           Y         N                       -
XIDS        XID_Start          Y         N                       -
IDC         ID_Continue        Y         N                       -
XIDC        XID_Continue       Y         N                       -
SB          Sentence_Break     LE        XX                      -

The properties East_Asian_Width and Line_Break describe the behavior of characters in rendering; predicting those two properties in particular would significantly improve the rendering of text once characters are assigned. The other properties are more about the identity of the characters, and while the prediction could be accurate, assigning predicted values for unassigned characters may be misleading and may adversely affect invariants.

The proposal is to assign East_Asian_Width = W and Line_Break = ID to the unassigned code points in the first table.
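
As a rough illustration of the effect on an implementation, the following Python sketch resolves the two properties in three steps: data for assigned characters first, then the proposed values, then the current values for unassigned code points. It is a sketch under stated assumptions, not a description of any existing library: the function resolve, the dictionary layout, and the sample data are hypothetical, and for brevity the demonstration passes in only the two BMP unified-ideograph gaps from the first table.

```python
from typing import Dict, List, Tuple

Props = Dict[str, str]

# Values proposed in this document for the listed unassigned code points.
PROPOSED: Props = {"ea": "W", "lb": "ID"}
# Current values for unassigned BMP code points, per the property table.
CURRENT_UNASSIGNED_BMP: Props = {"ea": "N", "lb": "XX"}


def resolve(cp: int,
            assigned: Dict[int, Props],
            ranges: List[Tuple[int, int]]) -> Props:
    """Hypothetical lookup: assigned data wins, then the proposed values."""
    if cp in assigned:
        return dict(assigned[cp])
    if any(lo <= cp <= hi for lo, hi in ranges):
        return dict(PROPOSED)
    return dict(CURRENT_UNASSIGNED_BMP)


# Demonstration with a subset of the first table (BMP gaps only).
BMP_GAPS = [(0x4DB6, 0x4DBF), (0x9FC4, 0x9FFF)]
# Sample data for one assigned ideograph, U+4E00.
SAMPLE_ASSIGNED = {0x4E00: {"ea": "W", "lb": "ID"}}

if __name__ == "__main__":
    for cp in (0x4E00, 0x9FC4, 0x0378):
        print(f"U+{cp:04X}", resolve(cp, SAMPLE_ASSIGNED, BMP_GAPS))
```

Because assigned data is consulted first, the predicted values are not binding: once a code point is assigned, whatever values it receives at that time take precedence.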