Source: Kent Karlsson
Subject: Reaction to proposed change for Grapheme_Cluster_Break for version 6.1.0 (UAX 29)
Date: 2011/09/13

The "diff" files for draft Unicode 6.1.0 list:

=========================
-----------------------------------------------
GCB (Grapheme_Cluster_Break)

GCB 0378 'XX' -> 'CN'
GCB 0379 'XX' -> 'CN'
GCB 037F 'XX' -> 'CN'
GCB 0380 'XX' -> 'CN'
...
GCB EFFFB 'XX' -> 'CN'
GCB EFFFC 'XX' -> 'CN'
GCB EFFFD 'XX' -> 'CN'
GCB EFFFE 'XX' -> 'CN'
GCB EFFFF 'XX' -> 'CN'
GCB FFFFE 'XX' -> 'CN'
GCB FFFFF 'XX' -> 'CN'
GCB 10FFFE 'XX' -> 'CN'
GCB 10FFFF 'XX' -> 'CN'

1114112 + 0 - 0 # 866611 = 1114112 (0 ignored, 0 undefined)
========================

1,114,112, over a million, changes for Grapheme_Cluster_Break. Mainly for unallocated code positions.

Looking at http://www.unicode.org/reports/tr29/tr29-18.html (latest? draft), it says (some comments inline):

-----
Control [= CN]
    General_Category = Line Separator (Zl), or
    General_Category = Paragraph Separator (Zp), or
    General_Category = Control (Cc), or
    General_Category = Control (Cn), or
        [new, but Cn is not control characters, it is reserved code points *union* non-characters]
    General_Category = Control (Cs), or
        [new, but Cs is not control characters, it is (isolated if UTF-16) "surrogate" codes]
    General_Category = Format (Cf)
    and not U+000D CARRIAGE RETURN (CR)
    and not U+000A LINE FEED (LF)
    and not U+200C ZERO WIDTH NON-JOINER (ZWNJ)
    and not U+200D ZERO WIDTH JOINER (ZWJ)
---

The original proposal, http://www.unicode.org/L2/L2011/11266-uax29.html, suggested adding "the three odd-ball cases ([:cn:][:cs:][:co:]) to [:gcb:control:]" (Co is "private-use").

It's fine that Co (private-use) did not get the gcb:control property. But still Cn (reserved) got added as gcb:control. The latter is ok for non-characters (a small subset of Cn), but not for reserved code points in general.

I think the vast majority of Cn code points (namely those that are not non-characters) should stay XX (Unknown) for GCB, just like private-use code points. When a Cn code point gets allocated to a character, it is rarely to a (format) control character.
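
To make the suggested split concrete, here is a minimal sketch in Python. It is only an illustration of the assignment argued for above, not text from UAX 29 or the UCD. The function names `is_noncharacter` and `suggested_gcb` are hypothetical, and the General_Category values come from whatever Unicode version the local Python `unicodedata` module carries, so results for reserved code points reflect that version rather than draft 6.1.0.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF plus the last
    two code points of each of the 17 planes (U+nFFFE, U+nFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def suggested_gcb(cp: int):
    """Hypothetical helper: the GCB value suggested in this feedback
    for Cn and Co code points. Returns None for every other
    General_Category, which this sketch does not address."""
    gc = unicodedata.category(chr(cp))
    if gc == "Cn":
        # Noncharacters may be Control (CN); reserved code points stay XX.
        return "CN" if is_noncharacter(cp) else "XX"
    if gc == "Co":
        # Private-use code points stay XX, as in the draft.
        return "XX"
    return None


if __name__ == "__main__":
    for cp in (0x0378, 0xE000, 0xFDD0, 0xEFFFD, 0xEFFFE, 0x10FFFF):
        print(f"U+{cp:04X} gc={unicodedata.category(chr(cp))} "
              f"suggested GCB={suggested_gcb(cp)}")
```

Under these assumptions, only the 66 noncharacters out of all Cn code points would change from XX to CN, instead of every unallocated code position.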