L2/01-471

Title: Block Boundary Fixes for 10646
Date: November 30, 2001
Source: Ken Whistler
Action: For consideration by UTC and L2
References: L2/01-422

At the November, 2001 UTC meeting, the UTC considered L2/01-422,
which raised a number of issues regarding the formal block
boundaries for the Unicode Standard and 10646. As a result of the
discussion, consensus decisions were taken about all of the issues
pointed out in that document.

The required actions regarding changes to the NamesList.txt and
Blocks.txt data files for the Unicode Standard have all been
completed. The remaining action is to make the relevant fixes to
10646, which must be done either by amendment or technical
corrigendum.

I was tasked to write up the implications of the decisions the UTC
took regarding block boundaries for 10646, so that they could be
discussed again specifically in the context of preparation of ballot
comments for PDAM 2 to 10646-1 and PDAM 1 to 10646-2. The
implications are detailed below.

***********************************************************

1. 10646 currently has no blocks defined for the private use
characters in Planes 15 and 16. This is inconsistent, since a block
*is* defined for the BMP private use characters.

I suggest that the following text be requested to be added to
10646-2, as an addition to Annex A, as part of ballot comments to
PDAM 1 to 10646-2:

    A.5 Supplementary Private Use Blocks

    The following blocks are specified in the Supplementary Planes
    designated by P=0F and P=10:

        Block name                          Positions
        Supplementary Private Use Area-A    F0000-FFFFF
        Supplementary Private Use Area-B    100000-10FFFF

***********************************************************

2. Currently three BMP blocks in 10646 do not end on an even column
value (xxxF). Those three block definitions are:

        HANGUL SYLLABLES                AC00-D7A3
        ARABIC PRESENTATION FORMS-B     FE70-FEFE
        SPECIALS                        FFF0-FFFD

Each gap has a different reason.
The Hangul syllables block was defined as ending at D7A3 because
this was the 11,172 johab set, inherently inextensible. However, the
inextensibility issue has been dealt with alternatively by the later
definition of fixed collections. See the fixed collection
71 HANGUL SYLLABLES AC00-D7A3. By way of contrast, the big Han
character blocks were not terminated at the last assigned character,
but instead rounded up to xxxF. For consistency, the Hangul
syllables should be treated the same way.

The block for Arabic presentation forms-B was terminated at FEFE
because FEFF is the zero width no-break space, which was felt to be
something entirely distinct, not to be included in the block.
However, this block distinction is not actually honored in the
printing of the charts for 10646, and trying to make it be so would
just introduce complications and confusion to the standard. Hence,
it makes more sense to just include FEFF in the block and be done
with it, knowing that the blocks cannot be considered to be absolute
guidelines to the characters they contain, in any case.

The block for Specials, FFF0-FFFD, was terminated at FFFD, because
FFFE and FFFF were not valid character codes. However, this is also
inconsistent with the way the standard is actually printed.
Furthermore, since noncharacter code points are now included within
the Arabic Presentation Forms-A block, without a need to redefine
the block itself around those noncharacter code points, the omission
of the noncharacter code points FFFE and FFFF from the Specials
block is also inconsistent in that way.

For these reasons, and to bring the blocks into line with their main
actual function, which is assisting in the printing of the charts
for the standard, I suggest that the three blocks in question each
be extended to the end of their respective columns.
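
As an illustration only, extending a block to the end of its column
amounts to setting the low four bits of its end point, so that the
value ends in F. The following is a minimal sketch in Python; the
names round_up_to_column_end, extend_blocks and CURRENT_BLOCKS are
hypothetical and are not part of 10646 or of any Unicode data file.

```python
# Illustrative sketch only: names here are hypothetical, not from 10646.

# Current end points of the three BMP blocks that stop short of xxxF.
CURRENT_BLOCKS = {
    "HANGUL SYLLABLES": (0xAC00, 0xD7A3),
    "ARABIC PRESENTATION FORMS-B": (0xFE70, 0xFEFE),
    "SPECIALS": (0xFFF0, 0xFFFD),
}


def round_up_to_column_end(code_point: int) -> int:
    """Return the last position (xxxF) of the 16-cell column holding code_point."""
    return code_point | 0xF


def extend_blocks(blocks: dict) -> dict:
    """Keep each block's start and move its end to the end of its column."""
    return {
        name: (start, round_up_to_column_end(end))
        for name, (start, end) in blocks.items()
    }


if __name__ == "__main__":
    for name, (start, end) in extend_blocks(CURRENT_BLOCKS).items():
        print(f"{name:<30} {start:04X}-{end:04X}")
```

A block whose end point already has the form xxxF is left unchanged
by this operation, so applying it to every block would only affect
the three listed here.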
This could be formulated as a request for the following text changes
to 10646-1, as part of PDAM 2 to 10646-1:

    In Annex A.2, redefine the end points of the three blocks which
    do not currently end on an even column, so that all block
    definitions end at xxxF values:

    Modify:

        HANGUL SYLLABLES                AC00-D7A3
        ARABIC PRESENTATION FORMS-B     FE70-FEFE
        SPECIALS                        FFF0-FFFD

    To:

        HANGUL SYLLABLES                AC00-D7AF
        ARABIC PRESENTATION FORMS-B     FE70-FEFF
        SPECIALS                        FFF0-FFFF

***********************************************************

3. The inconsistent treatment of block definitions and fixed
collection definitions between Hangul syllables and the unified CJK
collections on the BMP should also be fixed.

Hangul syllables were given a fixed collection, and the block for
Hangul syllables was terminated at D7A3, not the end of a column.
But CJK Unified Ideographs and CJK Unified Ideographs Extension A
were given open collections, and their blocks were terminated at the
ends of their columns (with several extra open columns, in the case
of the CJK Unified Ideographs).

In reality all three collections are clearly fixed collections. They
were created as fixed repertoires, and cannot be extended by adding
a few more at the end. And use of the open code positions at the
ends of the last column of any of these blocks would *necessarily*
also involve separate repertoire definition and separate collections
-- they could not just be tacked on to the existing collections.

I suggest that this reality be acknowledged by also correcting the
collections for the two big CJK blocks to be fixed collections.

This could be formulated as a request for the following text changes
to 10646-1, as part of PDAM 2 to 10646-1:

    In Annex A.1, redefine the two big CJK Unified Ideograph
    collections to be fixed collections, as follows:

    Modify:

        60 CJK UNIFIED IDEOGRAPHS               4E00-9FFF
        81 CJK UNIFIED IDEOGRAPHS EXTENSION A   3400-4DBF

    To:

        60 CJK UNIFIED IDEOGRAPHS               4E00-9FA5 *
        81 CJK UNIFIED IDEOGRAPHS EXTENSION A   3400-4DB5 *
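
For reference, the arithmetic behind the fixed collections can be
checked with a short sketch. It computes the number of positions in
each fixed collection, and the number of open positions left between
the end of the collection and the end of its block. The block end
points used are the ones discussed in this document (D7AF for Hangul
syllables being the proposed value); the names RANGES, span_size and
summarize are hypothetical and chosen only for this illustration.

```python
# Illustrative sketch only: names here are hypothetical, not from 10646.

# name -> (collection first, collection last, block last)
# Block ends are as discussed in this document; D7AF is the proposed value.
RANGES = {
    "HANGUL SYLLABLES": (0xAC00, 0xD7A3, 0xD7AF),
    "CJK UNIFIED IDEOGRAPHS": (0x4E00, 0x9FA5, 0x9FFF),
    "CJK UNIFIED IDEOGRAPHS EXTENSION A": (0x3400, 0x4DB5, 0x4DBF),
}


def span_size(first: int, last: int) -> int:
    """Number of code positions in the inclusive range first..last."""
    return last - first + 1


def summarize(ranges: dict) -> dict:
    """Map each name to (collection size, open positions at end of block)."""
    return {
        name: (span_size(first, last), block_last - last)
        for name, (first, last, block_last) in ranges.items()
    }


if __name__ == "__main__":
    for name, (size, open_tail) in summarize(RANGES).items():
        print(f"{name:<36} {size:>6} positions, {open_tail:>3} open at end")
```

The open positions counted here are the ones that, as argued above,
could only be used through a separate repertoire definition and a
separate collection.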