L2/01-419
From: Mark Davis
Sent: Tuesday, October 30, 2001
Subject: UTC Agenda: Block Boundary Fixes

When looking at Blocks.txt lately, I noticed some oddities. Since blocks are simply a way of organizing ranges of code points, fixing these would make things easier for implementations and more consistent. (Of course, we don't recommend that people depend on blocks anyway: within blocks, code points must be tested for whether they are assigned to characters or not, and what type of characters they are. If implementations don't do that, they will get a great many incorrect results!)

1. Currently we have three discontiguous blocks.

FEFF..FEFF; Specials
FFF0..FFFD; Specials

E000..F8FF; Private Use
F0000..FFFFD; Private Use
100000..10FFFD; Private Use

In implementations, this discontinuity is clumsy. This situation is also anomalous: in all other cases we ensure that there are no discontiguous blocks by adding a letter suffix, such as:

FB50..FDFF; Arabic Presentation Forms-A
FE70..FEFE; Arabic Presentation Forms-B

2. There are a few blocks that don't contain complete columns. That is, the range is not of the form xxxxx0..yyyyyF. Having all blocks end on a full column makes behavior more uniform and leaves fewer exceptions for implementations. These uncolumnated blocks fall into two groups:

Group 1:

3400..4DB5; CJK Unified Ideographs Extension A
AC00..D7A3; Hangul Syllables
20000..2A6D6; CJK Unified Ideographs Extension B

These are also inconsistent with other blocks. Many blocks, including CJK Unified Ideographs, finish with a full column even though some of the code points at the end are unused.

Note: in some cases here we are inconsistent with 10646, since it ends with F for some of these.

Group 2:

FE70..FEFE; Arabic Presentation Forms-B
FFF0..FFFD; Specials
F0000..FFFFD; Private Use
100000..10FFFD; Private Use

This group is uncolumnated because of noncharacters or the BOM.
Yet we have weird code points in other blocks, and now noncharacters in the middle of other blocks, so that shouldn't stop us from fixing them. For example:

FDD0..FDEF ; Noncharacter_Code_Point # Cn [32]

are in the middle of:

FB50..FDFF; Arabic Presentation Forms-A

PROPOSAL

Address these issues by changing the blocks in the following ways, and communicating a request to WG2 for them to do the same.

Note: I don't have 10646 at hand, but I believe that they even extend some of these blocks to a row boundary (xxxx7F or xxxxFF). Where they do, we should simply follow their lead.

Current:

3400..4DB5; CJK Unified Ideographs Extension A
AC00..D7A3; Hangul Syllables
FE70..FEFE; Arabic Presentation Forms-B
FEFF..FEFF; Specials
FFF0..FFFD; Specials
20000..2A6D6; CJK Unified Ideographs Extension B
F0000..FFFFD; Private Use
100000..10FFFD; Private Use

Proposed:

3400..4DBF; CJK Unified Ideographs Extension A
AC00..D7AF; Hangul Syllables
FE70..FEFF; Arabic Presentation Forms-B
FFF0..FFFF; Specials
20000..2A6DF; CJK Unified Ideographs Extension B
F0000..10FFFF; Private Use-A
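
The two properties discussed in this message (every range having the form xxxxx0..yyyyyF, and no block name covering more than one range) can be checked mechanically. The following Python is only an illustrative sketch, not part of the proposal: the helper names parse_blocks, uncolumnated, and discontiguous are hypothetical, and the data is limited to the ranges quoted in the proposal rather than a full Blocks.txt.

```python
import re
from collections import Counter

# Ranges copied from the proposal in this message (not a full Blocks.txt).
CURRENT = """\
3400..4DB5; CJK Unified Ideographs Extension A
AC00..D7A3; Hangul Syllables
FE70..FEFE; Arabic Presentation Forms-B
FEFF..FEFF; Specials
FFF0..FFFD; Specials
20000..2A6D6; CJK Unified Ideographs Extension B
F0000..FFFFD; Private Use
100000..10FFFD; Private Use
"""

PROPOSED = """\
3400..4DBF; CJK Unified Ideographs Extension A
AC00..D7AF; Hangul Syllables
FE70..FEFF; Arabic Presentation Forms-B
FFF0..FFFF; Specials
20000..2A6DF; CJK Unified Ideographs Extension B
F0000..10FFFF; Private Use-A
"""

# One data line in the "start..end; name" layout used above.
LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+);\s*(.+)$")


def parse_blocks(text):
    """Return a list of (start, end, name) tuples from 'start..end; name' lines."""
    blocks = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized block line: {raw!r}")
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        blocks.append((start, end, match.group(3).strip()))
    return blocks


def uncolumnated(blocks):
    """Blocks whose range is not of the form xxxxx0..yyyyyF."""
    return [b for b in blocks if b[0] % 16 != 0 or b[1] % 16 != 15]


def discontiguous(blocks):
    """Block names that are attached to more than one range."""
    counts = Counter(name for _, _, name in blocks)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    for label, text in (("current", CURRENT), ("proposed", PROPOSED)):
        blocks = parse_blocks(text)
        print(label, "uncolumnated:")
        for start, end, name in uncolumnated(blocks):
            print(f"  {start:04X}..{end:04X}; {name}")
        print(label, "discontiguous:", discontiguous(blocks))
```

Run against the ranges above, this sketch flags all eight current ranges as uncolumnated and reports Private Use and Specials as discontiguous, while the proposed ranges pass both checks.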