L2/01-419
From: Mark Davis
Sent: Tuesday, October 30, 2001
Subject: UTC Agenda: Block Boundary Fixes
When looking at Blocks.txt lately, I noticed some oddities. Since blocks
are simply a way of organizing ranges of code points, fixing these would
make the data easier for implementations to handle and more consistent. (Of course, we don't
recommend that people depend on blocks anyway: within blocks code points
must be tested for whether they are assigned to characters or not, and what
type of characters they are. If implementations don't do that, they
will get a great many incorrect results!)
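A minimal sketch of the per-code-point test described above, assuming Python and its standard unicodedata module. The helper names is_assigned and describe are hypothetical, and the answers for any given code point depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_assigned(cp: int) -> bool:
    """Return True if the code point's General_Category is not Cn.

    Cn covers both unassigned code points and noncharacters, so a block
    name alone says nothing about whether this returns True.
    """
    return unicodedata.category(chr(cp)) != "Cn"


def describe(cp: int) -> str:
    """Format a code point together with its General_Category."""
    return f"U+{cp:04X} category={unicodedata.category(chr(cp))}"


if __name__ == "__main__":
    # U+FDD0 sits inside Arabic Presentation Forms-A but is a noncharacter.
    for cp in (0x0041, 0xE000, 0xFDD0, 0xFFFF):
        print(describe(cp), "assigned" if is_assigned(cp) else "not assigned")
```

The point of the sketch is that membership in a block and assignment to a character are separate questions, so both have to be checked.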
1. Currently we have three discontiguous blocks.
FEFF..FEFF; Specials
FFF0..FFFD; Specials
E000..F8FF; Private Use
F0000..FFFFD; Private Use
100000..10FFFD; Private Use
In implementations, this discontinuity is clumsy. This situation is also
anomalous: in all other cases we ensure that there are no discontiguous
blocks by adding a letter suffix, such as:
FB50..FDFF; Arabic Presentation Forms-A
FE70..FEFE; Arabic Presentation Forms-B
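As an illustration, a short Python sketch that finds block names occurring on more than one line. The sample data is copied from the ranges quoted above; the helper names parse_blocks and discontiguous_names are hypothetical, and the parser only assumes the "start..end; name" layout shown in this note.

```python
from collections import defaultdict

# Sample lines taken from the ranges quoted in this note.
BLOCKS_SAMPLE = """\
FEFF..FEFF; Specials
FFF0..FFFD; Specials
E000..F8FF; Private Use
F0000..FFFFD; Private Use
100000..10FFFD; Private Use
FB50..FDFF; Arabic Presentation Forms-A
FE70..FEFE; Arabic Presentation Forms-B
"""


def parse_blocks(text):
    """Parse 'start..end; name' lines into (start, end, name) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.split("..")
        ranges.append((int(start, 16), int(end, 16), name.strip()))
    return ranges


def discontiguous_names(ranges):
    """Return {name: [(start, end), ...]} for names with more than one range."""
    by_name = defaultdict(list)
    for start, end, name in ranges:
        by_name[name].append((start, end))
    return {name: spans for name, spans in by_name.items() if len(spans) > 1}


if __name__ == "__main__":
    for name, spans in discontiguous_names(parse_blocks(BLOCKS_SAMPLE)).items():
        print(name, [f"{s:04X}..{e:04X}" for s, e in spans])
```

With the letter-suffix convention, each name maps to exactly one range and the second function returns an empty dictionary.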
2. There are a few blocks that don't contain complete columns. That is, the
range is not of the form: xxxxx0..yyyyyF. Having all blocks start and end on a
full column makes behavior more uniform and leaves fewer exceptions for
implementations.
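The "complete columns" condition can be written as a simple bit test. The following is a minimal Python sketch; the function name is_full_columns is hypothetical.

```python
def is_full_columns(start: int, end: int) -> bool:
    """True if the range has the form xxxxx0..yyyyyF.

    A column is 16 code points, so the low four bits of the start must be
    0x0 and the low four bits of the end must be 0xF.
    """
    return (start & 0xF) == 0x0 and (end & 0xF) == 0xF


if __name__ == "__main__":
    # Ranges quoted in this note.
    for start, end in ((0xFB50, 0xFDFF), (0x3400, 0x4DB5), (0xFE70, 0xFEFE)):
        print(f"{start:04X}..{end:04X}", is_full_columns(start, end))
```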
These uncolumnated blocks fall into two groups:
Group 1:
3400..4DB5; CJK Unified Ideographs Extension A
AC00..D7A3; Hangul Syllables
20000..2A6D6; CJK Unified Ideographs Extension B
These are also inconsistent with other blocks. Many blocks, including CJK
Unified Ideographs, finish with a full column even though some of the code
points at the end are unused. Note: in some cases here we are inconsistent
with 10646, since 10646 ends some of these blocks on a code point ending in F.
Group 2:
FE70..FEFE; Arabic Presentation Forms-B
FFF0..FFFD; Specials
F0000..FFFFD; Private Use
100000..10FFFD; Private Use
This group is uncolumnated because of noncharacters or the BOM. Yet we have
weird code points in other blocks, and now noncharacters in the middle of
other blocks, so that shouldn't stop us from fixing them. For example:
FDD0..FDEF ; Noncharacter_Code_Point # Cn [32]
are in the middle of:
FB50..FDFF; Arabic Presentation Forms-A
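To make the example concrete, here is a small Python sketch of a noncharacter test. It assumes the standard definition of noncharacters (FDD0..FDEF plus the last two code points of every plane); the function names are hypothetical.

```python
def is_noncharacter(cp: int) -> bool:
    """True for FDD0..FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def range_contains(outer, inner) -> bool:
    """True if the (start, end) pair `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


if __name__ == "__main__":
    forms_a = (0xFB50, 0xFDFF)     # Arabic Presentation Forms-A
    nonchars = (0xFDD0, 0xFDEF)    # the noncharacter range quoted above
    print(range_contains(forms_a, nonchars))
    print(sum(is_noncharacter(cp) for cp in range(forms_a[0], forms_a[1] + 1)))
```

The same test also covers the code points that currently cut Group 2 short, such as FFFE, FFFF, FFFFE, FFFFF, 10FFFE and 10FFFF. The BOM (FEFF) is not a noncharacter, so it needs no special handling here.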
PROPOSAL
Address these issues by changing the blocks in the following ways, and
communicating a request to WG2 for them to do the same. Note: I don't have
10646 at hand, but I believe that they even extend some of these blocks to a
row boundary (xxxx7F or xxxxFF). Where they do, we should simply follow
their lead.
From:
3400..4DB5; CJK Unified Ideographs Extension A
AC00..D7A3; Hangul Syllables
FE70..FEFE; Arabic Presentation Forms-B
FEFF..FEFF; Specials
FFF0..FFFD; Specials
20000..2A6D6; CJK Unified Ideographs Extension B
F0000..FFFFD; Private Use
100000..10FFFD; Private Use
To:
3400..4DBF; CJK Unified Ideographs Extension A
AC00..D7AF; Hangul Syllables
FE70..FEFF; Arabic Presentation Forms-B
FFF0..FFFF; Specials
20000..2A6DF; CJK Unified Ideographs Extension B
F0000..10FFFF; Private Use-A
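For the blocks that keep their name, the proposed end points are just the current end points rounded up to the end of their column. A minimal Python sketch of that arithmetic follows; the names round_up and proposed_lines are hypothetical. The Private Use ranges are left out because the proposal merges and renames them rather than only rounding.

```python
# Current ranges quoted in this note (Private Use omitted; see lead-in).
CURRENT = [
    (0x3400, 0x4DB5, "CJK Unified Ideographs Extension A"),
    (0xAC00, 0xD7A3, "Hangul Syllables"),
    (0xFE70, 0xFEFE, "Arabic Presentation Forms-B"),
    (0xFFF0, 0xFFFD, "Specials"),
    (0x20000, 0x2A6D6, "CJK Unified Ideographs Extension B"),
]


def round_up(cp: int, size: int = 0x10) -> int:
    """Round cp up to the last code point of its size-aligned group.

    size must be a power of two: 0x10 for a column, 0x80 or 0x100 for the
    row boundaries (xxxx7F or xxxxFF) mentioned in the proposal.
    """
    return cp | (size - 1)


def proposed_lines(ranges=CURRENT, size: int = 0x10):
    """Format each range with its end rounded up, in Blocks.txt style."""
    return [f"{s:04X}..{round_up(e, size):04X}; {name}" for s, e, name in ranges]


if __name__ == "__main__":
    print("\n".join(proposed_lines()))
```

Calling proposed_lines(size=0x100) instead shows what the same ranges would look like if they were extended to a full row, should 10646 turn out to do that.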