L2/01-422

Title: Response to L2/01-419 Block Boundary Fixes
Author: Ken Whistler
Date: October 30, 2001

Mark Davis has suggested a number of fixes to Blocks.txt, to
eliminate some inconsistencies and to try to establish an
invariant that all block boundaries end on an XXXF boundary.
As usual, in all things Unicode-related, there are some
worms (I'm not sure whether they should be considered big
wriggly earthworms or just nematodes) in this can.
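
For illustration, the proposed invariant can be stated as a one-line
test on the last code point of a block. The following is a minimal
Python sketch, not part of any Unicode tooling; the function name and
the sample table are hypothetical, and the end points are copied from
the Blocks.txt lines quoted in the listings below.

```python
def ends_on_f_boundary(end):
    """True if the last code point of a block has F as its final hex digit."""
    return (end & 0xF) == 0xF

# End points as they appear in the Blocks.txt lines quoted in this document.
samples = {
    "IPA Extensions": 0x02AF,
    "Hangul Syllables": 0xD7A3,
    "Specials": 0xFFFD,
    "Arabic Presentation Forms-B": 0xFEFE,
}

# Blocks that would violate the proposed invariant.
violations = sorted(name for name, end in samples.items()
                    if not ends_on_f_boundary(end))
print(violations)
```

Only the first of the four sample blocks passes; the other three are
taken up under Types 9, 10, and 13.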

So as a response to The Great Innovator (Mark), The
Great Disinnovator (me) has assembled the analysis below of
*all* anomalies in block names. These fall into 13 distinct
types, for each of which I give a separate analysis and
a suggested disposition.

In some instances, I think Mark's suggestions are fine, but
in other cases, I'd rather we left well enough alone and
abandoned the quest for the invariant.

In any case, I'd like to get clear policy statements from
the UTC on my suggested resolutions, so that after we
deal with these wrigglers this time, we won't ever have
to revisit them again.

================================================================

Complete list of discrepancies in blocks between
NamesList.txt (used to print the Unicode and 10646 code
charts), Blocks.txt, and the listings in Annex A of
10646-1 and 10646-2, including the FDAM 1 for 10646-1.

The listings below contain triples. The first line is
the header entry from NamesList-3.2.0d2.txt, the second line is
from Blocks-5d2.txt (both from the Unicode 3.2 BETA directory), 
and the third line is from Annex A of
10646-1 or 10646-2. Block-related items which match completely in
name (except for casing) and in start and stop ranges
are omitted -- except for Hangul Syllables, which is one
where the ranges themselves are in contention.

The listings are grouped by type, with my suggested resolution
for each type of discrepancy.
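
As a rough sketch of how such a triple can be compared mechanically,
the Python below parses the three line formats as they are laid out in
this document (tab-separated for the NamesList.txt header and the
10646 entry, "start..end; name" for Blocks.txt). All function names
are hypothetical, and the parenthetical alternate names used in some
NamesList.txt headers are not handled.

```python
def parse_nameslist(line):
    # Layout used here: START <tab> Name <tab> END
    start, name, end = line.split("\t")
    return name, int(start, 16), int(end, 16)

def parse_blocks(line):
    # Layout used here: START..END; Name
    rng, name = line.split(";")
    start, end = rng.split("..")
    return name.strip(), int(start, 16), int(end, 16)

def parse_10646(line):
    # Layout used here: NAME <tab> START-END
    name, rng = line.rsplit("\t", 1)
    start, end = rng.split("-")
    return name, int(start, 16), int(end, 16)

def discrepancies(nameslist_line, blocks_line, iso_line):
    """Return which of name/start/end differ across the three sources."""
    entries = [parse_nameslist(nameslist_line),
               parse_blocks(blocks_line),
               parse_10646(iso_line)]
    issues = []
    if len({e[0].casefold() for e in entries}) > 1:  # casing is ignored
        issues.append("name")
    if len({e[1] for e in entries}) > 1:
        issues.append("start")
    if len({e[2] for e in entries}) > 1:
        issues.append("end")
    return issues

math_a = discrepancies(
    "27D0\tMiscellaneous Mathematical Symbols-A\t27EF",
    "27C0..27EF; Miscellaneous Mathematical Symbols-A",
    "MISCELLANEOUS MATHEMATICAL SYMBOLS-A\t27C0-27EF",
)
hangul = discrepancies(
    "AC00\tHangul Syllables\tD7A3",
    "AC00..D7A3; Hangul Syllables",
    "HANGUL SYLLABLES\tAC00-D7A3",
)
print(math_a, hangul)
```

The first triple reports a start mismatch only (Type 8); the Hangul
Syllables triple reports nothing, which is why it would have been
omitted were its range not in contention.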

================================================================

TYPE 1: 10646 block name contains a parenthetical
addition to the name.

0250	IPA Extensions	02AF
0250..02AF; IPA Extensions
IPA (INTERNATIONAL PHONETIC ALPHABET) EXTENSIONS	0250-02AF

3190	Kanbun	319F
3190..319F; Kanbun
KANBUN (CJK miscellaneous)	3190-319F

Analysis: The parenthetical additions
are considered annotative in 10646, and otherwise these blocks
match exactly in name and ranges.

Suggested resolution: No action. 

================================================================

TYPE 2: 10646 block name and ranges differ, for control codes
in the 0000..00FF range.

0000	C0 Controls and Basic Latin (Basic Latin)	007F
0000..007F; Basic Latin
BASIC LATIN	0020-007E

0080	C1 Controls and Latin-1 Supplement (Latin-1 Supplement)	00FF
0080..00FF; Latin-1 Supplement
LATIN-1 SUPPLEMENT	00A0-00FF

Analysis: This is a long-standing discrepancy, where 10646
formally excludes the control codes 0000..001F, 007F..009F from
the block definitions (and collections), whereas the Unicode
Standard has always included those ranges in Blocks.txt and
in the code charts. The Unicode names list and Blocks.txt
treatment are self-consistent in ranges. The Blocks.txt
names match the 10646 block names, although the ranges differ.
The parenthetical elements in the NamesList.txt entries are
used to generate the correct block names when printing
10646 code charts.
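
The relationship between the two sets of ranges can be checked with a
few lines of Python. This is only a sketch with hypothetical names: it
removes the control code ranges named above (0000..001F and
007F..009F) from the Unicode block ranges and reports what is left.

```python
# Control code ranges excluded from the 10646 block definitions.
CONTROLS = [range(0x0000, 0x0020), range(0x007F, 0x00A0)]

def without_controls(start, end):
    """First and last code point of start..end once controls are removed.

    Assumes the remaining code points are contiguous, which holds for
    the two blocks discussed here.
    """
    kept = [cp for cp in range(start, end + 1)
            if not any(cp in ctl for ctl in CONTROLS)]
    return kept[0], kept[-1]

basic_latin = without_controls(0x0000, 0x007F)
latin1_supplement = without_controls(0x0080, 0x00FF)
print(f"{basic_latin[0]:04X}-{basic_latin[1]:04X}")
print(f"{latin1_supplement[0]:04X}-{latin1_supplement[1]:04X}")
```

The results are 0020-007E and 00A0-00FF, the ranges shown in the 10646
lines above.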

Suggested resolution: No action. Leave this sleeping dog alone.

================================================================

TYPE 3: 10646 block name differs, for a block which is not
yet standardized in Unicode.

0500	Cyrillic Supplement	052F
0500..052F; Cyrillic Supplement
CYRILLIC SUPPLEMENTARY	0500-052F

Analysis: This is a result of what is probably an editorial
oversight in 10646. The new *collection* is called
"CYRILLIC SUPPLEMENT", but the new *block* is called
"CYRILLIC SUPPLEMENTARY".

Suggested resolution: Update Blocks.txt and NamesList.txt to
match 10646. Since these additions haven't even been approved
yet, this can be considered simply a beta bug fix. Also
submit an editorial correction to WG2, suggesting that
the 10646 collection name be brought back in synch with the
block name.

================================================================

TYPE 4: 10646 block name differs, for a block which is
already standardized in Unicode.

E000	Private Use Area	F8FF
E000..F8FF; Private Use
PRIVATE USE AREA	E000-F8FF

Analysis: This has been out of synch for a while. The names
list entry doesn't actually get printed out anywhere, but
matches the 10646 block name.

Suggested resolution: Update Blocks.txt to use the same
block name as 10646. This will have the advantage of making
the normative block name match everybody's abbreviation
of it as "PUA".

================================================================

TYPE 5: 10646 has no block; Unicode does, and the names and
ranges are problematical in Unicode.

FFF80	Private Use	FFFFF
F0000..FFFFD; Private Use
[10646 ---]

10FF80	Private Use	10FFFF
100000..10FFFD; Private Use
[10646 ---]

Analysis: The shortened block ranges in Blocks.txt are the
result of leaving noncharacters out of the ranges. But as
Mark pointed out, now that we have a range of noncharacters
*inside* a block (Arabic Presentation Forms-A), this is
inconsistent. The anomalous start ranges for NamesList.txt
are intentionally placed there, to trigger the printing
of single 8-column code charts that display the two
noncharacters in each range, without printing the rest
of the entire plane.
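
To make the point about noncharacters concrete, here is a small Python
sketch (hypothetical function names) that lists the code points left
out after the end of each Blocks.txt range and confirms that they are
noncharacters. It assumes the usual definition: the last two code
points of every plane, plus FDD0..FDEF, which is the range that falls
inside Arabic Presentation Forms-A.

```python
def is_noncharacter(cp):
    last_two_of_plane = (cp & 0xFFFE) == 0xFFFE   # xxFFFE and xxFFFF
    in_fdd0_range = 0xFDD0 <= cp <= 0xFDEF        # inside Arabic Pres. Forms-A
    return last_two_of_plane or in_fdd0_range

def tail_left_out(block_end):
    """Code points after block_end, up to the end of its plane."""
    plane_end = block_end | 0xFFFF
    return list(range(block_end + 1, plane_end + 1))

for end in (0xFFFFD, 0x10FFFD):
    tail = tail_left_out(end)
    print([f"{cp:X}" for cp in tail],
          all(is_noncharacter(cp) for cp in tail))
```

Each range stops two code points short of its plane, and both omitted
code points are noncharacters, while FDD0 sits in the middle of a block
that nobody proposes to split.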

Suggested resolution: Coalesce these two ranges into a
single range: F0000..10FFFF, and rename it "Private Use Area-A"
or perhaps better: "Supplementary Private Use Area".
For consistency between Unicode and 10646, and between
10646-1 and 10646-2, suggest additional text for PDAM 1
to 10646-2, so as to define the Supplementary
Private Use Planes (Planes 0F and 10), and to add a
collection and a block, consistent with the way the BMP
PUA is handled in 10646-1. Note that if the 10646 editor
thinks it will be problematical to try to define a single
"block" in 10646 that spans two planes, it would be better
for Unicode to just define two blocks: PUA-A and PUA-B,
or whatever, so as not to keep propagating the inconsistencies.

================================================================

TYPE 6: 10646 has no block; Unicode does, and the names and
ranges are *not* problematical in Unicode.

D800	High Surrogates	DB7F
D800..DB7F; High Surrogates
[10646 ---]

DB80	High Private Use Surrogates	DBFF
DB80..DBFF; High Private Use Surrogates
[10646 ---]

DC00	Low Surrogates	DFFF
DC00..DFFF; Low Surrogates
[10646 ---]

Analysis: This is the result of the somewhat anomalous character
model history of Unicode, where surrogate code units and
surrogate code points have been conflated. The 10646 situation,
with no blocks assigned to surrogate code points, is formally
more correct, in my opinion, since the code points are
unavailable for encoding of characters. However, the Unicode
entries in Blocks.txt are longstanding and probably not
problematical. The entries in NamesList.txt are not actually
printed in code charts.

Suggested resolution: No action.

================================================================

TYPE 7: Neither 10646 nor Unicode has a block, and anomalous
start ranges appear in NamesList.txt.

1FF80	Unassigned	1FFFF
2FF80	Unassigned	2FFFF
3FF80	Unassigned	3FFFF
4FF80	Unassigned	4FFFF
5FF80	Unassigned	5FFFF
6FF80	Unassigned	6FFFF
7FF80	Unassigned	7FFFF
8FF80	Unassigned	8FFFF
9FF80	Unassigned	9FFFF
AFF80	Unassigned	AFFFF
BFF80	Unassigned	BFFFF
CFF80	Unassigned	CFFFF
DFF80	Unassigned	DFFFF
EFF80	Unassigned	EFFFF

Analysis: These are all deliberate manipulations in NamesList.txt,
to enable the printing out of single 8-column charts showing the
noncharacters at the end of each plane, without printing out
the entire planes.
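
The start values in the list above follow directly from the chart
layout. As a sketch, assuming 16 code points per chart column (so that
8 columns cover 128 code points), the entries for Planes 1 through 14
can be regenerated as follows; the names are hypothetical and this is
not the actual chart-printing tool.

```python
COLUMNS = 8             # one 8-column chart per plane end
CELLS_PER_COLUMN = 16   # assumption: 16 code points per chart column

def plane_end_chart(plane):
    """Start and end of the last 8 chart columns of the given plane."""
    plane_end = (plane << 16) | 0xFFFF
    start = plane_end - COLUMNS * CELLS_PER_COLUMN + 1
    return start, plane_end

# Planes 1 through 14 (hex 1..E), in the tab-separated layout used above.
lines = [f"{start:X}\tUnassigned\t{end:X}"
         for start, end in map(plane_end_chart, range(1, 15))]
print("\n".join(lines))
```

The same arithmetic gives FFF80 and 10FF80 for Planes 15 and 16, the
start values seen under Type 5.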

Suggested resolution: No action.

================================================================

TYPE 8: The block names in 10646 and Unicode match, for
characters not yet published in Unicode, but there is a
start range inconsistency between Blocks.txt and NamesList.txt.

27D0	Miscellaneous Mathematical Symbols-A	27EF
27C0..27EF; Miscellaneous Mathematical Symbols-A
MISCELLANEOUS MATHEMATICAL SYMBOLS-A	27C0-27EF

Analysis: This was the result of a misunderstanding about
where the Miscellaneous Mathematical Symbols-A block
started, since the characters were not encoded from the
first column of the block, but from the second. This
has already been fixed in Blocks.txt for the Unicode 3.2
beta, but has not yet been updated in NamesList.txt.

Suggested resolution: Simply update the start range in
NamesList.txt editorially.

================================================================

TYPE 9: The block name and range match in 10646 and
Unicode, but the end range is problematical.

FFF0	Specials	FFFF
FFF0..FFFD; Specials
SPECIALS	FFF0-FFFD

Analysis: This is another example of the non-inclusion of
noncharacters in the ranges for blocks. For consistency,
the ranges should just be extended. Note that the
end range in NamesList.txt was deliberate, to force the
printing out of the two noncharacters in the chart.

Suggested resolution: Approve extension of Specials block
to FFF0..FFFF. For 10646, suggest additional text for
PDAM 2 to extend the SPECIALS block to FFFF.

================================================================

TYPE 10: Anomalous handling of blocks involving the BOM.

FE70	Arabic Presentation Forms-B	FEFF
FE70..FEFE; Arabic Presentation Forms-B
ARABIC PRESENTATION FORMS-B	FE70-FEFE

[NamesList.txt: no entry]
FEFF..FEFF; Specials
[10646 ---]

Analysis: The NamesList.txt treatment is deliberate, to
force the printing of the BOM on the correct code chart
page. The Blocks.txt treatment was also a deliberate
change from the earlier situation which had a discontiguous
and overlapping definition of the Specials block. Cf.
Blocks-1.txt (Unicode 2.0):

FE70; FEFF; Arabic Presentation Forms-B
FF00; FFEF; Halfwidth and Fullwidth Forms
FEFF; FEFF; Specials
FFF0; FFFF; Specials

Those "Specials" blocks were derived from book headers, which
accounts for the FEFF, FFF0-FFFF definition. That was deliberately
fixed in Blocks-2.txt (Unicode 2.1.9) to:

FE70; FEFE; Arabic Presentation Forms-B
FEFF; FEFF; Specials
FF00; FFEF; Halfwidth and Fullwidth Forms
FFF0; FFFD; Specials

(at the request of Mark, by the way) to avoid blocks overlapping
or being out of numerical order in Blocks.txt.
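
The two properties at stake, numerical order and no overlap, are easy
to test. The Python below is a sketch with hypothetical names, run
against the two four-line excerpts quoted above.

```python
from itertools import combinations

def block_problems(blocks):
    """Report ordering and overlap problems in a list of (start, end, name)."""
    problems = []
    # Order: each entry must not start before the entry preceding it.
    for (s1, _, n1), (s2, _, n2) in zip(blocks, blocks[1:]):
        if s2 < s1:
            problems.append(f"out of order: {n2} after {n1}")
    # Overlap: compare every pair, not just neighbours.
    for (s1, e1, n1), (s2, e2, n2) in combinations(blocks, 2):
        if s1 <= e2 and s2 <= e1:
            problems.append(f"overlap: {n1} / {n2}")
    return problems

blocks_1 = [  # excerpt from Blocks-1.txt (Unicode 2.0)
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0xFF00, 0xFFEF, "Halfwidth and Fullwidth Forms"),
    (0xFEFF, 0xFEFF, "Specials"),
    (0xFFF0, 0xFFFF, "Specials"),
]
blocks_2 = [  # excerpt from Blocks-2.txt (Unicode 2.1.9)
    (0xFE70, 0xFEFE, "Arabic Presentation Forms-B"),
    (0xFEFF, 0xFEFF, "Specials"),
    (0xFF00, 0xFFEF, "Halfwidth and Fullwidth Forms"),
    (0xFFF0, 0xFFFD, "Specials"),
]
problems_1 = block_problems(blocks_1)
problems_2 = block_problems(blocks_2)
print(problems_1)
print(problems_2)
```

The Unicode 2.0 excerpt yields one ordering problem and one overlap
(both involving FEFF); the Unicode 2.1.9 excerpt yields none.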

Suggested resolution: Rename the block for BOM to "Specials-BOM"
in Blocks.txt. Just live with the existence of this as a block
in Unicode but not present in 10646, which talks about U+FEFF
as a signature; alternatively, suggest text for PDAM 2 to
add a "Specials-BOM" block to 10646. The alternative, of
absorbing FEFF into the Arabic Presentation Forms-B block, has
ramifications that are probably worse, since it would not only require
changing a block boundary in 10646, but *also* would impact the
collection for Arabic Presentation Forms-B, and would raise
questions, since unlike the noncharacters, it would be adding
an encoded character of completely different type to a
long-standing collection in 10646. If we *must* have the
block end on an F, then be prepared to provide all the detailed
justification for suggested text in PDAM 2. 

================================================================

TYPE 11: Block ranges match in Unicode and 10646, for
blocks with generated character names, but NamesList.txt
shows a mismatched range.

4E00	CJK Unified Ideographs	9FA5
4E00..9FFF; CJK Unified Ideographs
CJK UNIFIED IDEOGRAPHS	4E00-9FFF

Analysis: The range distinction in NamesList.txt is deliberate,
to enable calculation of the cutoff point in the charts,
where there are no actual character name entries in NamesList.txt
to drive this.

Suggested resolution: No action.

================================================================

TYPE 12: The end block range does not match in Unicode and
10646, for blocks with generated character names.

3400	CJK Unified Ideographs Extension A	4DB5
3400..4DB5; CJK Unified Ideographs Extension A
CJK UNIFIED IDEOGRAPHS EXTENSION A	3400-4DBF

20000	CJK Unified Ideographs Extension B	2A6D6
20000..2A6D6; CJK Unified Ideographs Extension B
CJK UNIFIED IDEOGRAPHS EXTENSION B	20000-2A6DF

Analysis: These two instances are later additions to
Blocks.txt (for Unicode 3.0 and Unicode 3.1, respectively),
where the end range was figured based on the NamesList.txt
treatment and Hangul Syllables, rather than matching
the CJK Unified Ideographs block.

Suggested resolution: Update the end range in Blocks.txt
to match 10646.
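
As a quick check on what the update amounts to, the sketch below
(hypothetical names) rounds each current Blocks.txt end point up to the
next F boundary and compares the result with the 10646 end point from
the listings above.

```python
def round_up_to_f(end):
    """Smallest code point >= end whose last hex digit is F."""
    return end | 0xF

# name: (end in Blocks.txt, end in 10646), from the listings above
ends = {
    "CJK Unified Ideographs Extension A": (0x4DB5, 0x4DBF),
    "CJK Unified Ideographs Extension B": (0x2A6D6, 0x2A6DF),
}
results = {name: round_up_to_f(unicode_end) == iso_end
           for name, (unicode_end, iso_end) in ends.items()}
for name, matches in results.items():
    print(name, matches)
```

For both blocks the 10646 end point is simply the last assigned
ideograph rounded up to the end of its chart column.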

================================================================

TYPE 13: Unicode and 10646 match in every respect, but
Mark is suggesting that the blocks should be redefined to
end at an F boundary.

AC00	Hangul Syllables	D7A3
AC00..D7A3; Hangul Syllables
HANGUL SYLLABLES	AC00-D7A3

Analysis: The Hangul Syllables block has always been defined
this way, with the exact range of the 11,172 syllables.
In 10646, the collections for Hangul and for the CJK ideographs
are defined with the same ranges as the corresponding blocks.
The CJK Unified Ideograph collections corresponding to the
blocks are open collections, while the Hangul Syllables
collection is a *fixed* collection.
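
The exact range can be verified arithmetically. The sketch below
assumes the standard composition of the modern syllables (19 leading
consonants, 21 vowels, and 28 trailing positions counting "no trailing
consonant"), figures which are not stated elsewhere in this document.

```python
LEADING = 19    # leading consonants
VOWELS = 21     # vowels
TRAILING = 28   # trailing consonants, including "none"

count = LEADING * VOWELS * TRAILING
first = 0xAC00
last = first + count - 1

print(count, f"{last:X}")
# Code points that extending the block to an F boundary would add.
padding = (last | 0xF) - last
print(padding)
```

The count is 11,172 and the last syllable falls at D7A3, so ending the
block on an F boundary (D7AF) would add 12 code points that contain no
syllables.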

Suggested resolution: No action. The ramification of trying
to extend the Hangul Syllables block in 10646 is that it
will make a mismatch between the collection definition (which
must stay fixed) and the block definition. It is not worth
going there.

