L2/01-422

Title: Response to L2/01-419 Block Boundary Fixes
Author: Ken Whistler
Date: October 30, 2001

Mark Davis has suggested a number of fixes to Blocks.txt, to eliminate some inconsistencies and to try to establish an invariant that all block boundaries end on an XXXF boundary.

As usual, in all things Unicode-related, there are some worms (I'm not sure whether they should be considered big wriggly earthworms or just nematodes) in this can. So as a response to The Great Innovator (Mark), The Great Disinnovator (me) has assembled the analysis below of *all* anomalies in block names. These fall into 13 distinct types, for each of which I give a separate analysis and a suggested disposition.

In some instances, I think Mark's suggestions are fine, but in other cases, I'd rather we left well-enough alone and abandoned the quest for the invariant. In any case, I'd like to get clear policy statements from the UTC on my suggested resolutions, so that after we deal with these wrigglers this time, we won't ever have to revisit them again.

================================================================

Complete list of discrepancies in blocks between NamesList.txt (used to print the Unicode and 10646 code charts), Blocks.txt, and the listing in Annexes A in 10646-1 and 10646-2, including the FDAM 1 for 10646-1.

The listings below contain triples. The first line is the header entry from NamesList-3.2.0d2.txt, the second line is from Blocks-5d2.txt (both from the Unicode 3.2 BETA directory), and the third line is from Annex A of 10646-1 or 10646-2.

Block-related items which match completely in name (except for casing) and in start and stop ranges are omitted -- except for Hangul Syllables, which is one where the ranges themselves are in contention.

The listings are grouped by type, with my suggested resolution for each type of discrepancy.
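As a rough illustration of the proposed invariant, the following Python sketch parses a few lines in the Blocks.txt "start..end; name" form and reports the blocks whose end does not fall on an XXXF boundary. The sample lines are copied from the Blocks.txt entries quoted in this document; the function names (parse_blocks, ends_off_boundary, report) are hypothetical and not part of any Unicode tool, and the sketch checks only the end of each range, as the invariant is stated above.

```python
import re

# Sample lines in the Blocks.txt "start..end; name" form, copied from the
# listings in this document (not the full data file).
SAMPLE_BLOCKS = """\
0250..02AF; IPA Extensions
3400..4DB5; CJK Unified Ideographs Extension A
AC00..D7A3; Hangul Syllables
FE70..FEFE; Arabic Presentation Forms-B
FEFF..FEFF; Specials
FFF0..FFFD; Specials
F0000..FFFFD; Private Use
"""

BLOCK_LINE = re.compile(r"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+)$")


def parse_blocks(text):
    """Return a list of (start, end, name) tuples, one per data line."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = BLOCK_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        blocks.append((start, end, match.group(3).strip()))
    return blocks


def ends_off_boundary(start, end):
    """True if the block end is not of the form XXXF."""
    return end & 0xF != 0xF


def report(text):
    """List the blocks that would violate the proposed invariant."""
    return [
        f"{start:04X}..{end:04X}; {name}"
        for start, end, name in parse_blocks(text)
        if ends_off_boundary(start, end)
    ]


if __name__ == "__main__":
    for entry in report(SAMPLE_BLOCKS):
        print(entry)
```

Note that a purely mechanical check like this passes FEFF..FEFF, even though that single-code-point block is one of the anomalies discussed below; the invariant alone does not capture every discrepancy.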
================================================================

TYPE 1: 10646 block name contains a parenthetical addition to the name.

0250 IPA Extensions 02AF
0250..02AF; IPA Extensions
IPA (INTERNATIONAL PHONETIC ALPHABET) EXTENSIONS 0250-02AF

3190 Kanbun 319F
3190..319F; Kanbun
KANBUN (CJK miscellaneous) 3190-319F

Analysis: The parenthetical additions are considered annotative in 10646, and otherwise these blocks match exactly in name and ranges.

Suggested resolution: No action.

================================================================

TYPE 2: 10646 block name and ranges differ, for control codes in the 0000..00FF range.

0000 C0 Controls and Basic Latin (Basic Latin) 007F
0000..007F; Basic Latin
BASIC LATIN 0020-007E

0080 C1 Controls and Latin-1 Supplement (Latin-1 Supplement) 00FF
0080..00FF; Latin-1 Supplement
LATIN-1 SUPPLEMENT 00A0-00FF

Analysis: This is a long-standing discrepancy, where 10646 formally excludes the control codes 0000..001F, 007F..009F from the block definitions (and collections), whereas the Unicode Standard has always included those ranges in Blocks.txt and in the code charts. The Unicode names list and Blocks.txt treatment are self-consistent in ranges. The Blocks.txt names match the 10646 block names, although the ranges differ. The parenthetical elements in the NamesList.txt entries are used to generate the correct block names when printing 10646 code charts.

Suggested resolution: No action. Leave this sleeping dog alone.

================================================================

TYPE 3: 10646 block name differs, for a block which is not yet standardized in Unicode.

0500 Cyrillic Supplement 052F
0500..052F; Cyrillic Supplement
CYRILLIC SUPPLEMENTARY 0500-052F

Analysis: This is a result of what is probably an editorial oversight in 10646. The new *collection* is called "CYRILLIC SUPPLEMENT", but the new *block* is called "CYRILLIC SUPPLEMENTARY".

Suggested resolution: Update Blocks.txt and NamesList.txt to match 10646.
Since these additions haven't even been approved yet, this can be considered simply a beta bug fix. Also submit an editorial correction to WG2, suggesting that the 10646 collection name be brought back in synch with the block name.

================================================================

TYPE 4: 10646 block name differs, for a block which is already standardized in Unicode.

E000 Private Use Area F8FF
E000..F8FF; Private Use
PRIVATE USE AREA E000-F8FF

Analysis: This has been out of synch for a while. The names list entry doesn't actually get printed out anywhere, but matches the 10646 block name.

Suggested resolution: Update Blocks.txt to use the same block name as 10646. This will have the advantage of making the normative block name match everybody's abbreviation of it as "PUA".

================================================================

TYPE 5: 10646 has no block; Unicode does, and the names and ranges are problematical in Unicode.

FFF80 Private Use FFFFF
F0000..FFFFD; Private Use
[10646 ---]

10FF80 Private Use 10FFFF
100000..10FFFD; Private Use
[10646 ---]

Analysis: The shortened block ranges in Blocks.txt are the result of leaving noncharacters out of the ranges. But as Mark pointed out, now that we have a range of noncharacters *inside* a block (Arabic Presentation Forms-A), this is inconsistent. The anomalous start ranges for NamesList.txt are intentionally placed there, to trigger the printing of single 8-column code charts that display the two noncharacters in each range, without printing the rest of the entire plane.

Suggested resolution: Coalesce these two ranges into a single range: F0000..10FFFF, and rename it "Private Use Area-A" or perhaps better: "Supplementary Private Use Area".
For consistency between Unicode and 10646, and between 10646-1 and 10646-2, suggest additional text for PDAM 1 to 10646-2, so as to define the Supplementary Private Use Planes (Planes 0F and 10), and to add a collection and a block, consistent with the way the BMP PUA is handled in 10646-1.

Note that if the 10646 editor thinks it will be problematical to try to define a single "block" in 10646 that spans two planes, it would be better for Unicode to just define two blocks: PUA-A and PUA-B, or whatever, so as not to keep propagating the inconsistencies.

================================================================

TYPE 6: 10646 has no block; Unicode does, and the names and ranges are *not* problematical in Unicode.

D800 High Surrogates DB7F
D800..DB7F; High Surrogates
[10646 ---]

DB80 High Private Use Surrogates DBFF
DB80..DBFF; High Private Use Surrogates
[10646 ---]

DC00 Low Surrogates DFFF
DC00..DFFF; Low Surrogates
[10646 ---]

Analysis: This is the result of the somewhat anomalous character model history of Unicode, where surrogate code units and surrogate code points have been conflated. The 10646 situation, with no blocks assigned to surrogate code points, is formally more correct, in my opinion, since the code points are unavailable for encoding of characters. However, the Unicode entries in Blocks.txt are longstanding and probably not problematical. The entries in NamesList.txt are not actually printed in code charts.

Suggested resolution: No action.

================================================================

TYPE 7: Neither 10646 nor Unicode has a block, and anomalous start ranges appear in NamesList.txt.
1FF80 Unassigned 1FFFF
2FF80 Unassigned 2FFFF
3FF80 Unassigned 3FFFF
4FF80 Unassigned 4FFFF
5FF80 Unassigned 5FFFF
6FF80 Unassigned 6FFFF
7FF80 Unassigned 7FFFF
8FF80 Unassigned 8FFFF
9FF80 Unassigned 9FFFF
AFF80 Unassigned AFFFF
BFF80 Unassigned BFFFF
CFF80 Unassigned CFFFF
DFF80 Unassigned DFFFF
EFF80 Unassigned EFFFF

Analysis: These are all deliberate manipulations in NamesList.txt, to enable the printing out of single 8-column charts showing the noncharacters at the end of each plane, without printing out the entire planes.

Suggested resolution: No action.

================================================================

TYPE 8: The block names in 10646 and Unicode match, for characters not yet published in Unicode, but there is a start range inconsistency between Blocks.txt and NamesList.txt.

27D0 Miscellaneous Mathematical Symbols-A 27EF
27C0..27EF; Miscellaneous Mathematical Symbols-A
MISCELLANEOUS MATHEMATICAL SYMBOLS-A 27C0-27EF

Analysis: This was the result of a misunderstanding about where the Miscellaneous Mathematical Symbols-A block started, since the characters were not encoded from the first column of the block, but from the second. This has already been fixed in Blocks.txt for the Unicode 3.2 beta, but has not yet been updated in NamesList.txt.

Suggested resolution: Simply update the start range in NamesList.txt editorially.

================================================================

TYPE 9: The block name and range match in 10646 and Unicode, but the end range is problematical.

FFF0 Specials FFFF
FFF0..FFFD; Specials
SPECIALS FFF0-FFFD

Analysis: This is another example of the non-inclusion of noncharacters in the ranges for blocks. For consistency, the ranges should just be extended. Note that the end range in NamesList.txt was deliberate, to force the printing out of the two noncharacters in the chart.

Suggested resolution: Approve extension of Specials block to FFF0..FFFF.
For 10646, suggest additional text for PDAM 2 to extend the SPECIALS block to FFFF.

================================================================

TYPE 10: Anomalous handling of blocks involving the BOM.

FE70 Arabic Presentation Forms-B FEFF
FE70..FEFE; Arabic Presentation Forms-B
ARABIC PRESENTATION FORMS-B FE70-FEFE

[Namelist: no entry]
FEFF..FEFF; Specials
[10646 ---]

Analysis: The NamesList.txt treatment is deliberate, to force the printing of the BOM on the correct code chart page. The Blocks.txt treatment was also a deliberate change from the earlier situation which had a discontiguous and overlapping definition of the Specials block. Cf. Blocks-1.txt (Unicode 2.0):

FE70; FEFF; Arabic Presentation Forms-B
FF00; FFEF; Halfwidth and Fullwidth Forms
FEFF; FEFF; Specials
FFF0; FFFF; Specials

Those "Specials" blocks were derived from book headers, which accounts for the FEFF, FFF0-FFFF definition. That was deliberately fixed in Blocks-2.txt (Unicode 2.1.9) to:

FE70; FEFE; Arabic Presentation Forms-B
FEFF; FEFF; Specials
FF00; FFEF; Halfwidth and Fullwidth Forms
FFF0; FFFD; Specials

(at the request of Mark, by the way) to avoid blocks overlapping or being out of numerical order in Blocks.txt.

Suggested resolution: Rename the block for BOM to "Specials-BOM" in Blocks.txt. Just live with the existence of this as a block in Unicode but not present in 10646, which talks about U+FEFF as a signature; alternatively, suggest text for PDAM 2 to add a "Specials-BOM" block to 10646.

The alternative, of absorbing FEFF into the Arabic Presentation Forms-B block, has ramifications that are probably worse, since it would not only require changing a block boundary in 10646, but *also* would impact the collection for Arabic Presentation Forms-B, and would raise questions, since unlike the noncharacters, it would be adding an encoded character of completely different type to a long-standing collection in 10646.
If we *must* have the block end on an F, then be prepared to provide all the detailed justification for suggested text in PDAM 2.

================================================================

TYPE 11: Block ranges match in Unicode and 10646, for blocks with generated character names, but NamesList.txt shows a mismatched range.

4E00 CJK Unified Ideographs 9FA5
4E00..9FFF; CJK Unified Ideographs
CJK UNIFIED IDEOGRAPHS 4E00-9FFF

Analysis: The range distinction in NamesList.txt is deliberate, to enable calculation of the cutoff point in the charts, where there are no actual character name entries in NamesList.txt to drive this.

Suggested resolution: No action.

================================================================

TYPE 12: The end block range does not match in Unicode and 10646, for blocks with generated character names.

3400 CJK Unified Ideographs Extension A 4DB5
3400..4DB5; CJK Unified Ideographs Extension A
CJK UNIFIED IDEOGRAPHS EXTENSION A 3400-4DBF

20000 CJK Unified Ideographs Extension B 2A6D6
20000..2A6D6; CJK Unified Ideographs Extension B
CJK UNIFIED IDEOGRAPHS EXTENSION B 20000-2A6DF

Analysis: These two instances are later additions to Blocks.txt (for Unicode 3.0 and Unicode 3.1, respectively), where the end range was figured based on the NamesList.txt treatment and Hangul Syllables, rather than matching the CJK Unified Ideographs block.

Suggested resolution: Update the end range in Blocks.txt to match 10646.

================================================================

TYPE 13: Unicode and 10646 match in every respect, but Mark is suggesting that the blocks should be redefined to end at an F boundary.

AC00 Hangul Syllables D7A3
AC00..D7A3; Hangul Syllables
HANGUL SYLLABLES AC00-D7A3

Analysis: The Hangul Syllables block has always been defined this way, with the exact range of the 11,172 syllables.
The 10646 collections for Hangul and the CJK ideographs are defined identically to the corresponding blocks. However, the CJK Unified Ideograph collections are open collections, while the Hangul Syllables collection is a *fixed* collection.

Suggested resolution: No action. The ramification of trying to extend the Hangul Syllables block in 10646 is that it will make a mismatch between the collection definition (which must stay fixed) and the block definition. It is not worth going there.
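The arithmetic behind the Hangul Syllables case can be checked with a short Python sketch. It counts the code points in AC00..D7A3 and shows what padding the end out to an XXXF boundary would add. The helper names are hypothetical, and the 19 x 21 x 28 cross-check uses the leading-consonant, vowel, and trailing-consonant counts from the Unicode Standard's description of Hangul syllable composition, which this document does not itself state.

```python
HANGUL_FIRST = 0xAC00  # start of the Hangul Syllables block
HANGUL_LAST = 0xD7A3   # end of the block as currently defined


def block_size(first, last):
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


def round_up_to_f(code_point):
    """Smallest code point >= code_point whose last hex digit is F."""
    return code_point | 0xF


syllables = block_size(HANGUL_FIRST, HANGUL_LAST)
padded_last = round_up_to_f(HANGUL_LAST)
extra = padded_last - HANGUL_LAST

# Cross-check (assumption from the Unicode Standard, not from this memo):
# 19 leading consonants x 21 vowels x 28 trailing options (including none).
composed = 19 * 21 * 28

if __name__ == "__main__":
    print(f"code points in block: {syllables}")
    print(f"composition count:    {composed}")
    print(f"padded end: {padded_last:04X} ({extra} extra code points)")
```

Under these assumptions, padding the block to D7AF would add 12 code points that have no counterpart in the fixed 10646 collection, which is the mismatch described above.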