L2/14-155

Title: Defining "Block"
Author: Ken Whistler
Date: July 18, 2014
Status: For consideration by UTC

Background

Recently the editorial committee ended up wrestling with the problem of updating the online glossary entry for "block", in response to some comments that came up on the discussion lists. In the process it became clear to me that we had a problem of definition here, because there was a lot of tugging back and forth about the glossary entry and how it should be worded.

It seemed to me that the problem stemmed partly from trying to stack the need for a clear definition on top of additional information about process concerns (how the committees define new blocks, for example), stability concerns (whether block boundaries can change, for example), and information *about* blocks (whether they can overlap, what they can contain, etc.).

In this document I propose that the UTC address this issue first by adding a formal definition of "block" to the standard, and then by adding various other clarifications to the other documents, data files, and/or parts of the core specification which mention blocks. At that point the editors will be on better ground to figure out how best to update the online glossary entry to make it consistent with those updates.

Definition of "Block"

First, I suggest that we add a formal definition of "block" to the definitions in Chapter 3, Conformance, of the core specification. Perhaps the best place to insert this new definition would be right after the definition of "Code point".

I think the best policy here will be to keep the *definition* per se simple, but then to spell out specifically all the additional required information about blocks in the Unicode Standard as explicit bullet points following the definition. That keeps the definition itself from taking the unwieldy form "D which is X and has Y and also has Z and which ..."
So I suggest:

=================================================================

D10b Block: A named range of code points

* The exact list of blocks defined for each version of the Unicode Standard is specified by the data file Blocks.txt in the Unicode Character Database.

* The range for each defined block is specified by field 0 in Blocks.txt. For example: "0000..007F"

* The ranges for blocks are non-overlapping. In other words, no code point can be contained in the ranges of two distinct blocks.

* The range for each block is defined as a contiguous sequence. In other words, a block cannot consist of two (or more) discontiguous sequences of code points.

* Each range for a defined block starts with a value for which cp MOD 16 = 0 and terminates with a (larger) value for which cp MOD 16 = 15. This specification results in block ranges which always include full code point columns for code chart display. A block never starts or terminates in mid-column.

* All assigned characters are contained within the ranges for defined blocks.

* Blocks may contain reserved code points. However, no block contains only reserved code points. The majority of reserved code points are outside the ranges of defined blocks.

* A few designated code points are not contained within the ranges for defined blocks. This applies to the noncharacters at the last two code points of each of supplementary planes 1 through 14.

* The name for each defined block is specified by field 1 in Blocks.txt. For example: "Basic Latin"

* The names for defined blocks constitute a unique namespace.

* The uniqueness rule for the block namespace is LM3, as defined in UAX #44, Unicode Character Database. In other words, when matching strings for block names, casing, white space, hyphens, and underscores are ignored. The string "BASIC LATIN" or "Basic_Latin" would be considered as matching the name for the block named "Basic Latin".

* There is also a normative Block property. See Table 3-2, Normative Character Properties. The Block property is a catalog property whose value is a string that identifies a block.

* Property value aliases for the Block property are defined in PropertyValueAliases.txt in the Unicode Character Database. The long alias defined for the Block property is always a loose match for the name of the block defined in Blocks.txt. Additional short aliases and other aliases are provided for convenience of use in regular expression syntax.

* The default value for the Block property is "No_Block". This default applies to any code point which is not contained in the range of a defined block.

* For a general discussion of blocks and their relation to allocation in the Unicode Standard, see "Allocation Areas and Character Blocks" in Section 2.8, Unicode Allocation.

* For a general discussion of the use of blocks in the presentation of the Unicode code charts, see Chapter 24, About the Code Charts.

=================================================================

I think that basically covers all that would reasonably be considered truly definitional and required by the specification per se.

Other Adjustments

The discussion in Section 2.8 already covers the issue of what kinds of characters end up allocated in blocks -- i.e. that a block doesn't necessarily contain only a single script, that a script may be spread across several blocks, and so forth. That kind of information doesn't need to be duplicated in the *definition* section. But it should be reviewed for consistency with the new definition and bullet points. And it probably makes sense to spend a little time there explaining the use of (proposed) blocks in the Roadmap as part of the planning process for eventual allocation.

The discussion in Chapter 24 can be enhanced a bit to talk about the function of the blocks in laying out charts.
There is already some information there, but it could be expanded a little to explain some of the tweaks involved in the use of blocks for chart layout by the Unibook tool.

It might make sense to add in Appendix C a short note about the synchronization between blocks as defined in the Unicode Standard and the blocks defined in Annex A of 10646.

Once we come to an agreement about the exact text for the definition in Chapter 3, there would be some additional minor work advisable for the comments section in Blocks.txt and for the discussion of blocks in UAX #44, to ensure that the wording is consistent as of Unicode 8.0 and includes the appropriate back pointers to the new definition in Chapter 3. UAX #44 is probably the appropriate vehicle for mentioning some of the discontinuities in block handling, such as the version of the standard where we officially regularized all the block ranges to end on column boundaries.

At that point, we could finally go back and do the necessary editorial updates to the FAQ and glossary pages, to make sure they match the content in the standard and explain it correctly. And that would also be the appropriate time to consider whether anything regarding the current handling of blocks by the UTC should be addressed by further stability policy guarantees, or whether the status quo is appropriate.
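
Illustration (not part of the proposed text)

The bullet points following the proposed definition D10b are concrete enough to be checked mechanically, which may help when reviewing the wording. The Python sketch below is only an illustration under stated assumptions: the helper names (parse_blocks, loose_key, check_blocks, block_of) and the embedded SAMPLE excerpt are hypothetical and do not appear in any Unicode data file or tool, and loose_key only approximates loose matching as summarized in the bullet above (ignoring case, white space, hyphens, and underscores). Contiguity needs no separate check here, because each block is represented as a single start..end range.

```python
import re
from bisect import bisect_right

# Illustrative excerpt in the Blocks.txt format (field 0 = range, field 1 = name).
# This is NOT the full data file; it only holds three blocks for the example.
SAMPLE = """\
# illustrative excerpt
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
"""

def parse_blocks(text):
    """Return a list of (start, end, name) tuples sorted by start."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        rng, name = (field.strip() for field in line.split(";", 1))
        start, end = (int(h, 16) for h in rng.split(".."))
        blocks.append((start, end, name))
    return sorted(blocks)

def loose_key(name):
    """Approximate loose matching: ignore case, white space, hyphens, underscores."""
    return re.sub(r"[\s\-_]", "", name).lower()

def check_blocks(blocks):
    """Return a list of violations of the constraints listed under D10b."""
    problems = []
    seen = {}
    prev_end = -1
    for start, end, name in blocks:  # assumes blocks are sorted by start
        # Ranges must cover full 16-code-point columns.
        if start % 16 != 0 or end % 16 != 15 or end <= start:
            problems.append(f"{name}: range not aligned to full columns")
        # Ranges must not overlap.
        if start <= prev_end:
            problems.append(f"{name}: overlaps an earlier block")
        # Names must be unique under loose matching.
        key = loose_key(name)
        if key in seen:
            problems.append(f"{name}: name collides with {seen[key]}")
        seen[key] = name
        prev_end = max(prev_end, end)
    return problems

def block_of(cp, blocks):
    """Return the block name for a code point, or the default "No_Block"."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, cp) - 1
    if i >= 0 and blocks[i][0] <= cp <= blocks[i][1]:
        return blocks[i][2]
    return "No_Block"

if __name__ == "__main__":
    blocks = parse_blocks(SAMPLE)
    print(check_blocks(blocks))
    print(block_of(0x0041, blocks))
    print(block_of(0x1FFFE, blocks))
```

Note that with the three-block excerpt above, any code point past U+017F reports "No_Block" simply because the excerpt stops there; run against a complete Blocks.txt, that default would apply only to code points outside every defined block, such as the noncharacters at the end of planes 1 through 14.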