L2/14-155

Title:  Defining "Block"
Author: Ken Whistler
Date:   July 18, 2014
Status: For consideration by UTC

Background

Recently the editorial committee ended up wrestling with the problem
of updating the online glossary entry for "block", in response to
some comments that came up on the discussion lists. In the process
it became clear to me that we had a problem of definition here,
because there was a lot of tugging back and forth about the
glossary entry and how it should be worded. It seemed to me that
the problem stemmed partly from trying to stack the need for a clear
definition on top of additional information about process concerns
(how the committees define new blocks, for example), stability
concerns (whether block boundaries can change, for example),
and information *about* blocks (whether they can overlap, what
they can contain, etc.).

In this document I propose that the UTC address this issue first
by adding a formal definition of "block" to the standard, and
then by adding various other clarifications to the other documents,
data files, and/or parts of the core specification which mention blocks.
At that point the editors will be on better ground to figure out how best
to update the online glossary entry to make it consistent with
those updates.

Definition of "Block"

First, I suggest that we add a formal definition of "block" to
the definitions in Chapter 3, Conformance, of the core specification.
Perhaps the best place to insert this new definition would 
be right after the definition of "Code point". I think the best
policy here will be to keep the *definition* per se simple, but
then to spell out specifically all the additional required information
about blocks in the Unicode Standard as explicit bullet points
following the definition. That keeps the definition itself
from taking the unwieldy form "D which is X and has Y and
also has Z and which ..." So I suggest:

=================================================================

D10b Block: A named range of code points.

* The exact list of blocks defined for each version of the Unicode
Standard is specified by the data file Blocks.txt in the Unicode
Character Database.

* The range for each defined block is specified by field 0 in
Blocks.txt. For example: "0000..007F"

* The ranges for blocks are non-overlapping. In other words, no
code point can be contained in each of the ranges for two distinct
blocks.

* The range for each block is defined as a contiguous sequence.
In other words, a block cannot consist of two (or more) discontiguous
sequences of code points.

* Each range for a defined block starts with a value for which
cp MOD 16 = 0 and terminates with a (larger) value for which
cp MOD 16 = 15. This specification results in block ranges which
always include full code point columns for code chart display.
A block never starts or terminates in mid-column.

* All assigned characters are contained within the ranges for
defined blocks.

* Blocks may contain reserved code points. However, no block contains
<i>only</i> reserved code points. The majority of reserved code
points are outside the ranges of defined blocks.

* A few designated code points are not contained within the ranges
for defined blocks. This applies to the noncharacter code points
at the last two code points of supplementary planes 1 through 14.

* The name for each defined block is specified by field 1 in
Blocks.txt. For example: "Basic Latin"

* The names for defined blocks constitute a unique namespace.

* The uniqueness rule for the block namespace is LM3, as defined
in UAX #44, Unicode Character Database. In other words, when
matching strings for block names, casing, white space, hyphens,
and underscores are ignored. The string "BASIC LATIN" or
"Basic_Latin" would be considered as matching the name for
the block named "Basic Latin".

* There is also a normative Block <i>property</i>. See Table
3-2, Normative Character Properties. The Block property is
a catalog property whose value is a string that identifies
a block.

* Property value aliases for the Block
property are defined in PropertyValueAliases.txt in the Unicode
Character Database. The long alias defined for the Block property is 
always a loose match for the name of the block defined in Blocks.txt.
Additional short aliases and other aliases are provided for
convenience of use in regular expression syntax.

* The default value for the Block property is "No_Block". This
default applies to any code point which is not contained in the
range of a defined block.

* For a general discussion of blocks and their relation to
allocation in the Unicode Standard, see "Allocation Areas
and Character Blocks" in Section 2.8, Unicode Allocation.

* For a general discussion of the use of blocks in the presentation
of the Unicode code charts, see Chapter 24, About the Code Charts.

=================================================================

I think that basically covers all that would reasonably be considered
truly definitional and required by the specification per se.
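
As an informal illustration (not part of the proposed text), the
constraints in the bullet points above can be checked mechanically.
The following is a minimal sketch, assuming input in the Blocks.txt
format. The sample data is an abbreviated excerpt, and the function
names parse_blocks, loose_key, check_blocks, and block_of are
hypothetical, introduced here only for illustration. The loose
matching shown is a simplification of LM3 that handles only case,
white space, hyphens, and underscores.

```python
import re

# Abbreviated sample in the Blocks.txt format ("start..end; name").
# The real data file in the Unicode Character Database lists every block.
SAMPLE_BLOCKS_TXT = """\
# Illustrative excerpt only, not the full data file
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0370..03FF; Greek and Coptic
"""


def parse_blocks(text):
    """Return a list of (start, end, name) tuples from Blocks.txt-style text."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        range_field, name = (field.strip() for field in line.split(";", 1))
        start, end = (int(cp, 16) for cp in range_field.split(".."))
        blocks.append((start, end, name))
    return blocks


def loose_key(name):
    """Simplified loose-matching key: ignore case, white space, hyphens, underscores."""
    return re.sub(r"[\s_-]+", "", name).lower()


def check_blocks(blocks):
    """Return a list of violations of the range and name constraints."""
    problems = []
    seen_names = set()
    previous_end = -1
    for start, end, name in sorted(blocks):
        # Each range starts at cp MOD 16 = 0 and ends at a larger cp MOD 16 = 15.
        if start % 16 != 0 or end % 16 != 15 or end <= start:
            problems.append(f"{name}: range is not column-aligned")
        # Ranges for distinct blocks must not overlap.
        if start <= previous_end:
            problems.append(f"{name}: range overlaps an earlier block")
        # Block names must be unique under loose matching.
        if loose_key(name) in seen_names:
            problems.append(f"{name}: name is not unique under loose matching")
        seen_names.add(loose_key(name))
        previous_end = max(previous_end, end)
    return problems


def block_of(cp, blocks):
    """Return the block name for a code point, or the default "No_Block"."""
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return "No_Block"


blocks = parse_blocks(SAMPLE_BLOCKS_TXT)
print(check_blocks(blocks))
print(block_of(0x0041, blocks))
print(block_of(0x1FFFE, blocks))
```

Contiguity needs no separate check in this sketch, because each
entry is represented as a single start..end range. The last call
looks up U+1FFFE, one of the noncharacter code points that falls
outside every defined block, so the default value applies.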

Other Adjustments

The discussion in Section 2.8 already covers the issue of what
kinds of characters end up allocated in blocks -- i.e. that
a block doesn't necessarily contain only a single script, that
a script may be spread across several blocks, and so forth. That
kind of information doesn't need to be duplicated in the
*definition* section. But it should be reviewed for
consistency with the new definition and bullet points.
And it probably makes sense to spend a little time there
explaining the use of (proposed) blocks in the Roadmap
as part of the planning process for eventual allocation.

The discussion in Chapter 24 can be enhanced a bit to talk
about the function of the blocks in laying out charts. There is
already some information there, but it could be expanded a little
to explain some of the tweaks involved in the use of blocks for
chart layout by the Unibook tool.

It might make sense to add in Appendix C a short note about
the synchronization between blocks as defined in the Unicode
Standard and the blocks defined in Annex A of 10646.

Once we come to an agreement about the exact text for
the definition in Chapter 3, there would be some additional minor
work advisable for the comments section in Blocks.txt and for
the discussion of blocks in UAX #44, to ensure that the wording
is consistent as of Unicode 8.0 and includes the appropriate
back pointers to the new definition in Chapter 3. UAX #44 is
probably the appropriate vehicle for mentioning some of the
discontinuities in block handling, such as the version of
the standard where we officially regularized all the block
ranges to end on column boundaries.

At that point, we could finally go back and do the necessary
editorial updates to the FAQ and glossary pages, to make sure
they match the content in the standard and explain it correctly.

And that would also be the appropriate time to consider whether
anything regarding the current handling of blocks by the
UTC should be addressed by further stability policy guarantees,
or whether the status quo is appropriate.