From: Kenneth Whistler (firstname.lastname@example.org)
Date: Mon May 07 2007 - 18:46:16 CDT
> > 1) Is "character range" or "character block" the preferred term now?
> In Unicode, a block is a named entity associated with a range of
> characters that is an integral multiple of 16.
> That should provide the relation between these two terms. A 256
> character range inside the Unified CJK Ideographs block, for example, is
> not a block. (In 10646 it's called a 'row', if aligned on even 256
> boundaries, but that's not a widely understood term out of context).
Refining a little bit on Asmus' definitions:
A Unicode block is a named entity associated with a range of *code points*
that is an integral multiple of 16.
You need to specify it that way, because a Unicode block can and often
does contain unassigned (= reserved) code points, and may, in some
instances, even contain noncharacters.
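As a rough illustration of that definition, the sketch below checks a few block ranges for 16-alignment and lists the code points in each that are not assigned characters. The helper names are made up for this example, the four ranges are a small hand-copied sample rather than the full block list, and the General_Category value Cn reported by Python's unicodedata module covers both reserved code points and noncharacters.

```python
import unicodedata

# Hand-copied sample of (start, end, name); not the complete block list.
SAMPLE_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x1200, 0x137F, "Ethiopic"),
    (0x1380, 0x139F, "Ethiopic Supplement"),
    (0xFFF0, 0xFFFF, "Specials"),
]

def is_column_aligned(start, end):
    """True if the range starts on a multiple of 16 and its size is a multiple of 16."""
    return start % 16 == 0 and (end - start + 1) % 16 == 0

def unassigned_in(start, end):
    """Code points in [start, end] whose General_Category is Cn
    (reserved code points and noncharacters alike)."""
    return [cp for cp in range(start, end + 1)
            if unicodedata.category(chr(cp)) == "Cn"]

for start, end, name in SAMPLE_BLOCKS:
    size = end - start + 1
    print(f"{name}: {start:04X}..{end:04X}, {size} code points, "
          f"aligned={is_column_aligned(start, end)}, "
          f"Cn count={len(unassigned_in(start, end))}")
```

The Cn counts depend on the version of the Unicode Character Database bundled with the Python build, so they are printed rather than asserted here.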
The exact list of blocks is specified normatively in the UCD file,
Blocks.txt. (Or you can see a comparable listing in Annex A of
Another way of thinking about it is that a block is a named entity
consisting of a contiguous range of columns, where a column is:
Column: a range of 16 code points XXX0..XXXF
"Column" isn't a normative term in either 10646 or the Unicode
Standard, but is still a useful concept because it is so visible
in the code charts.
In the 10646 context, the following terms are also commonly used (these
are my definitions, not normative definitions in the standard):
Row: a range of 256 code points XX00..XXFF
Plane: a range of 64K code points X0000..XFFFF
For comparison, here are the normative 10646 definitions:
Row: A subdivision of a plane; of 256 cells.
Plane: A subdivision of a group; of 256 x 256 cells.
The Unicode Standard has adopted the term "plane" but
doesn't make any regular use of the "row" term.
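The informal column, row, and plane definitions above reduce to simple bit arithmetic on the code point value. The following is only a sketch of that arithmetic; the function names are invented for illustration and are not taken from either standard.

```python
def containing_range(cp, bits):
    """Return (first, last) of the 2**bits-aligned range that contains cp."""
    mask = (1 << bits) - 1
    first = cp & ~mask
    return first, first | mask

def column_range(cp):
    """16 code points, XXX0..XXXF."""
    return containing_range(cp, 4)

def row_range(cp):
    """256 code points, XX00..XXFF."""
    return containing_range(cp, 8)

def plane_range(cp):
    """64K code points, X0000..XFFFF."""
    return containing_range(cp, 16)

def plane_number(cp):
    """Plane index: 0 for the BMP, 1 for the first supplementary plane, and so on."""
    return cp >> 16

for cp in (0x1234, 0x1F600):
    for label, fn in (("column", column_range), ("row", row_range), ("plane", plane_range)):
        first, last = fn(cp)
        print(f"U+{cp:04X} {label}: {first:04X}..{last:04X}")
```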
On the other hand, the Unicode Standard makes use of the term "range"
in its normal mathematical sense, and it can be used to specify any
ad hoc listing of code points with a start and a stop point.
For example, it is perfectly o.k. to talk about a character
range, U+FFFE..U+10001, even though that particular range happens
to span a column break, a row break, and a plane break, and also
incorporates characters (and noncharacters) from two different blocks.
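The claims about U+FFFE..U+10001 can be checked mechanically. This is a minimal sketch with hypothetical helper names. It assumes only the two block ranges it lists (Specials and Linear B Syllabary) and uses the usual noncharacter test: U+FDD0..U+FDEF plus the last two code points of every plane.

```python
# Only the two blocks this range touches; not the complete block list.
SAMPLE_BLOCKS = [
    (0xFFF0, 0xFFFF, "Specials"),
    (0x10000, 0x1007F, "Linear B Syllabary"),
]

def crosses(start, end, bits):
    """True if start and end fall in different 2**bits-aligned units."""
    return (start >> bits) != (end >> bits)

def is_noncharacter(cp):
    """U+FDD0..U+FDEF, or a code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def blocks_touched(start, end):
    """Names of the sample blocks that overlap [start, end]."""
    return [name for lo, hi, name in SAMPLE_BLOCKS if lo <= end and start <= hi]

START, END = 0xFFFE, 0x10001

print("column break:", crosses(START, END, 4))
print("row break:   ", crosses(START, END, 8))
print("plane break: ", crosses(START, END, 16))
print("blocks:      ", blocks_touched(START, END))
print("noncharacters:",
      [f"U+{cp:04X}" for cp in range(START, END + 1) if is_noncharacter(cp)])
```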
One of the reasons why the Unicode Standard has generally moved away
from talking too much about "Unicode character blocks", despite their
normative status in the standard, is that they do not correlate
well with script identity. There are a number of instances where
a script is split across more than one block (Latin, Cyrillic, etc.),
and there are instances where more than one script is contained within
a single block (Greek and Coptic).
People unfamiliar with the standard are likely to expect, if
one talks about "the Ethiopic block", for example, that:
A. It will contain all the Ethiopic characters.
B. It will be a "block" in the sense Doug talked about, i.e.
a "code page"-like chunk of 256 characters 00..FF (or a
"row" in 10646 parlance).
C. It contains no characters used by other scripts.
C happens to be true in this case, but A and B are not, because
there are also Ethiopic characters in another supplemental block,
and because the range of the Ethiopic block is 1200..137F.
Interestingly, because the Ethiopic Supplement block was added
contiguous to the Ethiopic block, the range of Ethiopic characters
is a contiguous range, 1200..139F, even though that spans two blocks.
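The arithmetic behind the Ethiopic example is easy to verify. The sketch below considers only the two blocks named above, with ranges as given in the text; the function names are made up for illustration.

```python
ETHIOPIC = (0x1200, 0x137F)             # range quoted above
ETHIOPIC_SUPPLEMENT = (0x1380, 0x139F)  # immediately follows the Ethiopic block

def size(block):
    """Number of code points in an inclusive (start, end) range."""
    start, end = block
    return end - start + 1

def merge_if_contiguous(first, second):
    """Return the combined range if second starts right after first ends, else None."""
    if first[1] + 1 == second[0]:
        return first[0], second[1]
    return None

# Expectation B fails: the block is 384 code points (24 columns), not one 256-cell row.
print("Ethiopic block size:", size(ETHIOPIC))
print("columns:", size(ETHIOPIC) // 16)

merged = merge_if_contiguous(ETHIOPIC, ETHIOPIC_SUPPLEMENT)
if merged is not None:
    print(f"combined range: {merged[0]:04X}..{merged[1]:04X}")
```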