Re: Ranges/blocks ; font lookup by range

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon May 07 2007 - 18:46:16 CDT

    > > 1) Is “character range” or “character block” the preferred term now?

    > In Unicode, a block is a named entity associated with a range of
    > characters that is an integral multiple of 16.
    > That should provide the relation between these two terms. A 256
    > character range inside the Unified CJK Ideographs block, for example, is
    > not a block. (In 10646 it's called a 'row', if aligned on even 256
    > boundaries, but that's not a widely understood term out of context).

    Refining a little bit on Asmus' definitions:

    A Unicode block is a named entity associated with a range of *code points*
    that is an integral multiple of 16.

    You need to specify it that way, because a Unicode block can and often
    does contain unassigned (= reserved) code points, and may, in some
    instances, even contain noncharacters.

    The exact list of blocks is specified normatively in the UCD file,
    Blocks.txt. (Or you can see a comparable listing in Annex A of
    10646.)
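
    As a rough illustration of working with that listing, here is a
    minimal sketch that parses text laid out like Blocks.txt
    ("start..end; name" lines, with "#" comments). The SAMPLE string is
    a short excerpt typed in for illustration, and the function names
    are my own assumptions, not part of any Unicode tooling.

```python
import re

# Illustrative excerpt in the "start..end; name" layout of Blocks.txt.
SAMPLE = """\
# Start Code..End Code; Block Name
0000..007F; Basic Latin
1200..137F; Ethiopic
1380..139F; Ethiopic Supplement
"""

LINE = re.compile(r"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+?)\s*$")


def parse_blocks(text):
    """Return a list of (first, last, name) tuples from Blocks.txt-style text."""
    blocks = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        blocks.append((int(match[1], 16), int(match[2], 16), match[3]))
    return blocks


def block_of(cp, blocks):
    """Return the block name for a code point, or "No_Block" if none matches."""
    for first, last, name in blocks:
        if first <= cp <= last:
            return name
    return "No_Block"
```

    With the sample above, block_of(0x1380, parse_blocks(SAMPLE)) gives
    "Ethiopic Supplement", and a code point outside every listed range
    gives "No_Block".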

    Another way of thinking about it is that a block is a named entity
    consisting of a contiguous range of columns, where a column is
    defined as:

    Column: a range of 16 code points XXX0..XXXF

    "Column" isn't a normative term in either 10646 or the Unicode
    Standard, but is still a useful concept because it is so visible
    in the code charts.
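
    The column idea reduces to simple arithmetic on code points. The
    sketch below (helper names are hypothetical, chosen just for this
    example) finds the column containing a code point and checks
    whether a range starts and ends on column boundaries, which is the
    shape any block must have.

```python
def column_of(cp: int) -> range:
    """Return the 16 code points XXX0..XXXF that contain cp."""
    start = cp & ~0xF  # clear the low four bits
    return range(start, start + 16)


def is_whole_columns(first: int, last: int) -> bool:
    """True if first..last covers one or more complete columns."""
    return first <= last and first % 16 == 0 and (last + 1) % 16 == 0
```

    For instance, is_whole_columns(0x1200, 0x137F) is true, while an
    ad hoc range such as 0xFFFE..0x10001 fails the check.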

    In the 10646 context, the following terms are also commonly used (these
    are my definitions, not the normative definitions in the standard):

    Row: a range of 256 code points XX00..XXFF

    Plane: a range of 64K code points X0000..XFFFF

    For comparison, here are the normative 10646 definitions:

    Row: A subdivision of a plane; of 256 cells.

    Plane: A subdivision of a group; of 256 x 256 cells.

    The Unicode Standard has adopted the term "plane" but
    doesn't make any regular use of the "row" term.
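
    Using the informal definitions above (rows as XX00..XXFF, planes as
    X0000..XFFFF), the plane, row, and cell of a code point can be read
    off with shifts and masks. This is only a sketch; the function
    names are assumptions made for the example.

```python
def plane_of(cp: int) -> int:
    """Plane number: everything above the low 16 bits."""
    return cp >> 16


def row_of(cp: int) -> int:
    """Row number within the plane: bits 8..15."""
    return (cp >> 8) & 0xFF


def cell_of(cp: int) -> int:
    """Cell number within the row: the low 8 bits."""
    return cp & 0xFF
```

    So U+1234 sits in plane 0, row 0x12, cell 0x34, and U+10001 sits in
    plane 1, row 0, cell 1.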

    On the other hand, the Unicode Standard makes use of the term "range"
    in its normal mathematical sense, and it can be used to specify any
    ad hoc listing of code points with a start and a stop point.
    For example, it is perfectly o.k. to talk about a character
    range, U+FFFE..U+10001, even though that particular range happens
    to span a column break, a row break, and a plane break, and also
    incorporates characters (and noncharacters) from two different blocks.
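
    To make that example concrete, here is a small sketch that reports
    which boundaries a range crosses and which blocks it touches. The
    two-entry BLOCKS table is a hand-copied subset covering only this
    example, and describe_range is a hypothetical helper.

```python
# Hand-copied subset of the block list, just enough for this example.
BLOCKS = [
    (0xFFF0, 0xFFFF, "Specials"),
    (0x10000, 0x1007F, "Linear B Syllabary"),
]


def block_of(cp):
    """Return the block name for cp from the subset above."""
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "No_Block"


def describe_range(first, last):
    """Report column, row, and plane breaks and the blocks a range touches."""
    return {
        "crosses_column": first >> 4 != last >> 4,
        "crosses_row": first >> 8 != last >> 8,
        "crosses_plane": first >> 16 != last >> 16,
        "blocks": sorted({block_of(cp) for cp in range(first, last + 1)}),
    }
```

    Calling describe_range(0xFFFE, 0x10001) reports all three kinds of
    break and lists both the Specials and Linear B Syllabary blocks.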

    One of the reasons why the Unicode Standard has generally moved away
    from talking too much about "Unicode character blocks", despite their
    normative status in the standard, is that they do not correlate
    well with script identity. There are a number of instances where
    a script is split across more than one block (Latin, Cyrillic, etc.),
    and there are instances where more than one script is contained within
    a single block (Greek and Coptic).
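
    One quick way to see the second point is to look at character names
    inside a single block. The sketch below uses the first word of each
    character name as a crude stand-in for script identity; that is an
    assumption made for illustration only (the real Script property
    lives in the UCD file Scripts.txt, which Python's standard library
    does not expose), and the results depend on the Unicode version
    bundled with your Python.

```python
import unicodedata


def name_prefixes(first, last):
    """Collect the first word of every character name in first..last."""
    prefixes = set()
    for cp in range(first, last + 1):
        name = unicodedata.name(chr(cp), "")  # "" for unassigned code points
        if name:
            prefixes.add(name.split()[0])
    return prefixes


# The Greek and Coptic block is 0370..03FF.
print(sorted(name_prefixes(0x0370, 0x03FF)))
```

    Both "GREEK" and "COPTIC" show up in the output for that one block.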

    People unfamiliar with the standard are likely to expect that if
    one talks about "the Ethiopic block", for example, that:

      A. It will contain all the Ethiopic characters.
      B. It will be a "block" in the sense Doug talked about, i.e.
         a "code page"-like chunk of 256 characters 00..FF (or a
         "row" in 10646 parlance).
      C. It will contain no characters used by other scripts.
      
    C happens to be true in this case, but A and B are not, because
    there are also Ethiopic characters in another supplemental block,
    and because the range of the Ethiopic block is 1200..137F.

    Interestingly, because the Ethiopic Supplement block was added
    contiguous to the Ethiopic block, the range of Ethiopic characters
    is a contiguous range, 1200..139F, even though that spans two blocks.
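
    For something like font lookup by range, this means adjacent blocks
    can be collapsed into one lookup range. Here is a minimal sketch of
    that merging step; merge_contiguous is a hypothetical helper, and
    the two ranges are the Ethiopic and Ethiopic Supplement blocks
    mentioned above.

```python
def merge_contiguous(ranges):
    """Merge (first, last) ranges that touch or overlap into single ranges."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            # Adjacent to or overlapping the previous range: extend it.
            prev_first, prev_last = merged[-1]
            merged[-1] = (prev_first, max(prev_last, last))
        else:
            merged.append((first, last))
    return merged


ETHIOPIC_BLOCKS = [(0x1200, 0x137F), (0x1380, 0x139F)]
```

    merge_contiguous(ETHIOPIC_BLOCKS) yields the single range
    1200..139F, whereas ranges separated by a gap stay separate.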

    --Ken


