Re: holes (unassigned code points) in the code charts

From: Asmus Freytag
Date: Fri, 04 Jan 2013 03:43:05 -0800

On 1/4/2013 2:36 AM, Stephan Stiller wrote:
> All,
> There are plenty of unassigned code points within blocks that are in
> use; these often come at the end of a block but there are plenty of
> holes as well.
> I have a cluster of interrelated questions:
> 1. What sorts of reasons are there (or have there been) for leaving
> holes? Code page conversion and changes to casing by simple
> arithmetic? What else?

There are a number of reasons why a code chart may not be contiguous
besides the reasons you give. Sometimes a character gets removed from
the draft at the last minute; in those cases, a hole may be left. In
general, the possible reasons for leaving a hole cannot be enumerated
in a fixed list. It's more of a case-by-case thing.
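
As an illustration of the casing-arithmetic reason raised in the question,
here is a minimal sketch in Python (a choice made for this example, not
something from the thread). It assumes the Unicode data bundled with the
interpreter; the helper name casing_exceptions is made up. In the Greek
block, lowercase and uppercase letters sit 0x20 apart, and the hole at
U+03A2 falls where an uppercase counterpart of final sigma (U+03C2) would
otherwise go.

```python
import unicodedata

OFFSET = 0x20  # distance between Greek lowercase and uppercase letters


def casing_exceptions(first=0x03B1, last=0x03C9):
    """Return lowercase code points whose uppercase is not cp - OFFSET."""
    return [cp for cp in range(first, last + 1)
            if chr(cp).upper() != chr(cp - OFFSET)]


exceptions = casing_exceptions()
for cp in exceptions:
    slot = cp - OFFSET  # where "simple arithmetic" would put the uppercase
    print(f"U+{cp:04X}: slot U+{slot:04X} has category "
          f"{unicodedata.category(chr(slot))}")
```

The only exception in that range should be final sigma, whose uppercase is
the ordinary capital sigma at U+03A3, so the slot at U+03A2 stays empty.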
> 1.1 The rationale for particular holes is not documented in the code
> charts I looked at; is there documentation? (Yes, in some instances
> the answer can be guessed.)

In general, no. Sometimes there's an explanation in the text.
> 1.2 How is the number of holes determined? It seems like multiples of
> 16 are used for block sizes merely for practical reasons.
Blocks start on a value ending in "0" and end on a value ending in "F"
in hexadecimal notation, so block sizes are multiples of 16.
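
The alignment of block boundaries can be checked mechanically. The sketch
below is illustrative only: the three block ranges are copied by hand from
the code charts, the helper name is_aligned is made up for this example,
and it also checks that blocks start on a value ending in "0".

```python
# A few block ranges copied by hand from the code charts (illustrative subset).
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Greek and Coptic": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
}


def is_aligned(start, end):
    """True if the block starts on xxx0 and ends on xxxF (hexadecimal)."""
    return start % 16 == 0 and end % 16 == 15


for name, (start, end) in BLOCKS.items():
    size = end - start + 1
    print(f"{name}: {start:04X}..{end:04X}, size {size}, "
          f"aligned={is_aligned(start, end)}")
```

A block that is aligned this way always has a size that is a multiple of
16, regardless of how many of its code points are actually assigned.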
> 2. I notice that ranges are often used to describe where scripts are
> found. Do holes have properties? Are there other block-related policies
> that give holes a certain semantics?

Some properties have default values that apply to unassigned code
points, so that an algorithm can "do its best" with as-yet-unassigned
characters: if a new character is later encoded, the algorithm doesn't
necessarily have to be reimplemented and still gives good results.

There's no distinction between "holes" and other unassigned characters.
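
One way to see this in practice, sketched in Python with the standard
unicodedata module (an assumption of this example; results depend on the
Unicode version bundled with the interpreter): an unassigned code point
reports the general category Cn whether it is a hole inside a block, such
as U+03A2, or lies outside any block, such as U+2FE0. The helper names
is_unassigned and holes are made up for illustration.

```python
import unicodedata


def is_unassigned(cp):
    """True if the code point has general category Cn.

    Note: Cn also covers noncharacters, not only reserved code points.
    """
    return unicodedata.category(chr(cp)) == "Cn"


def holes(start, end):
    """List Cn code points in the inclusive range start..end."""
    return [cp for cp in range(start, end + 1) if is_unassigned(cp)]


# Unassigned code points inside the Greek and Coptic block.
print([f"U+{cp:04X}" for cp in holes(0x0370, 0x03FF)])

# A hole inside a block and a code point outside any block look the same.
print(is_unassigned(0x03A2), is_unassigned(0x2FE0))
```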
> 2.1 If not, how likely is it that Unicode assigns script-external
> characters to holes?

It's generally not desirable, but there's no firm policy that blocks
must have a single script value (and in fact, existing blocks do not
all follow such a restriction).
> 2.2 If yes, how does the number of assigned code points differ, if
> holes that are assumed to be filled only by certain types of
> characters are counted?

> 2.2.1 Would this make much of a difference wrt the question (this
> comes up from time to time it seems) of how much of Unicode will
> eventually fill up?

If strong technical reasons exist for placing a character into the BMP,
there will be temptation to fill a "hole" if the BMP is otherwise full.
Likewise, many, many years (decades) from now, similar pressure might
exist should the rest of the code space become filled.

However, the most likely scenario is that Unicode will continue for an
indefinite period with sufficient "open" space (and the occasional hole).
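
For a rough sense of how much room is left, the following sketch counts
the code points in the BMP that Python's unicodedata module reports as Cn.
This is an approximation under stated assumptions: the count depends on
the Unicode version bundled with the interpreter, it includes the
noncharacters (which will never be assigned), and the function name
count_unassigned is made up for this example.

```python
import unicodedata


def count_unassigned(start, end):
    """Count code points with general category Cn in start..end inclusive."""
    return sum(1 for cp in range(start, end + 1)
               if unicodedata.category(chr(cp)) == "Cn")


bmp_total = 0x10000
bmp_unassigned = count_unassigned(0x0000, 0xFFFF)
print(f"Unicode {unicodedata.unidata_version}: "
      f"{bmp_unassigned} of {bmp_total} BMP code points are Cn")
```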
> 3. Have there been "mistakes" wrt hole assignment?

Unicode doesn't make mistakes. :)

> Stephan
Received on Fri Jan 04 2013 - 05:48:41 CST
