Re: holes (unassigned code points) in the code charts

From: Philippe Verdy
Date: Fri, 4 Jan 2013 15:15:16 +0100

2013/1/4 Asmus Freytag:
> On 1/4/2013 2:36 AM, Stephan Stiller wrote:
>> All,
>> There are plenty of unassigned code points within blocks that are in use;
>> these often come at the end of a block but there are plenty of holes as
>> well.
>> I have a cluster of interrelated questions:
>> 1. What sorts of reasons are there (or have there been) for leaving holes?
>> Code page conversion and changes to casing by simple arithmetic? What else?
> There are a number of reasons why a code chart may not be contiguous besides
> the reason you give. Sometimes a character gets removed from the draft at
> the last minute; in those cases, a hole may be left. In general, the possible
> reasons for leaving a hole cannot be enumerated in a fixed list. It's more
> of a case-by-case thing.

And sometimes a hole is left pending a further decision: the position
remains reserved as long as the proposed character has not been
formally rejected. Other holes come from simple mappings from legacy
encodings, kept just to preserve the relative order: the position was
not allocated because the legacy encoding referenced a character
already encoded elsewhere.

These holes, initially kept to preserve compatibility with simple
mappings of legacy encodings and with some fonts, may stay empty for a
long time (even though the font assignments are normally invalid: this
is the case in the block of Wingdings symbols). For normal scripts
(alphabets, abjads, alphasyllabaries, sinograms, ideographs), they may
be allocated later to completely unrelated new characters of the same
script, as long as there is evidence that the script will likely
gain more historic characters in the future. This is the case for
Latin, Arabic, Cyrillic, and many Indic scripts, and for blocks
containing punctuation, mathematical symbols, and pictograms such as
emoji or game symbols like playing cards.

As long as a single proposal can fit in existing holes of existing
blocks, no new block would be allocated. But if the proposal contains
more characters than can fit in a hole, a new block will be
allocated to hold them all at once. This allows new fonts to be added
that support all of them together, without having to update many fonts
for full coverage of the accepted proposal, which simplifies
implementation, deployment and usage. Many proposals consist of just
a single character or very few: slowly they will fill the holes left
in blocks by prior assignments.

I think that the rationale is to allow grouping together characters
that will be used together and in the same fonts (notably if there are
contextual substitution rules or ligatures).

Just look at the history of Unicode versions in the Extended Latin
blocks, and you'll find these later allocations filling holes left by
prior assignments. The roadmap also reveals some information about the
estimated number of characters for which there are pending proposals.
Very often these proposals reference such holes, but they will not be
concluded for a long time, and they must avoid colliding with each
other by competing for the same positions after the initial encoding
steps have been passed but not finalized, or before a proposal is
finally abandoned in favor of a newer, more complete one. Many
proposals will take months or years to be completed, even if their
blocks are already accepted and already encode a small part of the
needed characters.
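
To see such holes for yourself, here is a minimal Python sketch. It
assumes only the standard unicodedata module; the helper name
unassigned_in_range is made up for this example, and the block range
used (Latin Extended-D, U+A720..U+A7FF) is just one of the Extended
Latin blocks, picked for illustration. The result depends on the
Unicode version bundled with your Python, so running it under
different Python releases should show holes being filled over time.

```python
import unicodedata


def unassigned_in_range(first, last):
    """Return the code points in [first, last] that have General_Category Cn.

    Hypothetical helper for illustration. Cn covers unassigned (reserved)
    code points as well as noncharacters; there are no noncharacters in
    the ranges used here.
    """
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Cn"]


# Latin Extended-D block: U+A720..U+A7FF
holes = unassigned_in_range(0xA720, 0xA7FF)
print("Unicode data version:", unicodedata.unidata_version)
print("Holes:", " ".join(f"U+{cp:04X}" for cp in holes))
```

The same function works for any other range, for example
unassigned_in_range(0x0370, 0x03FF) for the Greek and Coptic block.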
Received on Fri Jan 04 2013 - 08:17:28 CST
