How many possible characters? (was: Re: Names of planes...)

From: Doug Ewell (dewell@compuserve.com)
Date: Wed Jul 12 2000 - 02:01:39 EDT


Asmus Freytag <asmusf@ix.netcom.com> wrote:

> There are 0x10FFFF - 34 possible characters!
>
> All code values ending in 0xFFFE and 0xFFFF do *not* refer to
> characters. They are not just temporarily unassigned, but permanently
> reserved as non-characters.

Right, but we should start with 0x110000, not 0x10FFFF (since U+0000
NULL is a perfectly legitimate character), then subtract 34 (U+??FFFE
and U+??FFFF for each of 17 planes), then subtract another 2,048 for
the surrogate codepoints (U+D800 through U+DFFF). That leaves us with
1,112,030 possible characters. There will be a test next period.
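
For anyone who wants to check the subtraction, here is a minimal sketch in
Python. The constant names are made up for illustration; the figures are the
ones given in the paragraph above.

```python
# Count the code values that can ever be assigned to characters,
# following the arithmetic in the paragraph above.
CODE_SPACE = 0x110000              # U+0000 through U+10FFFF, inclusive
PLANES = 17                        # planes 0 through 16
NONCHARACTERS = 2 * PLANES         # U+??FFFE and U+??FFFF in every plane
SURROGATES = 0xDFFF - 0xD800 + 1   # U+D800 through U+DFFF

possible_characters = CODE_SPACE - NONCHARACTERS - SURROGATES
print(f"{possible_characters:,}")  # prints 1,112,030
```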

Then Robert Lozyniak <11digitboy@bolt.com> wrote:

> Okay, 0x10FFDE different characters. But what of planes 15 and 16?

Planes 15 and 16 are for private-use characters, just like the range
from U+E000 to U+F8FF. These still count as "possible characters."

And then "john" <john@nisus.com> wrote:

> Clarification request: Does that mean
> None of the code values ending in 0xFFFE and 0xFFFF refer to
> characters?
>
> or
>
> Not all of the code values ending in 0xFFFE and 0xFFFF refer to
> characters (i.e., some do and some do not)?

The first one. For all x where ((x & 0x00FFFE) == 0x00FFFE), x is not
a valid character.
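
The mask test above can be checked by brute force over the whole code space.
This is only a sketch: the function name is hypothetical, and it covers just
the code values ending in FFFE and FFFF discussed in this thread.

```python
MAX_CODE_POINT = 0x10FFFF

def ends_in_fffe_or_ffff(x: int) -> bool:
    """True if the low 16 bits of x are FFFE or FFFF."""
    return (x & 0x00FFFE) == 0x00FFFE

# Collect every code value in U+0000..U+10FFFF that matches the mask.
reserved = [x for x in range(MAX_CODE_POINT + 1) if ends_in_fffe_or_ffff(x)]

print(len(reserved))                                  # prints 34
print(" ".join(f"U+{x:04X}" for x in reserved[:4]))   # U+FFFE U+FFFF U+1FFFE U+1FFFF
```

The count comes out to two per plane across 17 planes, which matches the 34
subtracted earlier.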

BTW, it's interesting that the FAQ claims this is "for no good reason,"
when in fact I can think of a good reason to at least exclude the
code values ending in FFFE: expressed in UTF-32 little-endian at the
beginning of a file, such a value starts with the bytes FE FF, which
could fool an auto-detection scheme into thinking the file is UTF-16
big-endian.
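
To illustrate the confusion, here is a rough sketch in Python. The sniffer
below is a hypothetical, deliberately naive detector that looks only at the
first two bytes; it is not taken from any real auto-detection scheme.

```python
import codecs
from typing import Optional

def naive_sniff(data: bytes) -> Optional[str]:
    """Hypothetical detector: guesses UTF-16 byte order from the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    return None

# U+1FFFE written as a single UTF-32 little-endian code unit.
first_unit = (0x1FFFE).to_bytes(4, "little")

print(first_unit.hex(" "))        # prints fe ff 01 00
print(naive_sniff(first_unit))    # prints utf-16-be
```

The same thing happens for U+FFFE and for the FFFE value in every other
plane, since only the third byte changes.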

-Doug Ewell
 Fullerton, California


