Re: Corrigendum #9

From: Richard Wordingham <>
Date: Wed, 25 Jun 2014 18:58:55 +0100

On Tue, 24 Jun 2014 09:16:00 -0400
CE Whitehead <> wrote:

> ME: if two sequences are canonically equivalent except that one has
> noncharacters in it, are these still canonically equivalent?

Canonical equivalences are defined for all sequences of scalar values;
it is just that it changes from version to version for most unassigned

Non-characters only decompose to themselves and do not
occur in the canonical (or indeed compatibility) decomposition of
anything else, so a sequence containing a non-character cannot be
canonically equivalent to a sequence not containing a non-character.
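
As a rough illustration of the point about decomposition, here is a small
Python sketch using the standard unicodedata module. The helper names
(is_noncharacter, is_normalization_stable) are made up for this example, and
the results reflect whichever Unicode version the Python build ships with.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# All 66 noncharacters: 32 in the BMP block plus 2 at the end of each of 17 planes.
NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]

def is_normalization_stable(cp: int) -> bool:
    """True if the code point is unchanged by all four normalization forms."""
    ch = chr(cp)
    return all(
        unicodedata.normalize(form, ch) == ch
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    )

# A noncharacter embedded in a sequence survives normalization untouched,
# while the rest of the sequence is still composed or decomposed as usual.
composed = "\u00e9\uffff"      # e-acute followed by U+FFFF
decomposed = "e\u0301\uffff"   # e, combining acute, U+FFFF
```

Here composed and decomposed are canonically equivalent to each other, but
neither can be equivalent to a string that lacks U+FFFF, since normalization
never adds or removes the noncharacter.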

> Regarding the sentinels; I am an outsider but assume that with
> Corrigendum 9 U+FFFE will continue to be mentioned as having
> generally (not always?) standard use throughout; in Chapter 16.7 it
> is currently mentioned; I assume it will still be -- according to
> info. in the FAQ and elsewhere:
> "U+FFFE. The 16-bit
> unsigned hexadecimal value U+FFFE is not a Unicode character value,
> and should be taken as a signal that Unicode characters should be
> byte-swapped before interpretation. U+FFFE should only be interpreted
> as an incorrectly byte-swapped version of U+FEFF"

There is a lot of untruth in that FAQ entry, alas. I think U+FFFE
and possibly U+FFFF should be treated differently to the other 64
non-characters. At present there is no certainty as to whether
an interchanged file in the UTF-16 encoding scheme that appears to
contain a BOM contains a BOM or starts with U+FFFE. The only
promise is that such a file contains an even number of data bytes.
Any such sequence is valid! Will the UTF-16 encoding scheme be
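
The ambiguity described above can be sketched in Python. The byte string and
the helper name (interpretations) are invented for illustration, and the
sketch relies on CPython's utf-16 codecs, which accept noncharacters when
decoding.

```python
def interpretations(data: bytes) -> dict:
    """Decode the same bytes under two readings of the UTF-16 encoding scheme."""
    return {
        # Reading 1: the first two bytes FF FE are a little-endian BOM,
        # which is consumed, and the rest is UTF-16LE text.
        "bom_then_le": data[2:].decode("utf-16-le"),
        # Reading 2: there is no BOM, the data is big-endian,
        # and the text itself starts with the noncharacter U+FFFE.
        "no_bom_be": data.decode("utf-16-be"),
    }

sample = bytes([0xFF, 0xFE, 0x41, 0x00])
result = interpretations(sample)
# result["bom_then_le"] is "A"
# result["no_bom_be"] is "\ufffe\u4100"
```

Both readings decode without error, so the bytes alone do not say which one
the sender meant. Python's generic "utf-16" codec picks the BOM reading.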

Unicode mailing list
Received on Wed Jun 25 2014 - 13:01:09 CDT
