Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 9 May 2015 04:13:52 +0100

On Sat, 9 May 2015 02:26:59 +0200
Daniel Bünzli <daniel.buenzli_at_erratique.ch> wrote:

> On Saturday, 9 May 2015 at 00:37, Doug Ewell wrote:
> > Noncharacters are Unicode scalar values,

> (However, noncharacters are not designed to be openly interchanged; see
> "Restricted interchange" on p. 31 of 7.0.0.)

That didn't stop their being openly interchanged.

> > They may both be part of a "Unicode string" which does not claim to
> > be in any given encoding form.

> Not sure what you mean by that. So I'll let someone else answer.

There are a number of phrases whose declared meanings cannot be
deduced from the individual words. A UTF-8, UTF-16 or UTF-32 string
defines a sequence of scalar values. However, a Unicode 8-bit, 16-bit
or 32-bit string is merely a sequence of 8-bit, 16-bit or 32-bit
values that may occur in a UTF-8, UTF-16 or UTF-32 string
respectively. This definition has some odd consequences:

A Unicode 32-bit string is a UTF-32 string, for UTF-32 is not a
multi-word encoding. An arbitrary string of unsigned 32-bit values is
not in general a Unicode 32-bit string.

All strings of unsigned 16-bit values are Unicode 16-bit strings. Not
all (Unicode) 16-bit strings are UTF-16 strings.

Not all strings of unsigned 8-bit values are Unicode 8-bit strings, and
not all Unicode 8-bit strings are UTF-8 strings.
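
For anyone who wants to poke at these distinctions, here is a rough
Python sketch of the six categories as I read the definitions. The
function names are my own invention, not anything from the standard,
and the 32-bit check simply follows the reading above that a Unicode
32-bit string is the same thing as a UTF-32 string. Code units are
passed as bytes (8-bit) or as sequences of non-negative integers
(16-bit and 32-bit).

```python
# Hypothetical helper names; a sketch of the definitions, not a
# reference implementation.

# Byte values that never occur in well-formed UTF-8: C0, C1, F5..FF.
NEVER_IN_UTF8 = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def is_unicode_8bit_string(data: bytes) -> bool:
    """Every byte is one that may occur somewhere in a UTF-8 string."""
    return not any(b in NEVER_IN_UTF8 for b in data)


def is_utf8_string(data: bytes) -> bool:
    """Well-formed UTF-8, as judged by Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_unicode_16bit_string(units) -> bool:
    """Any sequence of 16-bit values qualifies."""
    return all(0 <= u <= 0xFFFF for u in units)


def is_utf16_string(units) -> bool:
    """16-bit values in which every surrogate is properly paired."""
    if not is_unicode_16bit_string(units):
        return False
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True


def is_utf32_string(units) -> bool:
    """Every value is a Unicode scalar value (no surrogates, <= 10FFFF)."""
    return all(
        0 <= u <= 0x10FFFF and not (0xD800 <= u <= 0xDFFF) for u in units
    )


# On the reading above, the two 32-bit categories coincide.
is_unicode_32bit_string = is_utf32_string
```

A lone continuation byte such as b"\x80" passes
is_unicode_8bit_string but fails is_utf8_string, and a lone surrogate
such as [0xD800] passes is_unicode_16bit_string but fails
is_utf16_string. Noncharacters such as U+FFFE pass all the UTF checks,
since they are scalar values.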

I can't think of a practical use for the specific concepts of Unicode
8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are
essentially the same as 16-bit strings, and Unicode 32-bit strings are
UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in
pedantry; there are more useful categories of 8-bit strings that are
not UTF-8 strings.

Richard.
Received on Fri May 08 2015 - 22:15:23 CDT