Re: Surrogates and noncharacters

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sun, 10 May 2015 11:23:41 +0100

On Sun, 10 May 2015 07:42:14 +0200
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

I am replying out of order for greater coherence of my reply.

> However I wonder what would be the effect of D80 in UTF-32: is
> <0xFFFFFFFF> a valid "32-bit string" ? After all it is also
> containing a single 32-bit code unit (for at least one Unicode
> encoding form), even if it has no "scalar value" and then does not
> have to validate D89 (for UTF-32)...

The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it
cannot represent a unit of encoded text in a UTF-32 string. By D77
paragraph 1, "Code unit: The minimal bit combination that can
represent a unit of encoded text for processing or interchange", it is
therefore not a code unit. The effect of D77, D80 and D83 is that
<0xFFFFFFFF> is a 32-bit string but not a Unicode 32-bit string.
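
As an illustration only, the distinction can be sketched in Python. This
assumes the reading of D77 given above, namely that a 32-bit value is a
UTF-32 code unit only if it can occur in well-formed UTF-32, which means
it is a Unicode scalar value. The function names are hypothetical and
are not taken from the standard.

```python
def is_utf32_code_unit(value: int) -> bool:
    """True if the value can occur in well-formed UTF-32.

    Assumption (the reading of D77 argued above): only Unicode scalar
    values qualify, i.e. 0..0x10FFFF excluding surrogates 0xD800..0xDFFF.
    """
    return 0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def is_32bit_string(units) -> bool:
    """Any sequence of values that each fit in 32 bits."""
    return all(0 <= u <= 0xFFFFFFFF for u in units)


def is_unicode_32bit_string(units) -> bool:
    """A 32-bit string whose every element is a UTF-32 code unit (D83)."""
    return all(is_utf32_code_unit(u) for u in units)


sample = [0xFFFFFFFF]
print(is_32bit_string(sample))          # a 32-bit string
print(is_unicode_32bit_string(sample))  # but not a Unicode 32-bit string
```

Under this reading a lone surrogate such as 0xD800 is rejected as well,
so every Unicode 32-bit string is also well-formed UTF-32.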

> - D80 defines "Unicode string" but in fact it just defines a generic
> "string" as an arbitrary stream of fixed-size code units.

No - see argument above.

> These two rules [D80 and D82 - RW] are not productive at all, except
> for saying that all values of fixed size code units are acceptable
> (including for example 0xFF in 8-bit strings, which is invalid in
> UTF-8)

Do you still maintain this reading of D77? D77 is not as clear as it
should be.

> <snip> D80 and D82 have no purpose, except adding the term "Unicode"
> redundantly to these expressions.

I have the cynical suspicion that these definitions were added to
preserve the interface definitions of routines processing UCS-2
strings when the transition to UTF-16 occurred. They can also have the
(intentional?) side-effect of making more work for UTF-8 and UTF-32
processing, because arbitrary 8-bit strings and 32-bit strings are not
Unicode strings.
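
A rough sketch of that asymmetry, again assuming that a "code unit" is a
value that can occur somewhere in well-formed text of the encoding form.
Every 16-bit value can occur in UTF-16 (surrogate values occur as halves
of surrogate pairs), whereas some byte values never occur in UTF-8. The
helper names are made up for the illustration.

```python
def can_occur_in_utf8(byte: int) -> bool:
    """Bytes 0xC0, 0xC1 and 0xF5..0xFF never occur in well-formed UTF-8."""
    return 0 <= byte <= 0xF4 and byte not in (0xC0, 0xC1)


def can_occur_in_utf16(unit: int) -> bool:
    """Every 16-bit value occurs in some well-formed UTF-16 sequence."""
    return 0 <= unit <= 0xFFFF


def is_unicode_8bit_string(units) -> bool:
    """Needs a per-element check: not every byte is a UTF-8 code unit."""
    return all(can_occur_in_utf8(u) for u in units)


def is_unicode_16bit_string(units) -> bool:
    """Trivially true for any sequence of 16-bit values."""
    return all(can_occur_in_utf16(u) for u in units)


print(is_unicode_8bit_string([0xFF]))     # 0xFF is not a UTF-8 code unit
print(is_unicode_16bit_string([0xD800]))  # a lone surrogate is still a code unit
```

Note that these checks are about code units only; neither one tests
whether the sequence is well-formed in its encoding form.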

Richard.
Received on Sun May 10 2015 - 05:25:04 CDT