Re: Surrogates and noncharacters

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 9 May 2015 10:59:57 +0100

On Sat, 9 May 2015 07:55:17 +0200
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2015-05-09 6:37 GMT+02:00 Markus Scherer <markus.icu_at_gmail.com>:
>
> > On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy <verdy_p_at_wanadoo.fr>
> > wrote:

> >> 2015-05-09 5:13 GMT+02:00 Richard Wordingham <
> >> richard.wordingham_at_ntlworld.com>:

WARNING: This post belongs in pedants' corner, or possibly a pantomime.

> >>> I can't think of a practical use for the specific concepts of
> >>> Unicode 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings
> >>> are essentially the same as 16-bit strings, and Unicode 32-bit
> >>> strings are UTF-32 strings. 'Unicode 8-bit string' strikes me
> >>> as an exercise in pedantry; there are more useful categories of
> >>> 8-bit strings that are not UTF-8 strings.

> >> And here you're wrong: a 16-bit string is just a sequence of
> >> arbitrary 16-bit code units, but an Unicode string (whatever the
> >> size of its code units) adds restrictions for validity (the only
> >> restriction being in fact that surrogates (when present in 16-bit
> >> strings, i.e. UTF-16) must be paired, and in 32-bit (UTF-32) and
> >> 8-bit (UTF-8) strings, surrogates are forbidden).

You are thinking of a Unicode string as a sequence of codepoints. That
may be the linguistically natural interpretation of 'Unicode string',
but the term has a different, formal definition, given in D80. A
'Unicode string' (D80) is a sequence of code-units occurring in some
Unicode encoding form. By this definition, every permutation of the
code-units in a Unicode string is itself a Unicode string. UTF-16 is
unique in that every possible code-unit value corresponds to a
codepoint. (We could extend the Unicode codespace (D9, D10) by adding
integers for the bytes of multibyte UTF-8 sequences, but I see no
benefit.)
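
By way of illustration, a toy sketch in Python (the helper function is
my own, not anything in TUS):

    # Under D80, validity is a property of the individual code units, so
    # shuffling the code units of a Unicode string yields another Unicode
    # string -- though usually not a well-formed one.
    import random

    def utf16_code_units(s):
        """UTF-16LE code units of a Python str, as a list of ints."""
        raw = s.encode('utf-16-le')
        return [int.from_bytes(raw[i:i + 2], 'little')
                for i in range(0, len(raw), 2)]

    units = utf16_code_units('a\U00010000')   # [0x0061, 0xD800, 0xDC00]
    random.shuffle(units)                     # e.g. [0xDC00, 0x0061, 0xD800]
    print(all(0 <= u <= 0xFFFF for u in units))
    # True: each value is still a valid UTF-16 code unit, so this is still
    # a Unicode 16-bit string, though probably not well-formed UTF-16.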

A Unicode 8-bit string may have no interpretation as a sequence of
codepoints. For example, the 8-bit string <C2, A0> is a Unicode 8-bit
string encoding a sequence of one Unicode scalar value, namely U+00A0.
<A0, A0> is therefore also a Unicode 8-bit string (each of its bytes
occurs in well-formed UTF-8), but it has no defined or obvious
interpretation as a sequence of codepoints; it is *not* a UTF-8
string. The string <E0, 80, 80> is also a Unicode 8-bit string, but is
not a UTF-8 string because the sequence is not the shortest
representation of U+0000. The 8-bit string <C0, 80> is *not* a Unicode
8-bit string, for the byte C0 does not occur in well-formed UTF-8; one
does not even need to note that it is not the shortest representation
of U+0000.
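
In code, the byte-level test looks something like this (Python; the set
of excluded bytes follows Table 3-7 of TUS, the function names are my
own):

    # A Unicode 8-bit string may contain only bytes that can occur
    # somewhere in well-formed UTF-8; C0, C1 and F5..FF never do.
    INVALID_UTF8_BYTES = {0xC0, 0xC1} | set(range(0xF5, 0x100))

    def is_unicode_8bit_string(data: bytes) -> bool:
        """Each byte is a valid UTF-8 code unit; the sequence itself
        need not be well-formed UTF-8."""
        return all(b not in INVALID_UTF8_BYTES for b in data)

    def is_utf8_string(data: bytes) -> bool:
        """Well-formed UTF-8 (Python's decoder rejects overlong forms
        and surrogates)."""
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    for s in (b'\xC2\xA0', b'\xA0\xA0', b'\xE0\x80\x80', b'\xC0\x80'):
        print(s, is_unicode_8bit_string(s), is_utf8_string(s))
    # C2 A0    -> True,  True   (UTF-8 for U+00A0)
    # A0 A0    -> True,  False  (Unicode 8-bit string, not UTF-8)
    # E0 80 80 -> True,  False  (non-shortest form of U+0000)
    # C0 80    -> False, False  (C0 never occurs in well-formed UTF-8)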

> > No, Richard had it right. See for example definition D82 "Unicode
> > 16-bit string" in the standard. (Section 3.9 Unicode Encoding Forms,
> > http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf)

> I was right, D82 refers to "UTF-16", which implies the restriction of
> validity, i.e. NO isolated/unpaired surrogates, (but no exclusion of
> non-characters).

No, D82 merely requires that each 16-bit value be a valid UTF-16 code
unit. Unicode strings, and Unicode 16-bit strings in particular, need
not be well-formed. For x = 8, 16, 32, a 'UTF-x string', equivalently a
'valid UTF-x string', is one that is well-formed in UTF-x.
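
To spell out the difference for the 16-bit case, a rough Python sketch
(the function names are my own):

    def is_unicode_16bit_string(units):
        """D82: each element is a valid UTF-16 code unit.  Every value
        in 0x0000..0xFFFF occurs in some well-formed UTF-16, so this is
        just a range check."""
        return all(0x0000 <= u <= 0xFFFF for u in units)

    def is_utf16_string(units):
        """Well-formed UTF-16: surrogates must be correctly paired."""
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:      # lead (high) surrogate
                if i + 1 == len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                    return False
                i += 2
            elif 0xDC00 <= u <= 0xDFFF:    # isolated trail (low) surrogate
                return False
            else:
                i += 1
        return True

    s = [0x0041, 0xD800]   # 'A' plus an unpaired lead surrogate
    print(is_unicode_16bit_string(s), is_utf16_string(s))   # True False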

> I was right, You and Richard were wrong.

I stand by my explanation. I wrote it with TUS open at the definitions
by my side.

Richard.