Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 9 May 2015 07:55:17 +0200

2015-05-09 6:37 GMT+02:00 Markus Scherer <markus.icu_at_gmail.com>:

> On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
>> 2015-05-09 5:13 GMT+02:00 Richard Wordingham <
>> richard.wordingham_at_ntlworld.com>:
>>
>>> I can't think of a practical use for the specific concepts of Unicode
>>> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are
>>> essentially the same as 16-bit strings, and Unicode 32-bit strings are
>>> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in
>>> pedantry; there are more useful categories of 8-bit strings that are
>>> not UTF-8 strings.
>>>
>>
>> And here you're wrong: a 16-bit string is just a sequence of arbitrary
>> 16-bit code units, but a Unicode string (whatever the size of its code
>> units) adds restrictions for validity. In fact the only restriction is
>> that surrogates, when present in 16-bit strings (i.e. UTF-16), must be
>> paired, and that in 32-bit (UTF-32) and 8-bit (UTF-8) strings surrogates
>> are forbidden.
>>
>
> No, Richard had it right. See for example definition D82 "Unicode 16-bit
> string" in the standard. (Section 3.9 Unicode Encoding Forms,
> http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf)
>

I was right: D82 refers to "UTF-16", which implies the validity
restriction, i.e. NO isolated/unpaired surrogates (but no exclusion of
noncharacters).
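
To make the restriction I am describing concrete, here is a minimal sketch
in Python of a check for unpaired surrogates in a sequence of 16-bit code
units. The function name is hypothetical and chosen for illustration only;
the sketch shows the pairing rule as I stated it, and it does not by itself
settle which definition in the standard applies.

```python
def has_only_paired_surrogates(units):
    """Return True if no surrogate code unit in `units` is unpaired.

    `units` is a sequence of integers in the range 0..0xFFFF.
    Noncharacters such as U+FFFE and U+FFFF are deliberately accepted.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %r" % (u,))
        if 0xD800 <= u <= 0xDBFF:
            # A lead surrogate must be immediately followed by a trail surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A trail surrogate with no preceding lead surrogate is isolated.
            return False
        i += 1
    return True
```

With this sketch, [0x0041, 0xD83D, 0xDE00] passes, [0xD800] alone fails,
and [0xFFFE, 0xFFFF] passes because noncharacters are not excluded.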

I was right; you and Richard were wrong.
Received on Sat May 09 2015 - 00:57:48 CDT
