Re: Surrogates and noncharacters

From: Hans Aberg <haberg-1_at_telia.com>
Date: Mon, 11 May 2015 20:05:23 +0200

> On 11 May 2015, at 19:44, Doug Ewell <doug_at_ewellic.org> wrote:
>
> Hans Aberg <haberg dash 1 at telia dot com> wrote:
>
>>>> However I wonder what would be the effect of D80 in UTF-32: is
>>>> <0xFFFFFFFF> a valid "32-bit string" ?
>>>
>>> The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it
>>> cannot represent a unit of encoded text in a UTF-32 string.
>>
>> Even though the values with highest bit set are not a part of original
>> UTF-32, it can easily be extended also to original UTF-8, which may be
>> simpler to implement.
>
> "Original UTF-8," regardless of where defined, only ever encoded scalar
> values up to 0x7FFFFFFF. See, for example, RFC 2279.

The intended meaning is that original UTF-8 can also be extended to the full 32-bit range by using 6-byte sequences whose leading byte has the bit pattern 111111xx.
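
A minimal sketch of that extension, to make the bit layout concrete. This is not part of any UTF-8 standard: RFC 2279 only defines the 6-byte leading byte 1111110x (31 bits in total), and the function names encode_6byte and decode_6byte are made up for illustration. The sketch assumes the leading byte 111111xx carries the top 2 bits and the five continuation bytes 10xxxxxx carry 6 bits each, giving 2 + 30 = 32 bits. It handles only the 6-byte form and does not choose shorter sequences for smaller values.

```python
def encode_6byte(value: int) -> bytes:
    """Encode a 32-bit value as a hypothetical extended 6-byte sequence."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    # Leading byte 111111xx: the two low bits hold bits 31..30 of the value.
    lead = 0xFC | (value >> 30)
    # Five continuation bytes 10xxxxxx: 6 bits each, covering bits 29..0.
    tail = [0x80 | ((value >> shift) & 0x3F) for shift in (24, 18, 12, 6, 0)]
    return bytes([lead] + tail)


def decode_6byte(data: bytes) -> int:
    """Decode a sequence produced by encode_6byte back to its 32-bit value."""
    if len(data) != 6 or (data[0] & 0xFC) != 0xFC:
        raise ValueError("not a 6-byte sequence with leading byte 111111xx")
    value = data[0] & 0x03
    for byte in data[1:]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value
```

Under these assumptions, values up to 0x7FFFFFFF produce the same bytes as the RFC 2279 6-byte form (leading byte 0xFC or 0xFD), while values with the highest bit set use the leading bytes 0xFE and 0xFF, which RFC 2279 leaves unused.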
Received on Mon May 11 2015 - 13:06:33 CDT
