Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

From: Ken Whistler via Unicode <unicode_at_unicode.org>
Date: Thu, 1 Jun 2017 17:10:54 -0700

On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:
> You were implicitly invited to argue that there was no need to handle
> 5 and 6 byte invalid sequences.
>

Well, working from the *current* specification:

FC 80 80 80 80 80
and
FF FF FF FF FF FF

are equal trash, uninterpretable as *anything* in UTF-8.

By definition D93b, either sequence of bytes, if encountered by a
conformant UTF-8 conversion process, would be interpreted as a sequence
of 6 maximal subparts of an ill-formed subsequence. Whatever your
particular strategy for conversion fallbacks for uninterpretable
sequences, it ought to treat either one of those trash sequences the
same, in my book.
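
As a quick illustration of that point, here is a minimal sketch using
Python's built-in UTF-8 decoder with errors="replace". It assumes a
current CPython, whose decoder substitutes one U+FFFD per maximal
subpart; the helper name fallback() is made up for this example and is
not part of any specification.

```python
def fallback(hex_bytes: str) -> str:
    """Decode a hex-encoded byte string as UTF-8, substituting U+FFFD
    for each piece the decoder cannot interpret."""
    return bytes.fromhex(hex_bytes).decode("utf-8", errors="replace")


# The two trash sequences from above. Neither FC nor FF can start a
# well-formed UTF-8 sequence, and a bare 80 cannot either, so each byte
# ends up as its own one-byte maximal subpart.
for seq in ("FC 80 80 80 80 80", "FF FF FF FF FF FF"):
    count = fallback(seq).count("\ufffd")
    print(f"{seq} -> {count} replacement characters")
```

Under that assumption both lines report 6 replacement characters, so
the fallback output gives no hint that one of the inputs once looked
like a 6-byte form.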

I don't see a good reason to build in special logic to treat FC 80 80 80
80 80 as somehow privileged as a unit for conversion fallback, simply
because *if* UTF-8 were defined as the Unix gods intended (which it
ain't no longer) then that sequence *could* be interpreted as an
out-of-bounds scalar value (which it ain't) on spec that the codespace
*might* be extended past 10FFFF at some indefinite time in the future
(which it won't).

--Ken
Received on Thu Jun 01 2017 - 19:11:20 CDT
