Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

From: Ken Whistler via Unicode <unicode_at_unicode.org>
Date: Thu, 1 Jun 2017 19:19:51 -0700

On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote:
>> By definition D39b, either sequence of bytes, if encountered by a
>> conformant UTF-8 conversion process, would be interpreted as a
>> sequence of 6 maximal subparts of an ill-formed subsequence.
> ("D39b" is a typo for "D93b".)

Sorry about that. :)

>
> Conformant with what? There is no mandatory *requirement* for a UTF-8
> conversion process conformant with Unicode to have any concept of
> 'maximal subpart'.

Conformant with the definition of UTF-8. I agree that nothing forces a
conversion *process* to care anything about maximal subparts, but if
*any* process using a conformant definition of UTF-8 then goes on to
have any concept of "maximal subpart of an ill-formed subsequence" that
departs from definition D93b in the Unicode Standard, then it is just
making s**t up.
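
To make the D93b notion concrete, here is a minimal sketch in Python of
splitting a byte string into well-formed sequences and maximal subparts
of ill-formed subsequences. It assumes the lead-byte and second-byte
ranges of Table 3-7 in the Unicode Standard; the function names
sequence_length and split_utf8 are hypothetical and come from no
library. It is an illustration, not a reference implementation.

```python
def sequence_length(lead):
    """Length implied by a lead byte (Table 3-7), or None if the byte
    cannot start a well-formed UTF-8 sequence."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return None  # 80..C1 and F5..FF


def second_byte_range(lead):
    """Allowed range for the byte following the lead byte (Table 3-7)."""
    if lead == 0xE0:
        return (0xA0, 0xBF)  # excludes overlong 3-byte forms
    if lead == 0xED:
        return (0x80, 0x9F)  # excludes surrogates
    if lead == 0xF0:
        return (0x90, 0xBF)  # excludes overlong 4-byte forms
    if lead == 0xF4:
        return (0x80, 0x8F)  # excludes values above 10FFFF
    return (0x80, 0xBF)


def split_utf8(data):
    """Return a list of (chunk, well_formed) pairs. Each chunk that is
    not well formed is a maximal subpart in the sense of D93b."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        length = sequence_length(lead)
        if length is None:
            # Case (b) of D93b: a subsequence of length one.
            out.append((data[i:i + 1], False))
            i += 1
            continue
        # Case (a): extend while the bytes still form the initial
        # subsequence of some well-formed sequence.
        j = i + 1
        lo, hi = second_byte_range(lead)
        while j < i + length and j < n and lo <= data[j] <= hi:
            j += 1
            lo, hi = 0x80, 0xBF  # later trailing bytes are unrestricted
        out.append((data[i:j], j - i == length))
        i = j
    return out


print(len(split_utf8(bytes.fromhex("fc8080808080"))))  # prints 6
```

Under these assumptions FC 80 80 80 80 80 yields six one-byte chunks,
none of them well formed, whereas a truncated sequence such as E2 82
followed by 41 yields the two-byte maximal subpart E2 82 and then the
well-formed 41.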

>
>> I don't see a good reason to build in special logic to treat FC 80 80
>> 80 80 80 as somehow privileged as a unit for conversion fallback,
>> simply because *if* UTF-8 were defined as the Unix gods intended
>> (which it ain't no longer) then that sequence *could* be interpreted
>> as an out-of-bounds scalar value (which it ain't) on spec that the
>> codespace *might* be extended past 10FFFF at some indefinite time in
>> the future (which it won't).
> Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
> invalid sequence.

That would be equally true of FF FF FF FF FF FF. Which was my point,
actually.
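
As a quick check of that equivalence, the following sketch feeds both
byte sequences to an existing decoder with U+FFFD substitution. It
assumes CPython 3.8 or later, whose built-in UTF-8 codec substitutes
one U+FFFD per maximal subpart; other decoders may legitimately make a
different choice, since the substitution practice is a recommendation.

```python
# The two ill-formed sequences discussed in the thread.
old_style_lead = bytes([0xFC, 0x80, 0x80, 0x80, 0x80, 0x80])
all_ff = bytes([0xFF] * 6)

for raw in (old_style_lead, all_ff):
    # errors="replace" substitutes U+FFFD for each undecodable chunk.
    decoded = raw.decode("utf-8", errors="replace")
    print(raw.hex(" "), "->", len(decoded), "replacement characters")
```

Both lines report 6 replacement characters: the decoder gives FC no
more standing as the start of a unit than it gives FF.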

> FC is not ASCII,

True, of course. But irrelevant. Because we are talking about UTF-8
here. And the fact that some non-UTF-8 character encoding happens to
include 0xFC as a valid (or invalid) value need not require any
special case processing. A simple 8-bit to 8-bit conversion table could
be completely regular in its processing of 0xFC for a conversion.

> and has more than one leading bit
> set. It has the six leading bits set,

True, of course.

> and therefore should start a
> sequence of 6 characters.

That is completely false, and has nothing to do with the current
definition of UTF-8.

The current, normative definition of UTF-8, in the Unicode Standard, and
in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly "obsoletes and
replaces RFC 2279") states clearly that 0xFC cannot start a sequence of
anything identifiable as UTF-8.
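
For reference, the byte values that never occur anywhere in well-formed
UTF-8 under the current definition can be written down directly. The
sketch below assumes Table 3-7 of the Unicode Standard (equivalently,
the syntax in RFC 3629); the names NEVER_IN_UTF8 and can_appear_in_utf8
are hypothetical.

```python
# C0 and C1 could only introduce overlong two-byte forms; F5..FF would
# encode values above 10FFFF or belong to the obsolete five- and
# six-byte forms of RFC 2279.
NEVER_IN_UTF8 = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def can_appear_in_utf8(byte):
    """True if the byte value occurs in some well-formed UTF-8 sequence."""
    return byte not in NEVER_IN_UTF8


print(can_appear_in_utf8(0xFC))  # prints False
print(can_appear_in_utf8(0xF4))  # prints True
```

The highest lead byte actually in use is F4, which begins the encoding
of U+10FFFF (F4 8F BF BF).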

--Ken

>
> Richard.
>
Received on Thu Jun 01 2017 - 21:20:11 CDT
