Re: Running out of code points, redux (was: Re: Feedback on the proposal...) from Richard Wordingham via Unicode on 2017-06-01 (Unicode Mail List Archive)

From: Richard Wordingham via Unicode <unicode_at_unicode.org>
Date: Fri, 2 Jun 2017 04:32:35 +0100

On Thu, 1 Jun 2017 19:19:51 -0700
Ken Whistler via Unicode <unicode_at_unicode.org> wrote:

> > and therefore should start a
> > sequence of 6 characters.
>
> That is completely false, and has nothing to do with the current
> definition of UTF-8.
>
> The current, normative definition of UTF-8, in the Unicode Standard,
> and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly
> "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot
> start a sequence of anything identifiable as UTF-8.

TUS Chapter 3 is like the Augean Stables. It is a complete mess as a
standards document, imputing mental states to computing processes.

Table 3-7, for example, should be a consequence of a 'definition' that
UTF-8 only represents Unicode scalar values and excludes 'non-shortest
forms'. Instead, the exclusion of the sequence <ED A0 80> is presented
as a brute definition, rather than as a consequence of 0xD800 not being
a Unicode scalar value. Likewise, 0xFC fails to be legal because it
would define either a 'non-shortest form' or a value that is not a
Unicode scalar value.
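
To illustrate the derivation, here is a minimal sketch in Python. It is
not taken from TUS or from any implementation; the function names and
the MIN_VALUE table are my own, and the 1-to-6-byte bit layout is the
old RFC 2279 one, assumed here only so that 0xFC can be reasoned about
at all. A sequence is accepted only if it is in shortest form and
decodes to a Unicode scalar value.

```python
# Smallest value that genuinely needs a sequence of each length
# (RFC 2279 layout); anything lower is a 'non-shortest form'.
MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def decode_generalised(seq: bytes) -> int:
    """Decode exactly one sequence using the RFC 2279 bit layout."""
    if not seq:
        raise ValueError("empty sequence")
    lead = seq[0]
    if lead < 0x80:
        length, value = 1, lead
    elif 0xC0 <= lead <= 0xDF:
        length, value = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, value = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        length, value = 4, lead & 0x07
    elif 0xF8 <= lead <= 0xFB:
        length, value = 5, lead & 0x03
    elif 0xFC <= lead <= 0xFD:
        length, value = 6, lead & 0x01
    else:
        # 0x80..0xBF are continuation bytes; 0xFE and 0xFF are never used.
        raise ValueError("not a lead byte")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


def is_scalar_value(value: int) -> bool:
    """True for 0..0x10FFFF excluding the surrogates 0xD800..0xDFFF."""
    return 0 <= value <= 0x10FFFF and not (0xD800 <= value <= 0xDFFF)


def is_well_formed(seq: bytes) -> bool:
    """Shortest form AND Unicode scalar value; nothing else is needed."""
    try:
        value = decode_generalised(seq)
    except ValueError:
        return False
    return value >= MIN_VALUE[len(seq)] and is_scalar_value(value)
```

Under these rules <ED A0 80> is rejected simply because it decodes to
0xD800, and every sequence starting with 0xFC is rejected because it
either decodes to less than 0x4000000 (a non-shortest form) or to a
value above 0x10FFFF.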

The two formulations differ only in presentation; what they permit is
the same. What matters is whether the rules are comprehensible, because
a comprehensible definition is more likely to be implemented correctly.
Where the presentation does make a practical difference is in how
malformed sequences are naturally handled.
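
As a rough illustration of that last point, the Python sketch below
counts how many U+FFFD replacements each way of thinking tends to
produce. The first function uses Python 3's built-in decoder, which
rejects a byte as soon as it falls outside the ranges of the table. The
second is a hypothetical 'structural' decoder of my own, assuming the
old 1-to-6-byte lead-byte layout: it lets the lead byte fix the length,
gathers the continuation bytes, and only then judges the whole
sequence.

```python
def count_replacements_table(data: bytes) -> int:
    """U+FFFD count from Python's built-in, table-driven UTF-8 decoder."""
    return data.decode("utf-8", errors="replace").count("\ufffd")


def count_replacements_structural(data: bytes) -> int:
    """U+FFFD count if each lead byte claims its continuation bytes first.

    Hypothetical decoder, for illustration only: one replacement is
    counted per ill-formed sequence, however many bytes it spans.
    """
    count = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0xC0:
            n = 1  # ASCII, or a stray continuation byte
        elif lead < 0xE0:
            n = 2
        elif lead < 0xF0:
            n = 3
        elif lead < 0xF8:
            n = 4
        elif lead < 0xFC:
            n = 5
        elif lead < 0xFE:
            n = 6
        else:
            n = 1  # 0xFE and 0xFF never start a sequence
        # Gather up to n - 1 continuation bytes following the lead byte.
        j = i + 1
        while j < i + n and j < len(data) and data[j] & 0xC0 == 0x80:
            j += 1
        try:
            data[i:j].decode("utf-8")  # strict check of the whole sequence
        except UnicodeDecodeError:
            count += 1
        i = j
    return count


for sample in (b"\xed\xa0\x80", b"\xfc\x80\x80\x80\x80\x80"):
    print(sample.hex(" "),
          count_replacements_table(sample),
          count_replacements_structural(sample))
```

Both functions accept and reject exactly the same input; they differ
only in how many errors a given piece of rejected input is counted as.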

Richard.
Received on Thu Jun 01 2017 - 22:32:57 CDT