Re: Surrogates and noncharacters from Richard Wordingham on 2015-05-09 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 9 May 2015 16:51:21 +0100

On Sat, 9 May 2015 16:54:30 +0200
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2015-05-09 16:26 GMT+02:00 Richard Wordingham <
> richard.wordingham_at_ntlworld.com>:
>
> > In particular, I claim that all 6 permutations of <D800, 0054, DCC1>
> > are Unicode strings, but that only two, namely <D800, DCC1, 0054>
> > and <0054, D800, DCC1>, are UTF-16 strings.
> >
>
> Again you use "Unicode strings" for your 6 permutations, but in your
> example they have nothing that makes them "Unicode strings", given you
> allow arbitrary code units in arbitrary order, including unpaired
> ones. The 6 permutations are just "16-bit strings" (adding "Unicode"
> for these 6 permutations gives absolutely no value if you keep your
> definition, but visibly it cannot fit with the term used in the RFC
> trying to normalize JSON, with similar confusions!).

> TUS does not define what is a "Unicode string" like you do here.

D80 _Unicode string:_ A code unit sequence containing code units of
a particular Unicode encoding form

RW: Note that by this definition, a permutation of a Unicode string is
a Unicode string.
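
To illustrate the note above, here is a minimal Python sketch. It assumes
that "UTF-16 code unit" means any integer in the range 0000..FFFF; the
function name is made up for this example and is not taken from TUS. The
check looks only at each unit on its own, never at order or surrogate
pairing, so every permutation passes.

```python
from itertools import permutations

def is_unicode_16bit_string(units):
    """Hypothetical D82-style check: every element is a 16-bit code unit.

    Order and surrogate pairing are deliberately not examined.
    """
    return all(isinstance(u, int) and 0 <= u <= 0xFFFF for u in units)

EXAMPLE = (0xD800, 0x0054, 0xDCC1)
ALL_PERMUTATIONS = list(permutations(EXAMPLE))

# One boolean per permutation, in the order itertools generates them.
RESULTS = [is_unicode_16bit_string(p) for p in ALL_PERMUTATIONS]

for p, ok in zip(ALL_PERMUTATIONS, RESULTS):
    print(" ".join("%04X" % u for u in p), ok)
```

All six lines print True, because a per-unit test cannot tell one ordering
from another.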

D82 _Unicode 16-bit string:_ A Unicode string containing only UTF-16
code units.

D85 _Well-formed:_ A Unicode code unit sequence that purports to be
in a Unicode encoding form is called well-formed if and only if it
_does_ follow the specification of that Unicode encoding form

D89 _In a Unicode encoding form:_ A Unicode string is said to be in
a particular Unicode encoding form if and only if it consists of a
well-formed Unicode code unit sequence of that Unicode encoding form.
• A Unicode string consisting of a well-formed UTF-8 code unit
sequence is said to be _in UTF-8_. Such a Unicode string is referred to
as a _valid UTF-8 string_, or a _UTF-8 string_ for short.
• A Unicode string consisting of a well-formed UTF-16 code unit
sequence is said to be _in UTF-16_. Such a Unicode string is referred to
as a _valid UTF-16 string_, or a _UTF-16 string_ for short.
• A Unicode string consisting of a well-formed UTF-32 code unit
sequence is said to be _in UTF-32_. Such a Unicode string is referred to
as a _valid UTF-32 string_, or a _UTF-32 string_ for short.
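
As a rough illustration of D89, the sketch below uses Python's strict
codecs as a stand-in for "well-formed". That the codecs agree with TUS on
well-formedness is an assumption of this example, and the helper names
are invented for it. The code units are packed big-endian and decoded as
UTF-16; a sequence counts as a UTF-16 string only if decoding succeeds.

```python
import struct
from itertools import permutations

def decodes_strictly(data, codec):
    """Return True if the bytes decode under the codec with strict error handling."""
    try:
        data.decode(codec, errors="strict")
    except UnicodeDecodeError:
        return False
    return True

def is_utf16_string(units):
    """Hypothetical D89-style check for a sequence of 16-bit code units."""
    # Serialise the code units as big-endian 16-bit values, then decode.
    data = struct.pack(">%dH" % len(units), *units)
    return decodes_strictly(data, "utf-16-be")

EXAMPLE = (0xD800, 0x0054, 0xDCC1)

# Keep only the permutations in which the surrogates form a proper pair.
UTF16_STRINGS = [p for p in permutations(EXAMPLE) if is_utf16_string(p)]

for p in UTF16_STRINGS:
    print(" ".join("%04X" % u for u in p))

# The same idea for 8-bit code units: C0 80 is made of UTF-8 code units
# in the D80 sense, yet a strict UTF-8 decoder rejects it.
OVERLONG_IS_UTF8 = decodes_strictly(b"\xc0\x80", "utf-8")
```

Only <D800, DCC1, 0054> and <0054, D800, DCC1> survive the filter; the
other four permutations contain an unpaired surrogate.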

> TUS just defines "Unicode 16-bit strings" with a direct reference to
> UTF-16 (which implies conformance and only accepts the latter two
> strings, that TUS names "Unicode 16-bit strings", not "UTF-16
> strings"...)

Look at D82 again. It refers to UTF-16 code units and does not
otherwise reference UTF-16.

If you still do not believe me, consider D89. Can you think of an
example of a Unicode string consisting of UTF-8 code units, UTF-16
code units or UTF-32 code units that is not a UTF-8 string, not a
UTF-16 string and not a UTF-32 string? If you can't, the use of
"well-formed" is curiously redundant in D89.

Richard.
Received on Sat May 09 2015 - 10:53:20 CDT
