Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Fri, 8 May 2015 21:37:40 -0700

On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2015-05-09 5:13 GMT+02:00 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
>
>> I can't think of a practical use for the specific concepts of Unicode
>> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are
>> essentially the same as 16-bit strings, and Unicode 32-bit strings are
>> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in
>> pedantry; there are more useful categories of 8-bit strings that are
>> not UTF-8 strings.
>>
>
> And here you're wrong: a 16-bit string is just a sequence of arbitrary
> 16-bit code units, but a Unicode string (whatever the size of its code
> units) adds restrictions for validity. In fact the only restriction is
> that surrogates, when present in 16-bit strings (i.e. UTF-16), must be
> paired; in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are
> forbidden.
>

No, Richard had it right. See for example definition D82 "Unicode 16-bit
string" in the standard. (Section 3.9 Unicode Encoding Forms,
http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf)
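
To illustrate the distinction, here is a minimal Python sketch. It is not
taken from the standard; the function name is_well_formed_utf16 and the two
sample sequences are made up for this example. Both sequences consist only
of 16-bit code units, so both are Unicode 16-bit strings in the D82 sense,
but only the first one is also well-formed UTF-16.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate code unit in `units` is correctly paired.

    `units` is a sequence of integers, each expected to be a 16-bit code unit.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %r" % (u,))
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: must be followed directly by a trail surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate without a preceding lead surrogate.
            return False
        i += 1
    return True


# Hypothetical sample data: both are sequences of 16-bit code units.
paired = [0x0041, 0xD83D, 0xDE00]  # "A" followed by a surrogate pair
lone = [0x0041, 0xD800, 0x0042]    # "A", an unpaired lead surrogate, "B"

print(is_well_formed_utf16(paired))
print(is_well_formed_utf16(lone))
```

The check only looks at surrogate pairing, which is the property that
separates well-formed UTF-16 from an arbitrary sequence of 16-bit code units.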

I agree that the definitions for Unicode 8-bit and 32-bit strings are not
particularly useful.

> For being "plain-text" there are additional restrictions: non-characters
> are also excluded, and only a small subset of controls (basically tabs and
> newlines) is allowed (the other controls, including U+0000 are restricted
> for private protocols and not designed for plain text... except
> specifically in a few legacy encoded 8-bit "charsets" like VISCII or ISO
> 2022 or Videotext which need these controls in fact to represent characters
> into sequences, possibly with contextual encoding).
>

Where did you find that definition of "plain text"?
Unicode just defines "plain text" by contrast with "rich text" which is
text with markup or other such structure. There is no limitation of code
points associated with that term.
http://unicode.org/glossary/#plain_text
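
A small Python sketch of the same point at the encoding level, assuming
Python 3's built-in UTF-8 codec as the reference behavior (the helper name
round_trips_utf8 is made up for this example): noncharacters and U+0000 are
ordinary code points as far as UTF-8 is concerned, while a surrogate code
point is rejected by the encoder.

```python
def round_trips_utf8(code_point):
    """Return True if the code point survives a UTF-8 encode/decode round trip."""
    s = chr(code_point)
    try:
        return s.encode("utf-8").decode("utf-8") == s
    except UnicodeError:
        # Python's UTF-8 codec refuses surrogate code points.
        return False


for cp in (0x0000, 0xFDD0, 0xFFFE, 0xFFFF, 0xD800):
    print("U+%04X" % cp, round_trips_utf8(cp))
```

Whether a particular protocol or application chooses to accept such code
points in its text is a separate question from what the encoding forms allow.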

markus
Received on Fri May 08 2015 - 23:38:36 CDT
