Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

From: Philippe Verdy <>
Date: Sat, 9 May 2015 06:13:33 +0200

2015-05-09 5:13 GMT+02:00 Richard Wordingham <>:

> I can't think of a practical use for the specific concepts of Unicode
> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are
> essentially the same as 16-bit strings, and Unicode 32-bit strings are
> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in
> pedantry; there are more useful categories of 8-bit strings that are
> not UTF-8 strings.

And here you're wrong: a 16-bit string is just a sequence of arbitrary
16-bit code units, but a Unicode string (whatever the size of its code
units) adds restrictions for validity. In fact the only restriction is on
surrogates: in 16-bit strings (i.e. UTF-16) they must be paired, and in
32-bit (UTF-32) and 8-bit (UTF-8) strings they are not allowed at all.

So the concept of "Unicode string" is in fact the same as valid Unicode
text: it is a subset of possible strings, restricted by validation rules:
- for 8-bit strings (UTF-8) there are further constraints (not all byte
values are acceptable, some pairs of bytes are also restricted, and trailing
(continuation) bytes cannot occur alone)
- for 16-bit strings (UTF-16), the only constraint is on isolated/unpaired
surrogates
- for 32-bit strings (UTF-32), the only constraint is on the two allowed
ranges of encoded code points (U+0000..U+D7FF and U+E000..U+10FFFF).
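
To make the three rules above concrete, here is a minimal Python sketch of
one validator per code unit size. The function names are mine, chosen for
illustration only. For the 8-bit case I simply rely on Python's strict
"utf-8" decoder, which as far as I know rejects stray trailing bytes,
overlong forms and encoded surrogates.

```python
def is_valid_utf16(units):
    """True if every surrogate in a sequence of 16-bit code units is paired."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            return False                      # not a 16-bit code unit at all
        if 0xD800 <= u <= 0xDBFF:             # high (lead) surrogate
            # It must be followed immediately by a low (trail) surrogate.
            if i + 1 >= n or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:           # low surrogate without a lead
            return False
        else:
            i += 1
    return True


def is_valid_utf32(units):
    """True if every 32-bit code unit lies in one of the two allowed ranges."""
    return all(0x0000 <= u <= 0xD7FF or 0xE000 <= u <= 0x10FFFF for u in units)


def is_valid_utf8(data):
    """True if the bytes object decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")                  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False
```

For example, is_valid_utf16([0xD83D]) is False (isolated lead surrogate),
while is_valid_utf16([0xD83D, 0xDE00]) is True.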

To qualify as "plain text" there are additional restrictions: noncharacters
are also excluded, and only a small subset of controls (basically tabs and
newlines) is allowed. The other controls, including U+0000, are reserved
for private protocols and not designed for plain text, except
specifically in a few legacy 8-bit encoded "charsets" like VISCII, ISO
2022 or Videotext, which in fact need these controls to represent characters
as sequences, possibly with contextual encoding.
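
A rough Python sketch of that stricter "plain text" test follows. The exact
set of permitted controls is an assumption on my part (TAB, LF and CR, as
one reading of "basically tabs and newlines"), and the function names are
illustrative only.

```python
import unicodedata

# Assumption: "tabs and newlines" is read here as TAB, LF and CR only.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_plain_text(s):
    """True if s has no surrogates, noncharacters or disallowed controls."""
    for ch in s:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:            # lone surrogate in a Python str
            return False
        if is_noncharacter(cp):
            return False
        # General category Cc covers the C0 and C1 controls and U+007F.
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS:
            return False
    return True
```

With these assumptions, is_plain_text("a\x00b") is False because of the
U+0000, while ordinary text with tabs and line breaks passes.
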
Received on Fri May 08 2015 - 23:15:00 CDT