Encoding/Use of potential unpaired UTF-16 surrogate pair specifiers

From: J Decker <d3ck0r_at_gmail.com>
Date: Sat, 30 Jan 2016 06:40:23 -0800

I do see that the code points D800-DFFF should not be encoded in any
UTF format (UTF8/32)...

UTF8 has a way to define any byte that might otherwise be used as an
encoding byte.

UTF16 has no way to define a code point that is D800-DFFF; this is an
issue if I want to apply some sort of encryption algorithm and still
have the result treated as text for transmission and encoding to other
string systems.
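As a quick illustration of the point above (a sketch in Python, not anything from the standard itself): Python's built-in codecs refuse a lone surrogate in strict mode. The helper name can_encode is made up for this example, and "surrogatepass" is a Python-specific error handler, not a Unicode mechanism.

```python
# Check how Python's strict codecs treat a lone surrogate code point.
def can_encode(text, encoding):
    """Return True if text encodes under the codec's default strict mode."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

lone = "\ud800"  # a single, unpaired high surrogate

# Try the same one-character string against each strict codec.
strict_results = {enc: can_encode(lone, enc)
                  for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# "surrogatepass" is a Python-only escape hatch: it writes the raw code
# unit, but the result is not well-formed UTF-16 and other decoders may
# reject it.
raw_units = lone.encode("utf-16-le", "surrogatepass")
```

Here strict_results should come out False for every codec, which is the limitation described above: once the value has to travel as valid UTF text, D800-DFFF cannot be represented directly.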

http://www.azillionmonkeys.com/qed/unicode.html lists the Unicode
private use areas: Area-A, which is U-F0000:U-FFFFD, and Area-B, which
is U-100000:U-10FFFD. Either will suffice as a workaround.

For my purposes I will map D800-DFFF onto F0000-F07FF (code point minus
D800, then add F0000, and the reverse when decoding) and then encode the
result as a surrogate pair... it would have been super nice if the
Unicode standard had included a way to specify a code point even if
there isn't a language character assigned to that point.
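A minimal sketch of that remapping, assuming Python and treating the function names (escape_code_point, unescape_code_point, to_surrogate_pair) as hypothetical helpers invented for illustration. D800-DFFF is 0x800 values, so the mapped range is F0000-F07FF.

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
PUA_BASE = 0xF0000  # start of Supplementary Private Use Area-A

def escape_code_point(cp):
    """Move a surrogate-range value into the private use area."""
    if SURROGATE_FIRST <= cp <= SURROGATE_LAST:
        return cp - SURROGATE_FIRST + PUA_BASE
    return cp

def unescape_code_point(cp):
    """Reverse escape_code_point for values in F0000-F07FF."""
    if PUA_BASE <= cp <= PUA_BASE + (SURROGATE_LAST - SURROGATE_FIRST):
        return cp - PUA_BASE + SURROGATE_FIRST
    return cp

def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) UTF-16 units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

# D800 becomes F0000, which is the valid pair DB80 DC00 in UTF-16.
high, low = to_surrogate_pair(escape_code_point(0xD800))
```

One caveat with this sketch: if the input can already contain real private use characters in F0000-F07FF, decoding cannot tell them apart from escaped surrogates, so that range has to be reserved or escaped separately.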

does say: "Q: Are there any 16-bit values that are invalid?

A: Unpaired surrogates are invalid in UTFs. These include any value in
the range D800 to DBFF not followed by a value in the range DC00 to
DFFF, or any value in the range DC00 to DFFF not preceded by a value
in the range D800 to DBFF."

and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?

A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired
surrogate on its own as a 3-byte sequence, the resulting UTF-8 data
stream would become ill-formed. While it faithfully reflects the
nature of the input, Unicode conformance requires that encoding form
conversion always results in a valid data stream. Therefore a converter
must treat this as an error."
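To make the quoted rule concrete, here is a rough sketch (in Python, under my own assumptions) of a UTF-16 to UTF-8 converter that treats an unpaired surrogate as an error. The function name utf16_units_to_utf8 and the choice of ValueError are illustrative only; the input is assumed to be a list of 16-bit integers.

```python
def utf16_units_to_utf8(units):
    """Convert a list of 16-bit code units to UTF-8 bytes.

    Raises ValueError on any unpaired surrogate instead of emitting
    an ill-formed 3-byte sequence.
    """
    out = bytearray()
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate {u:04X} at index {i}")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            raise ValueError(f"unpaired low surrogate {u:04X} at index {i}")
        else:
            cp = u
            i += 1
        out += chr(cp).encode("utf-8")
    return bytes(out)

# 'A' followed by the pair D83D DE00, which is U+1F600.
encoded = utf16_units_to_utf8([0x41, 0xD83D, 0xDE00])
```

Calling it with something like [0xD800, 0x41] raises instead of producing output, which is why arbitrary 16-bit data (for example, encrypted text) needs some escaping scheme before it goes through a conforming converter.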

I did see these older messages... (not that they talk about this much;
they're just more info)
Received on Sat Jan 30 2016 - 10:26:41 CST