Re: Encoding/Use of potential unpaired UTF-16 surrogate pair specifiers

From: J Decker <>
Date: Sun, 31 Jan 2016 01:27:19 -0800

On Sun, Jan 31, 2016 at 12:21 AM, Shawn Steele
<> wrote:
> Typically XOR’ing a constant isn’t really considered worth messing with.
> It’s somewhat trivial to figure out the key to un-XOR.
Obviously. It's not constant, nor is it stored anywhere in the code or data.
> On Sat, Jan 30, 2016, 6:31 PM J Decker <> wrote:
> On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele
> <> wrote:
>> Why do you need illegal unicode code points?
> This originated from learning JavaScript, which is internally UTF-16.
> Playing with localStorage, some browsers use a sqlite3 database to
> store values. The database is UTF-8 so there must be a valid
> conversion between the internal UTF-16 and UTF-8 localStorage (and
> reverse). I wanted to obfuscate the data stored for a certain
> application; and cover all content that someone might send. Having
> slept on this, I realized that even if hieroglyphics were stored, if I
> pulled out the character using codePointAt() and applied a 20 bit
> random value to it using XOR it could end up as a normal character,
> and I wouldn't know I had to use a 20 bit value... so every character
> would have to use a 20 bit mask (which could end up with a value
> that's D800-DFFF).
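The problem described above can be reproduced with a small sketch. The helper names `xorCodePoints` and `isSurrogate` and the mask value are illustrative assumptions, not part of the application discussed in the thread; the mask is hand-picked to show one case where a supplementary-plane code point lands in the surrogate range.

```javascript
// Hypothetical helper: XOR every code point of `text` with `mask`
// and return the resulting code point values as plain numbers.
function xorCodePoints(text, mask) {
  const out = [];
  for (const ch of text) {            // for...of walks the string by code point
    out.push(ch.codePointAt(0) ^ mask);
  }
  return out;
}

// True when `cp` falls in the surrogate range D800-DFFF.
function isSurrogate(cp) {
  return cp >= 0xD800 && cp <= 0xDFFF;
}

// U+13000 (an Egyptian hieroglyph) with an illustrative mask value.
const [maskedCp] = xorCodePoints("\u{13000}", 0x1E800);
console.log(maskedCp.toString(16), isSurrogate(maskedCp)); // d800 true
```

A value like this cannot be written out as a code point in well-formed UTF-16 or UTF-8, which is what motivates the workarounds discussed in the rest of the thread.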
> I've reconsidered and, for ease of implementation, plan to just mask
> every UTF-16 code unit (not code point) with a 10-bit value. This will
> result in no character changing from BMP space to surrogate-pair or
> vice-versa.
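A rough sketch of the 10-bit approach follows. The names `maskCodeUnits`, `unitClass`, and `demoMasks` are assumptions made for illustration, and the demo mask generator is only a stand-in: the thread does not say how the real mask values are derived. XOR with a value of at most 0x3FF changes only the low 10 bits of a code unit, so a high surrogate (D800-DBFF) stays a high surrogate, a low surrogate (DC00-DFFF) stays a low surrogate, and everything else stays outside the surrogate range.

```javascript
// Hypothetical helper: XOR each UTF-16 code unit with a 10-bit value.
// `nextMask` is an assumed callback that returns the next mask value.
function maskCodeUnits(text, nextMask) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    out += String.fromCharCode(text.charCodeAt(i) ^ (nextMask() & 0x3FF));
  }
  return out;
}

// Classify a code unit as a high surrogate, a low surrogate, or neither.
function unitClass(u) {
  if (u >= 0xD800 && u <= 0xDBFF) return "high";
  if (u >= 0xDC00 && u <= 0xDFFF) return "low";
  return "bmp";
}

// Deterministic stand-in mask stream so the round trip can be replayed.
// This is NOT a secure generator; it only makes the demo repeatable.
function demoMasks() {
  let state = 0x155;
  return () => (state = (state * 5 + 1) & 0x3FF);
}

const plain = "A\u{13000}\u00E9";
const scrambled = maskCodeUnits(plain, demoMasks());
const restored = maskCodeUnits(scrambled, demoMasks());
console.log(restored === plain); // true
```

Because XOR is its own inverse, applying the same mask stream a second time restores the input. The masked text can still contain control characters such as U+0000, which some storage layers may handle poorly.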
> Thanks for the feedback.
> (sorry if I've used some terms inaccurately)
>> -----Original Message-----
>> From: Unicode [] On Behalf Of J Decker
>> Sent: Saturday, January 30, 2016 6:40 AM
>> To:
>> Subject: Encoding/Use of potential unpaired UTF-16 surrogate pair specifiers
>> I do see that the code points D800-DFFF should not be encoded in any UTF
>> format (UTF8/32)...
>> UTF8 has a way to define any byte that might otherwise be used as an
>> encoding byte.
>> UTF16 has no way to define a code point that is D800-DFFF; this is an
>> issue if I want to apply some sort of encryption algorithm and still have
>> the result treated as text for transmission and encoding to other string
>> systems.
>> lists Unicode
>> private areas Area-A which is U-F0000:U-FFFFD and Area-B which is
>> U-100000:U-10FFFD which will suffice for a workaround for my purposes....
>> For my purposes I will implement F0000-F0800 to be (code point minus
>> D800 and then add F0000 (or vice versa)) and then encoded as a surrogate
>> pair... it would have been super nice if the Unicode standards included a way to
>> specify code point even if there isn't a language character assigned to that
>> point.
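The remapping described above might look roughly like this. The function names and constants are assumptions for illustration, and the sketch reads the target range as F0000-F07FF, since D800-DFFF covers 0x800 values.

```javascript
const SURROGATE_START = 0xD800;
const SURROGATE_END = 0xDFFF;
const PUA_BASE = 0xF0000; // start of Supplementary Private Use Area-A

// Hypothetical encoder: turn numeric code point values (which may include
// D800-DFFF) into a well-formed string by shifting that range into the PUA.
function encodeCodePoints(codePoints) {
  let out = "";
  for (const cp of codePoints) {
    const inSurrogateRange = cp >= SURROGATE_START && cp <= SURROGATE_END;
    out += String.fromCodePoint(
      inSurrogateRange ? cp - SURROGATE_START + PUA_BASE : cp
    );
  }
  return out;
}

// Hypothetical decoder: shift F0000-F07FF back down to D800-DFFF.
function decodeCodePoints(text) {
  const out = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const inMappedRange = cp >= PUA_BASE && cp <= PUA_BASE + 0x7FF;
    out.push(inMappedRange ? cp - PUA_BASE + SURROGATE_START : cp);
  }
  return out;
}

const values = [0x41, 0xD800, 0xDFFF];
const text = encodeCodePoints(values);
console.log(decodeCodePoints(text)); // [ 65, 55296, 57343 ]
```

One limitation of this sketch: input that already contains code points in F0000-F07FF is indistinguishable from remapped values after encoding, so the scheme is only reversible if the original data never uses that part of the private use area.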
>> does say: "Q: Are there any 16-bit values that are invalid?
>> A: Unpaired surrogates are invalid in UTFs. These include any value in the
>> range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any
>> value in the range DC00 to DFFF not preceded by a value in the range D800 to
>> DBFF "
>> and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
>> A different issue arises if an unpaired surrogate is encountered when
>> converting ill-formed UTF-16 data. By representing such an unpaired surrogate
>> on its own as a 3-byte sequence, the resulting UTF-8 data stream would
>> become ill-formed. While it faithfully reflects the nature of the input,
>> Unicode conformance requires that encoding form conversion always results in
>> a valid data stream. Therefore a converter must treat this as an error. "
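The rule quoted above can be checked in code. This is a sketch under assumptions: `findUnpairedSurrogate` and `toUtf8Strict` are hypothetical names, and throwing a `RangeError` is just one way of treating the case as an error. The standard `TextEncoder` on its own does not throw; it substitutes U+FFFD for a lone surrogate, so the check has to happen first.

```javascript
// Return the index of the first unpaired surrogate code unit, or -1 if
// the string is well-formed UTF-16.
function findUnpairedSurrogate(text) {
  for (let i = 0; i < text.length; i++) {
    const u = text.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {
      // A high surrogate must be followed by a low surrogate.
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return i;
    }
    // A low surrogate reached here was not preceded by a high surrogate.
    if (u >= 0xDC00 && u <= 0xDFFF) return i;
  }
  return -1;
}

// Hypothetical strict converter: refuse ill-formed input instead of
// emitting ill-formed UTF-8 or silently substituting U+FFFD.
function toUtf8Strict(text) {
  const bad = findUnpairedSurrogate(text);
  if (bad !== -1) {
    throw new RangeError(`unpaired surrogate at index ${bad}`);
  }
  return new TextEncoder().encode(text);
}

console.log(findUnpairedSurrogate("a\uD800b")); // 1
console.log(toUtf8Strict("\u00E9").length);     // 2
```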
>> I did see these older messages... (not that they talk about this much just
>> more info)
Received on Sun Jan 31 2016 - 03:28:29 CST
