RE: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

From: Shawn Steele <>
Date: Wed, 4 Jun 2014 18:10:05 +0000

I've seen the BOM (though not FFFE); its prevalence depends on the system and other factors.

The others I only see if there's corruption, bugs, or tests. The most common error I see that causes those is when some developer calls a binary blob a Unicode string and tries to shove it through a text transport or something. Usually that bites them sooner or later.
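
For anyone who wants to measure this on their own data, here is a rough sketch of a scan for the three cases in the question quoted below. It is only an illustration: the names `is_noncharacter` and `scan` are made up for this example, and it assumes the input is already a Python `str`, where any code point in the range U+D800..U+DFFF is by construction a lone surrogate.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def scan(text: str) -> dict:
    """Count the three corner cases in an already-decoded string (hypothetical helper)."""
    return {
        # U+FEFF counted only at the very start of the stream
        "leading_bom": text.startswith("\ufeff"),
        # In a Python str, a surrogate code point is never part of a valid pair
        "unpaired_surrogates": sum(1 for ch in text if 0xD800 <= ord(ch) <= 0xDFFF),
        "noncharacters": sum(1 for ch in text if is_noncharacter(ord(ch))),
    }
```

A binary blob pushed through a text path would typically show up here as nonzero surrogate or noncharacter counts, although a clean result does not prove the data was ever really text.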


-----Original Message-----
From: Unicode [] On Behalf Of Doug Ewell
Sent: Wednesday, June 4, 2014 11:01 AM
Subject: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

How common is it to see any of the following in real-world Unicode text, as opposed to code charts and test suites and the like?

1. Unpaired surrogates
2. Noncharacters (besides CLDR data)
3. U+FEFF at the beginning of a stream (note: not "packet" or arbitrary cutoff point)

I'm not asking whether any of these are recommended or "prohibited" or whether they are a good idea. I'm asking about actual usage.

Doug Ewell | Thornton, CO, USA | @DougEwell
Unicode mailing list
Received on Wed Jun 04 2014 - 13:11:00 CDT