Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 4 Jun 2014 20:21:03 +0100

On Wed, 04 Jun 2014 11:40:11 -0700
Asmus Freytag <asmusf_at_ix.netcom.com> wrote:

> On 6/4/2014 11:26 AM, Doug Ewell wrote:

> > I meant U+FEFF as a zero-width no-break space. Obviously it is very
> > common to see U+FEFF as a signature or BOM.

> The semantics of it were chosen at the time to make no sense
> at the start, and to make the character invisible in most situations.
> The remnant of its semantic was later taken up by Word Joiner, so that
> there is now NO use for this as part of text.
 
> The use as part of a convention has always been clear. If you stick
> this at the front, readers will byte-reverse your data; that should
> weed out accidental use pretty quickly :) Or prevent people from
> getting "cute" with it in other ways.

Wrong! If you stick U+FEFF at the start of a file, expect it to be
stripped. If you stick U+FFFE at the start of a file, then expect the
rest of the text to be byte-reversed.
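
A minimal Python sketch of the two behaviours, assuming the reader uses
Python's BOM-aware "utf-16" codec (other readers may behave differently;
the variable names are illustrative only):

```python
text = "\ufeffabc"  # U+FEFF followed by ordinary text

# Big-endian UTF-16 with no automatic signature: the leading U+FEFF
# is serialised as the bytes FE FF.
data_be = text.encode("utf-16-be")

# A BOM-aware reader treats those two bytes as a signature and strips
# them, so the leading U+FEFF does not survive as text.
decoded = data_be.decode("utf-16")

# The same text in little-endian order starts with the bytes FF FE.
data_le = text.encode("utf-16-le")

# A reader that assumes big-endian and does no BOM detection sees
# U+FFFE first, and every following code unit is byte-swapped.
misread = data_le.decode("utf-16-be")

print(repr(decoded))
print([hex(ord(c)) for c in misread])
```

Here `decoded` has lost the leading U+FEFF, and `misread` begins with
U+FFFE followed by code units whose bytes are reversed.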

> So, I would think that for this particular code point, you can safely
> assume that it's buggy or test data.

The example that's usually given is that of a text file sliced into
segments to avoid file size limits. In these cases, there is the risk
that U+FEFF as ZWNBSP will wind up at the start of a segment and be
stripped. The solution using the Windows command window is to perform a
*binary* concatenation of the segments; if one doesn't, newlines will
be inserted between the segments, which is much more severe damage.
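
The segment problem can be sketched in Python as follows. This is only
an illustration under stated assumptions: the text is UTF-16 with a
signature, the split point is chosen by hand so that U+FEFF lands at the
start of the second segment, and the BOM-aware "utf-16" codec stands in
for any reader that strips a leading U+FEFF.

```python
original = "foo\ufeffbar"  # U+FEFF used as ZWNBSP in the middle of text

# Encode with a signature; the byte order is whatever the platform uses.
data = original.encode("utf-16")

# Split after the signature (2 bytes) plus "foo" (3 code units, 6 bytes),
# so the second segment happens to begin with U+FEFF.
seg1, seg2 = data[:8], data[8:]

# Decoding each segment separately: the reader takes the U+FEFF at the
# start of seg2 for a signature and strips it.
per_segment = seg1.decode("utf-16") + seg2.decode("utf-16")

# Binary concatenation first, then a single decode: only the real
# signature at the very start is stripped.
rejoined = (seg1 + seg2).decode("utf-16")

print(repr(per_segment))
print(repr(rejoined))
```

Decoding segment by segment loses the ZWNBSP, while concatenating the
raw bytes before decoding preserves it.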

Richard.
Received on Wed Jun 04 2014 - 14:22:06 CDT
