Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Wed, 04 Jun 2014 12:52:02 -0700

On 6/4/2014 12:21 PM, Richard Wordingham wrote:
> On Wed, 04 Jun 2014 11:40:11 -0700
> Asmus Freytag <asmusf_at_ix.netcom.com> wrote:
>
>> On 6/4/2014 11:26 AM, Doug Ewell wrote:
>>> I meant U+FEFF as a zero-width no-break space. Obviously it is very
>>> common to see U+FEFF as a signature or BOM.
>> Its semantics were chosen at the time so that it would make no sense
>> at the start of text, and so that it would be invisible in most
>> situations. What remained of its semantics was later taken over by
>> U+2060 WORD JOINER, so there is now NO use for it as part of text.
>
>> The use as part of a convention has always been clear. If you stick
>> this at the front, readers will byte-reverse your data; that should
>> weed out accidental use pretty quickly :) Or prevent people from
>> getting "cute" with it in other ways.
> Wrong! If you stick U+FEFF at the start of a file, expect it to be
> stripped. If you stick U+FFFE at the start of a file, then expect
> the rest of the text to be byte-reversed.
Duh. (reminder, have coffee first)

A./
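
For the record, a minimal Python sketch of the convention Richard
describes, as I understand it. The function name and the big-endian
default are my own assumptions for illustration; they are not taken
from the standard or from any particular reader.

```python
import codecs


def decode_utf16_with_bom(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16 bytes, letting a leading BOM choose the byte order.

    Hypothetical helper: a leading FE FF or FF FE is consumed as a
    signature and is not returned as part of the text.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    # No signature: fall back to the assumed default byte order.
    return data.decode(default)


# U+FEFF written big-endian is FE FF; a reader that assumes the
# opposite byte order sees the noncharacter U+FFFE instead.
misread = "\ufeff".encode("utf-16-be").decode("utf-16-le")

# A leading U+FEFF is stripped; one in the middle of the text survives.
leading = decode_utf16_with_bom("\ufeffhi".encode("utf-16-be"))
interior = decode_utf16_with_bom(
    codecs.BOM_UTF16_BE + "a\ufeffb".encode("utf-16-be")
)
```

Here `misread` is U+FFFE, `leading` is just "hi", and `interior` keeps
its U+FEFF between the two letters.
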
>
>> So, I would think that for this particular code point, you can safely
>> assume that it's buggy or test data.
> The example that's usually given is that of a text file sliced into
> segments to avoid file size limits. In these cases, there is the risk
> that U+FEFF as ZWNBSP will wind up at the start of a segment and be
> stripped. The solution using the Windows command window is to perform a
> *binary* concatenation of the segments; if one doesn't, newlines will
> be inserted between the segments, which is far more severe damage.
>
> Richard.
>
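
The segment example can be sketched in Python, assuming a UTF-16
little-endian file with a signature and a split point that happens to
fall just before a U+FEFF. The helper names and the sample text are
made up for illustration; the only library behaviour relied on is that
the "utf-16" codec consumes a leading BOM.

```python
import codecs


def join_decoded(segments):
    """Decode each segment on its own, then join the strings.

    Every segment that begins with FF FE has those bytes consumed as
    a signature, so a ZWNBSP at a segment boundary is lost.
    """
    return "".join(seg.decode("utf-16") for seg in segments)


def join_binary(segments):
    """Concatenate the raw bytes first, then decode once.

    Only the signature at the very start of the whole file is consumed.
    """
    return b"".join(segments).decode("utf-16")


original = "ab\ufeffcd"
payload = codecs.BOM_UTF16_LE + original.encode("utf-16-le")

# Split after the signature and the two code units for "ab", so the
# second segment starts with the bytes of U+FEFF.
segments = [payload[:6], payload[6:]]

lossy = join_decoded(segments)
intact = join_binary(segments)
```

With this input `lossy` comes out as "abcd", while `intact` equals the
original text including its U+FEFF.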

Received on Wed Jun 04 2014 - 14:53:21 CDT
