Re: Signature for SCSU

From: Mark Davis (markdavis@ispchannel.com)
Date: Fri Jul 21 2000 - 11:12:35 EDT


Because of its usage, ZWNBSP is extremely unlikely at the start of a file,
but that doesn't mean it can't occur. A question mark is also extremely
unlikely, as are many other characters. However, they can occur. Unicode
doesn't forbid any sequence of characters from occurring. Stripping, say,
question marks from the start of all files would -- in some very rare cases
-- result in data corruption. If you absolutely require no data corruption,
you won't do it. Same with ZWNBSP.

By defining BE/LE forms of UTF-16 and UTF-32, we provide a mechanism for
people to declare that they are not using a BOM, which makes processing
simpler and more precise. There is then no question of whether to strip or add
FEFF.
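
A minimal sketch of that difference, assuming Python's built-in codecs as a stand-in for any conforming implementation (the sample bytes are made up for illustration): the unmarked "utf-16" label treats a leading FE FF as a signature and consumes it, while the explicit "utf-16-be" label treats every code unit as data, so an initial FEFF survives as a character.

```python
# Sample data (hypothetical): FE FF followed by big-endian "A".
data = b"\xfe\xff\x00A"

# Generic UTF-16: the leading FE FF is read as a BOM and dropped.
generic = data.decode("utf-16")

# Explicit UTF-16BE: no BOM is expected, so FE FF decodes to U+FEFF
# and stays in the text as an ordinary character (ZWNBSP).
explicit = data.decode("utf-16-be")

# Encoding shows the same split: the BE form never adds a signature,
# while the generic form writes one in the platform's byte order.
encoded_be = "A".encode("utf-16-be")
encoded_generic = "A".encode("utf-16")

print(repr(generic), repr(explicit), encoded_be, encoded_generic)
```

With the explicit forms, the decoder never has to guess whether the first character was meant as a signature.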

Needless to say, we all wish that FEFF were not overloaded as both ZWNBSP
and BOM. (That was one of the prices for integration with 10646, but is
ancient history at this point.) As you say, if we can deprecate the use of
FEFF as ZWNBSP, then in 5 years we can be out of this mess.

Mark

Doug Ewell wrote:

> David Starner <dvdeug@x8b4e53cd.dhcp.okstate.edu> wrote:
>
> > So an initial 1B A5 FF is or is not a BOM?
>
> That is correct, it is or is not. :-)
>
> Unfortunately, despite the recommendation in the TR, you have no
> guarantee that an initial U+FEFF intended as BOM will be encoded 0E FE
> FF while an initial U+FEFF intended as ZWNBSP will be encoded in some
> other form, such as 1B A5 FF. There are at least two reasons for this:
>
> 1. Some encoders were written before March 2000, when Section 8.4 was
> added to the TR suggesting the use of the 0E FE FF signature.
>
> 2. Not everyone who writes a compressor will choose to follow the
> recommendation anyway. It's not normative, after all.
>
> You can take advantage of the recommendation by auto-detecting a file
> that begins with 0E FE FF as SCSU-encoded, but in general, you can't
> really use the encoded result to make decisions about whether an initial
> U+FEFF represents BOM or ZWNBSP.
>
> Of course, this is just another chapter in the continuing horror story
> of the overloaded BOM. I think I asked this once before, but I will
> try again:
>
> Can *anyone* think of a reason why a file or stream should begin
> with a zero-width no-break space?
>
> Please don't tell me about a process that breaks a file into pieces
> and how a real ZWNBSP might appear at the beginning of one of those
> pieces, because such a process had better not be adding, deleting or
> changing bytes in the first place. (What if it is dealing with a JPEG
> or ZIP file instead? Stripping bytes would be much more catastrophic
> there than in a Unicode text file.) Zero-width no-break space only
> makes sense *between* characters, if you think about the definition and
> examples.
>
> If only a definition existed that said U+FEFF appearing at the *true*
> beginning of a file or stream MUST be a BOM, then people like David
> would not have to use these non-standard techniques for deciding which
> of two unrelated hats U+FEFF is wearing. Of course, if WG2 approves
> U+2060 ZERO WIDTH WORD JOINER and everyone can agree to deprecate the
> use of U+FEFF as ZWNBSP, we will be well on our way out of the mess.
>
> -Doug Ewell
> Fullerton, California
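
To make the two byte sequences in the quoted message concrete, here is a rough sketch, not a real SCSU decoder. It assumes the SCSU tag values SQU = 0E and SD0..SD7 = 18..1F and the window-offset table from the TR; the helper names looks_like_scsu and first_scsu_char are hypothetical. It handles only the first character of a stream and only these two tag types.

```python
SCSU_SIGNATURE = b"\x0e\xfe\xff"  # SQU tag followed by FE FF

def looks_like_scsu(data: bytes) -> bool:
    """Auto-detection based only on the recommended signature."""
    return data.startswith(SCSU_SIGNATURE)

def first_scsu_char(data: bytes) -> int:
    """Decode the first code point for the two initial forms discussed.

    Hypothetical helper: supports only SQU and SDn at the start of a stream.
    """
    tag = data[0]
    if tag == 0x0E:
        # SQU: the next two bytes are one big-endian UTF-16 code unit.
        return (data[1] << 8) | data[2]
    if 0x18 <= tag <= 0x1F:
        # SDn: define dynamic window n from an offset-table index and
        # select it; the following byte must then be in 80..FF.
        index = data[1]
        if 0x01 <= index <= 0x67:
            offset = index * 0x80
        elif 0x68 <= index <= 0xA7:
            offset = index * 0x80 + 0xAC00
        else:
            raise ValueError("offset index not handled in this sketch")
        byte = data[2]
        if byte < 0x80:
            raise ValueError("byte is not a dynamic-window reference")
        return offset + (byte - 0x80)
    raise ValueError("tag not handled in this sketch")

for sample in (b"\x0e\xfe\xff", b"\x1b\xa5\xff"):
    print(sample.hex(), hex(first_scsu_char(sample)), looks_like_scsu(sample))
```

Both inputs yield U+FEFF, but only the first matches the signature, which is why the encoded form alone cannot say whether the character was meant as BOM or ZWNBSP.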


