Re: Signature for SCSU

From: Doug Ewell (dewell@compuserve.com)
Date: Fri Jul 21 2000 - 10:23:36 EDT


David Starner <dvdeug@x8b4e53cd.dhcp.okstate.edu> wrote:

> So an initial 1B A5 FF is or is not a BOM?

That is correct, it is or is not. :-)

Unfortunately, despite the recommendation in the TR, you have no
guarantee that an initial U+FEFF intended as BOM will be encoded 0E FE
FF while an initial U+FEFF intended as ZWNBSP will be encoded in some
other form, such as 1B A5 FF. There are at least two reasons for this:

1. Some encoders were written before March 2000, when Section 8.4 was
   added to the TR suggesting the use of the 0E FE FF signature.

2. Not everyone who writes a compressor will choose to follow the
   recommendation anyway. It's not normative, after all.
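
To make the two forms concrete, here is a minimal sketch in Python of how
each byte sequence decodes. It assumes the tag values and window-offset
rule from the TR (0E is SQU, 18-1F are SD0-SD7) and handles only those two
tags; the function name is hypothetical, and this is nowhere near a full
SCSU decoder.

```python
def decode_leading_char(data: bytes) -> int:
    """Decode the first character of an SCSU stream (SQU and SDn only)."""
    tag = data[0]
    if tag == 0x0E:
        # SQU: the next two bytes are one big-endian UTF-16 code unit.
        return (data[1] << 8) | data[2]
    if 0x18 <= tag <= 0x1F:
        # SDn: define dynamic window n from the offset byte and select it.
        x = data[1]
        if 0x01 <= x <= 0x67:
            offset = x * 0x80
        elif 0x68 <= x <= 0xA7:
            offset = x * 0x80 + 0xAC00
        else:
            raise ValueError("offset byte not handled in this sketch")
        b = data[2]
        if b < 0x80:
            raise ValueError("only dynamic-window bytes are handled here")
        # Bytes 80-FF map into the selected window.
        return offset + (b - 0x80)
    raise ValueError("tag not handled in this sketch")


signature_form = decode_leading_char(bytes([0x0E, 0xFE, 0xFF]))
window_form = decode_leading_char(bytes([0x1B, 0xA5, 0xFF]))
print(hex(signature_form), hex(window_form))
```

Both calls yield the same code point, so nothing in the decoded text
records which form the encoder chose.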

You can take advantage of the recommendation by auto-detecting a file
that begins with 0E FE FF as SCSU-encoded, but in general, you can't
really use the encoded result to make decisions about whether an initial
U+FEFF represents BOM or ZWNBSP.
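
The auto-detection half of that is trivial. A rough sketch in Python (the
names are my own, not from the TR), which treats the signature as a hint
and nothing more:

```python
SCSU_SIGNATURE = b"\x0e\xfe\xff"  # SQU followed by U+FEFF


def looks_like_scsu(data: bytes) -> bool:
    """Return True if the data begins with the recommended SCSU signature.

    A False result does NOT mean the data is not SCSU: an encoder may
    have written U+FEFF some other way, or omitted it entirely.
    """
    return data.startswith(SCSU_SIGNATURE)


print(looks_like_scsu(b"\x0e\xfe\xffHello"))  # begins with the signature
print(looks_like_scsu(b"\x1b\xa5\xffHello"))  # same U+FEFF, other form
```

The second input is perfectly good SCSU beginning with U+FEFF, yet the
check reports False, which is exactly why the test only works in one
direction.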

Of course, this is just another chapter in the continuing horror story
of the overloaded BOM. I think I asked this once before, but I will
try again:

    Can *anyone* think of a reason why a file or stream should begin
    with a zero-width no-break space?

Please don't tell me about a process that breaks a file into pieces
and how a real ZWNBSP might appear at the beginning of one of those
pieces, because such a process had better not be adding, deleting or
changing bytes in the first place. (What if it is dealing with a JPEG
or ZIP file instead? Stripping bytes would be much more catastrophic
there than in a Unicode text file.) Zero-width no-break space only
makes sense *between* characters, if you think about the definition and
examples.

If only a definition existed that said U+FEFF appearing at the *true*
beginning of a file or stream MUST be a BOM, then people like David
would not have to use these non-standard techniques for deciding which
of two unrelated hats U+FEFF is wearing. Of course, if WG2 approves
U+2060 ZERO WIDTH WORD JOINER and everyone can agree to deprecate the
use of U+FEFF as ZWNBSP, we will be well on our way out of the mess.

-Doug Ewell
 Fullerton, California


