David Starner <firstname.lastname@example.org> wrote:
> So an initial 1B A5 FF is or is not a BOM?
That is correct, it is or is not. :-)
Unfortunately, despite the recommendation in the TR, you have no
guarantee that an initial U+FEFF intended as BOM will be encoded 0E FE
FF while an initial U+FEFF intended as ZWNBSP will be encoded in some
other form, such as 1B A5 FF. There are at least two reasons for this:
1. Some encoders were written before March 2000, when Section 8.4 was
added to the TR suggesting the use of the 0E FE FF signature.
2. Not everyone who writes a compressor will choose to follow the
recommendation anyway. It's not normative, after all.
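To make the ambiguity concrete, here is a rough Python sketch showing that both byte sequences come out as the same code point. It is not a full SCSU decoder: the hypothetical helper `decode_initial_feff` handles only a quoted Unicode character (SQU, tag 0E) and a define-window tag (SD0..SD7, tags 18..1F) with an offset byte in the range 68..A7, and it assumes the decoder is still in its initial single-byte mode.

```python
def decode_initial_feff(data: bytes):
    """Decode the first three bytes of an SCSU stream, for two forms only.

    Illustrative helper, not a general SCSU decoder. Returns the code
    point, or None if the bytes are not in one of the two handled forms.
    """
    if len(data) < 3:
        return None
    tag, b1, b2 = data[0], data[1], data[2]

    if tag == 0x0E:
        # SQU: the next two bytes are one UTF-16 code unit, big-endian.
        return (b1 << 8) | b2

    if 0x18 <= tag <= 0x1F and 0x68 <= b1 <= 0xA7 and b2 >= 0x80:
        # SDn: define dynamic window n and select it. Offset bytes in
        # 0x68..0xA7 map to a window starting at b1 * 0x80 + 0xAC00.
        window_start = b1 * 0x80 + 0xAC00
        # A following byte in 0x80..0xFF indexes into that window.
        return window_start + (b2 - 0x80)

    return None


signature_form = bytes([0x0E, 0xFE, 0xFF])  # SQU + FE FF
window_form = bytes([0x1B, 0xA5, 0xFF])     # SD3 + offset byte A5 + FF

print(hex(decode_initial_feff(signature_form)))
print(hex(decode_initial_feff(window_form)))
```

Both calls yield U+FEFF, so nothing in the decoded text records which form the encoder chose.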
You can take advantage of the recommendation by auto-detecting a file
that begins with 0E FE FF as SCSU-encoded, but in general, you can't
really use the encoded result to make decisions about whether an initial
U+FEFF represents BOM or ZWNBSP.
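A minimal sketch of that auto-detection in Python follows. The function name `looks_like_scsu` is made up for illustration, and the check is only a heuristic: a match suggests SCSU, while a miss proves nothing, since an encoder is free to omit the signature or to encode U+FEFF some other way.

```python
SCSU_SIGNATURE = bytes([0x0E, 0xFE, 0xFF])  # SQU followed by FE FF


def looks_like_scsu(data: bytes) -> bool:
    """Return True if the data begins with the recommended SCSU signature.

    This says nothing about whether the initial U+FEFF was meant as a
    BOM or as a ZWNBSP; it only reports the byte pattern.
    """
    return data.startswith(SCSU_SIGNATURE)


print(looks_like_scsu(bytes([0x0E, 0xFE, 0xFF, 0x48, 0x69])))  # signature present
print(looks_like_scsu(bytes([0x1B, 0xA5, 0xFF, 0x48, 0x69])))  # same U+FEFF, other form
```

The second input also starts with an encoded U+FEFF, yet the check rejects it, which is exactly the limitation described above.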
Of course, this is just another chapter in the continuing horror story
of the overloaded BOM. I think I asked this once before, but I will
ask again:
Can *anyone* think of a reason why a file or stream should begin
with a zero-width no-break space?
Please don't tell me about a process that breaks a file into pieces
and how a real ZWNBSP might appear at the beginning of one of those
pieces, because such a process had better not be adding, deleting or
changing bytes in the first place. (What if it is dealing with a JPEG
or ZIP file instead? Stripping bytes would be much more catastrophic
there than in a Unicode text file.) Zero-width no-break space only
makes sense *between* characters, if you think about the definition and
If only a definition existed that said U+FEFF appearing at the *true*
beginning of a file or stream MUST be a BOM, then people like David
would not have to use these non-standard techniques for deciding which
of two unrelated hats U+FEFF is wearing. Of course, if WG2 approves
U+2060 ZERO WIDTH WORD JOINER and everyone can agree to deprecate the
use of U+FEFF as ZWNBSP, we will be well on our way out of the mess.
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:06 EDT