Re: UTF-8N?

From: Doug Ewell (dewell@compuserve.com)
Date: Fri Jun 23 2000 - 03:00:58 EDT


Kenneth Whistler <kenw@sybase.com> wrote:

>> It all stems from the fact that U+FEFF is not only what is used for
>> the BOM, but also a valid Unicode/ISO 10646 codepoint. The issue
>> would be solved by deprecating the use of U+FEFF as a Unicode
>> character (for example by defining a new codepoint for ZWNBSP), and
>> using U+FEFF for the BOM only.
>
> The UTC has *already* done this.
>
> U+2060 ZERO WIDTH WORD JOINER
>
> This is intended to take the function overloaded on U+FEFF as a zero-
> width non-breaking space, leaving U+FEFF to function *only* as the
> byte-order-mark.

Hooray! I'm absolutely delighted to see that the UTC has taken this
step. This could very well be the solution to our BOM problems. I hope
WG2 approves this sanity-saving proposal quickly.

I had seen the "pipeline" reference to U+2060 before, but did not draw
the connection between "ZERO WIDTH WORD JOINER" and the ZWNBSP function,
and certainly did not envision that U+FEFF would be deprecated as a
ZWNBSP.

> The decision by UTC to disunify these bizarrely unified functions,
> however, should allow us to gradually dig our way out of this mess.
>
> If you can at all help it, start refraining now from using U+FEFF as a
> zero-width non-breaking space.

I would (informally) suggest that not only should U+FEFF-as-ZWNBSP be
deprecated, it should be *strongly* discouraged as the first character
of a file or stream. That is where 99% of the pain seems to come from:
whether to treat an *initial* U+FEFF as BOM or ZWNBSP. We have already
established (technically, I contended and nobody disputed) that ZWNBSP,
whatever its encoding, only makes real sense when placed *between* two
characters that are supposed to appear together without an intervening
visible space or line break.

The Unicode Standard has set a precedent for "strongly discouraging"
characters with the statement regarding the format characters U+206A
through U+206F, in Section 13.3 (page 320) of The Book.

One comment about the BOM: Every time the subject of BOM comes up,
someone who hasn't followed previous threads points out that UTF-8
doesn't have byte-order problems and so it doesn't need a BOM, and
That Terrible, Awful Non-POSIX Operating System Company is forcing
something on Unicode users that makes no sense. But if I may... the
reality is that files encoded in ISO 8859-1, Windows-1252, and other
single-byte character sets will continue to be with us for quite a
while, and U+FEFF as a SIGNATURE (stop worrying about the exact words
BYTE ORDER MARK for a minute) can serve as a quick and useful way to
distinguish UTF-8 files from SBCS files.
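
To make the signature idea concrete, here is a minimal sketch in Python.
The function names and the ISO 8859-1 fallback are my own assumptions for
illustration, not anything the standard prescribes. It only checks whether
the data begins with the three bytes EF BB BF, which is U+FEFF encoded in
UTF-8:

```python
import codecs

def has_utf8_signature(data: bytes) -> bool:
    """Return True if data starts with U+FEFF encoded as UTF-8 (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Hypothetical helper: say 'utf-8' when the signature is present,
    otherwise return the caller's fallback (assumed here to be an SBCS)."""
    return "utf-8" if has_utf8_signature(data) else fallback
```

Note that a missing signature proves nothing: plenty of UTF-8 files carry
none, so the fallback is only a guess.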

Now that Unicode plans to deprecate the use of U+FEFF as ZWNBSP,
programs that *expect* UTF-8 instead of SBCS will be able to throw away
an initial U+FEFF with even greater confidence. It may even be possible
for operating system developers to build this in at the OS level: open
a UTF-8 text file; read characters; if the very first character in the
file was U+FEFF then eat it. Applications would never even see it.
How cool would that be?
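
The "eat it" step might look something like the sketch below, again in
Python. The function name is hypothetical, and the sketch assumes the
caller already knows the data is UTF-8. Only a U+FEFF in the very first
position is dropped; one that appears later in the text is left alone:

```python
BOM = "\ufeff"  # U+FEFF

def decode_utf8_dropping_signature(data: bytes) -> str:
    """Decode UTF-8 bytes and discard a single leading U+FEFF, if present."""
    text = data.decode("utf-8")
    if text.startswith(BOM):
        # Remove only the initial signature character.
        text = text[len(BOM):]
    return text
```

Python's built-in "utf-8-sig" codec does the same thing on decoding, so
data.decode("utf-8-sig") would serve equally well here.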

-Doug Ewell
 Fullerton, California


