Re: UTF-8N?

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jun 22 2000 - 14:03:47 EDT


Juliusz wrote:

> The problem is not one of broken software. The problem is that, as
> John Cowan explained in detail, with the addition of the BOM, UTF-8
> and UTF-16 become ambiguous.

This is putting the cart before the horse.

The U+FEFF BOM existed in Unicode 1.0, and was carried into ISO/IEC 10646-1:1993
(before any amendments). That *predates* the existence of either UTF-8
or UTF-16.

The existence and usage of the BOM was not taken fully into account
when UTF-8 and UTF-16 were defined and standardized (in Amendments 1
and 2 to 10646-1:1993), and we are *still* dealing with trying to
sort out all the implications of their interactions. (...And of the
relationship between the encoding forms and the explicitly named
encoding *schemes*, UTF-16BE and UTF-16LE, which were invented even
later.)
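To make the form/scheme distinction concrete, here is a small illustration (mine, not from the original post) of how the single code point U+FEFF serializes under three encoding schemes -- which is precisely what lets a leading BOM reveal byte order:

```python
# U+FEFF serialized under three Unicode encoding schemes.
# The distinct byte patterns are what make it usable as a byte order mark.
bom_utf8 = "\ufeff".encode("utf-8")      # b'\xef\xbb\xbf' - no byte order to signal
bom_be   = "\ufeff".encode("utf-16-be")  # b'\xfe\xff'     - big-endian
bom_le   = "\ufeff".encode("utf-16-le")  # b'\xff\xfe'     - little-endian

print(bom_utf8.hex(), bom_be.hex(), bom_le.hex())
```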

>
> It all stems from the fact that U+FEFF is not only what is used for
> the BOM, but also a valid Unicode/ISO 10646 codepoint. The issue
> would be solved by deprecating the use of U+FEFF as a Unicode
> character (for example by defining a new codepoint for ZWNBSP), and
> using U+FEFF for the BOM only.

The UTC has *already* done this.

U+2060 WORD JOINER

This character is intended to take over the function overloaded onto U+FEFF as a
zero-width no-break space, leaving U+FEFF to function *only* as the byte order mark.

This whole fiasco could have been avoided if WG2 had accepted BYTE ORDER MARK
as the name (and implied function) of U+FEFF back in 1991, but in the
rush to consummate the merger of the standards, a compromise name (and
implied second function) of ZERO WIDTH NO-BREAK SPACE was accepted by WG2.
No one at the 1991 and 1992 WG2 meetings anticipated that this overloading
would end up causing so many people so much grief -- but it happened, and
we are having to live with it.

The decision by UTC to disunify these bizarrely unified functions, however,
should allow us to gradually dig our way out of this mess.

If you can at all help it, stop using U+FEFF as a zero-width no-break
space now. You cannot yet conformantly use U+2060 in that function, but
the UTC is pushing hard to get that character (along with other new
characters) into the first amendment to ISO/IEC 10646-1:2000, so that we
can have a rational resolution of this problem.

> The standard could then say that
> applications should discard all occurrences of U+FEFF when reading a
> file, and allow applications to insert U+FEFF at arbitrary points when
> writing a Unicode file.

Insertion of U+FEFF at arbitrary positions is *still* not a good
practice, even with the disunification of functions for the character.
Doing so could still screw up binary string comparisons, create a
rash of buffer overrun bugs in software, destroy checksums, and
render digital signatures unreliable. Be *very* careful of the context
in which you prepend U+FEFF to Unicode data, and don't overuse it.
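A minimal sketch (my own illustration, not from the original post) of why stray U+FEFF insertions are dangerous, and of the safer convention of treating U+FEFF only as a signature at the very start of the data:

```python
# Two strings that render identically but compare unequal because one
# carries a stray U+FEFF in its interior.
a = "hello world"
b = "hello\ufeff world"   # same visible text, hidden U+FEFF inside
print(a == b)             # False: binary comparison fails

# Safer convention: honor U+FEFF only as a leading signature, and strip
# just that single occurrence when reading.
def strip_bom(text: str) -> str:
    return text[1:] if text.startswith("\ufeff") else text

print(strip_bom("\ufeffhello"))  # "hello"
```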

>
> I suspect that deprecating U+FEFF is not politically acceptable for
> Unicode and ISO 10646, though.

See discussion above.

> Just as uninteresting and just as annoying. The difference being that
> we've had over twenty years to learn to deal with CR/LF mismatches
> (and fixed-length records, and Fortran carriage control). The BOM
> issue opens a whole new area to make new mistakes in.

Exactly. And boy, people are starting to make those mistakes -- as
was perfectly predictable.

>
> (Who should I contact to register ``UCS-4PDP11'', the mixed-endian
> form of UCS-4?)

Prof. Bit Bucket, the patron saint of IT lost causes.

--Ken



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:04 EDT