Re: UTF-8 'BOM'

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Jan 20 2005 - 10:40:30 CST

Next message: Rick McGowan: "Re: UTF-8 'BOM'"

Previous message: Rick McGowan: "Re: 32'nd bit & UTF-8"
In reply to: gpw@uniserve.com: "Re: UTF-8 'BOM'"
Next in thread: Hans Aberg: "Re: UTF-8 'BOM'"

<gpw at uniserve dot com> wrote:

>> I enjoy slagging off Microsoft as much as anyone, but this is really
>> out of place here. Microsoft did not invent the BOM. Rather, they
>> correctly implemented the Unicode Standard. If the Unicode Standard
>> were different in this regard, I'm sure that MS text files would
>> follow suit.
>
> This is slightly revisionist. Long, long ago there were only big-
> endian encoding schemes with the BOM available to help detect
> problems. Microsoft insisted on writing datafiles on Intel platforms
> in a little-endian format. Once this practice was entrenched, the
> standard renamed the old defined practice as big-endian, documented
> the little-endian version and created a third with the BOM at the
> beginning to let people cope with finding either.

This is quite revisionist, at least the first part. My copy of Unicode
1.0, Volume 1 (first printing, October 1991) describes the BOM as a tool
to help detect the byte order of Unicode text and to suggest that the
byte order be swapped in case of mismatches.

There is a statement (p. 22) that "in Public Interchange and in the
absence of any information to the contrary provided by a higher
protocol, a conformant process may assume that Unicode character
sequences it receives are in the order of the most significant byte
first." However, the passage goes on to state that this "canonical byte
order" was limited in scope to public interchange across different
platforms (which in 1991 was much rarer than today), and mentions the
use of BOM as a way for the receiving process to determine the byte
order used by the sending process. There is no mention of "problems"
and no implication that big-endian was the only acceptable format.
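
To make the receiving side concrete, here is a rough sketch in Python of
the behavior described above: look at the first two bytes for a BOM, and
fall back to most-significant-byte-first when none is present. The
function name is hypothetical and the code is an illustration only, not
text from Unicode 1.0. It handles 16-bit data and ignores any
higher-level protocol.

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of 16-bit Unicode data (illustrative helper).

    Returns "big" or "little". With no BOM present, this assumes
    big-endian, mirroring the "canonical byte order" default for
    public interchange quoted above.
    """
    prefix = data[:2]
    if prefix == b"\xfe\xff":   # U+FEFF serialized most significant byte first
        return "big"
    if prefix == b"\xff\xfe":   # U+FEFF seen byte-swapped: sender was little-endian
        return "little"
    return "big"                # no BOM: assumed default, not a guarantee


# Usage: the same letter "A" written by two different senders.
print(detect_utf16_byte_order(b"\xfe\xff\x00A"))
print(detect_utf16_byte_order(b"\xff\xfeA\x00"))
```

The point of the swapped value is that U+FFFE is not a character, so a
receiver that sees FF FE first can tell the data is in the other order.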

I think 1991 counts as "long, long ago" in Unicode. Maybe someone has
information going back farther than that, perhaps Joe Becker, or perhaps
Ken or Asmus or Rick or Mark (who were there at the beginning).

Indeed, the real blow to BOM usability came a year later, when the
merger with ISO/IEC 10646 (resulting in Unicode 1.1) introduced the
overloading of U+FEFF as "zero-width no-break space." This was what
really prevented processes from being able to strip U+FEFF blindly
(Unicode 1.0 had encouraged this practice, though only at the beginning
of a stream). Now that U+2060 WORD JOINER has been created to replace
the ZWNBSP role of U+FEFF, it is possible (IMHO) that that secondary
usage might be deprecated in the future, allowing U+FEFF to be just a
BOM again.
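
As a small illustration of why blind stripping is a problem, the sketch
below (Python, with a hypothetical helper name) removes U+FEFF only when
it is the first character of already-decoded text. Any later U+FEFF is
left alone, since it may have been intended as a zero-width no-break
space.

```python
BOM = "\ufeff"          # U+FEFF, byte order mark / ZWNBSP
WORD_JOINER = "\u2060"  # U+2060, the newer character for the joining role


def strip_leading_bom(text: str) -> str:
    """Drop a single U+FEFF at the very start of the text, nothing else."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text


# Usage: the leading U+FEFF goes away, the interior one survives.
sample = BOM + "ab" + BOM + "c"
cleaned = strip_leading_bom(sample)
print(len(sample), len(cleaned))
```

A process that instead deleted every U+FEFF would silently change text
that relied on the ZWNBSP meaning.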

Geoffrey is correct that the *named* forms UTF-16LE and UTF-16BE,
without BOM, and the creation of an encoding scheme called UTF-16 with
BOM, were added many years later to reduce confusion over byte polarity
in publicly interchanged data. But this was still not a matter of
changing the standard to kowtow to Microsoft. Little-endian
architectures exist in the world as well as big-endian architectures,
and software built to run on a given architecture usually follows the
byte order of the hardware. This basic reality goes back long before
Unicode.
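
For anyone who wants to see the three encoding schemes side by side, the
following sketch uses Python's built-in codecs as one example
implementation. The codec names "utf-16-be", "utf-16-le" and "utf-16" are
Python's spellings; other libraries may behave differently, so treat
this as an illustration rather than a statement about the standard.

```python
import sys

text = "A"

# The named forms carry no BOM; the byte order is fixed by the label.
be = text.encode("utf-16-be")   # b"\x00A"
le = text.encode("utf-16-le")   # b"A\x00"

# Plain "utf-16" writes a BOM first, in the byte order of this machine.
with_bom = text.encode("utf-16")
native_bom = b"\xfe\xff" if sys.byteorder == "big" else b"\xff\xfe"

# On input, the "utf-16" decoder reads the BOM, picks the order, and drops it.
from_be = b"\xfe\xff\x00A".decode("utf-16")
from_le = b"\xff\xfeA\x00".decode("utf-16")

# A named-form decoder does not treat U+FEFF specially; it stays in the text.
kept = b"\xfe\xff\x00A".decode("utf-16-be")

print(be, le, with_bom.startswith(native_bom), from_be, from_le, repr(kept))
```

Note that the BOM written by "utf-16" follows the hardware, which is the
same "software follows the byte order of the machine" habit described
above.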

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

