Re: UTF-8 'BOM'

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Jan 20 2005 - 10:40:30 CST

    <gpw at uniserve dot com> wrote:

    >> I enjoy slagging off Microsoft as much as anyone, but this is really
    >> out of place here. Microsoft did not invent the BOM. Rather, they
    >> correctly implemented the Unicode Standard. If the Unicode Standard
    >> were different in this regard, I'm sure that MS text files would
    >> follow suit.
    >
    > This is slightly revisionist. Long, long ago there were only big-
    > endian encoding schemes with the BOM available to help detect
    > problems. Microsoft insisted on writing datafiles on Intel platforms
    > in a little-endian format. Once this practice was entrenched, the
    > standard renamed the old defined practice as big-endian, documented
    > the little-endian version and created a third with the BOM at the
    > beginning to let people cope with finding either.

    This is quite revisionist, at least the first part. My copy of Unicode
    1.0, Volume 1 (first printing, October 1991) describes the BOM as a tool
    to help detect the byte order of Unicode text and to suggest that the
    byte order be swapped in case of mismatches.

    There is a statement (p. 22) that "in Public Interchange and in the
    absence of any information to the contrary provided by a higher
    protocol, a conformant process may assume that Unicode character
    sequences it receives are in the order of the most significant byte
    first." However, the passage goes on to state that this "canonical byte
    order" was limited in scope to public interchange across different
    platforms (which in 1991 was much rarer than today), and mentions the
    use of BOM as a way for the receiving process to determine the byte
    order used by the sending process. There is no mention of "problems"
    and no implication that big-endian was the only acceptable format.
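
    As a rough illustration of the receiving side described above, here
    is a minimal Python sketch. It assumes a stream of 16-bit code units;
    the function names are hypothetical and come from no standard or
    library. When no BOM is present it falls back to most significant
    byte first, as the quoted passage permits.

```python
def detect_byte_order(data: bytes) -> tuple[str, int]:
    """Return (byte order, number of BOM bytes to skip) for 16-bit text."""
    if data[:2] == b"\xfe\xff":
        return "big", 2      # U+FEFF serialized most significant byte first
    if data[:2] == b"\xff\xfe":
        return "little", 2   # U+FEFF serialized least significant byte first
    return "big", 0          # no BOM: assume most significant byte first


def decode_16bit(data: bytes) -> str:
    """Decode 16-bit Unicode text, honoring a leading BOM if one is present."""
    order, skip = detect_byte_order(data)
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return data[skip:].decode(codec)
```

    A higher protocol that declares the byte order would take precedence
    over this guess; the sketch covers only the unlabeled case.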

    I think 1991 counts as "long, long ago" in Unicode. Maybe someone has
    information going back farther than that, perhaps Joe Becker, or perhaps
    Ken or Asmus or Rick or Mark (who were there at the beginning).

    Indeed, the real blow to BOM usability came a year later, when the
    merger with ISO/IEC 10646 (resulting in Unicode 1.1) introduced the
    overloading of U+FEFF as "zero-width no-break space." This was what
    really prevented processes from being able to strip U+FEFF blindly
    (Unicode 1.0 had encouraged this practice, though only at the beginning
    of a stream). Now that U+2060 WORD JOINER has been created to replace
    the ZWNBSP role of U+FEFF, it is possible (IMHO) that that secondary
    usage might be deprecated in the future, allowing U+FEFF to be just a
    BOM again.
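
    To make the "only at the beginning of a stream" point concrete, a
    small Python sketch follows. The helper name is hypothetical. It
    removes a single leading U+FEFF and leaves any later occurrence
    alone, since an interior U+FEFF may be meant as ZWNBSP.

```python
BOM = "\ufeff"


def strip_leading_bom(text: str) -> str:
    """Drop one U+FEFF at the start of the text; keep any interior ones."""
    return text[1:] if text.startswith(BOM) else text
```

    For example, strip_leading_bom("\ufeffab\ufeffc") returns
    "ab\ufeffc": the initial signature is gone, the interior character
    is preserved.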

    Geoffrey is correct that the *named* forms UTF-16LE and UTF-16BE,
    without BOM, and the creation of an encoding scheme called UTF-16 with
    BOM, were added many years later to reduce confusion over byte polarity
    in publicly interchanged data. But this was still not a matter of
    changing the standard to kowtow to Microsoft. Little-endian
    architectures exist in the world as well as big-endian architectures,
    and software built to run on a given architecture usually follows the
    byte order of the hardware. This basic reality goes back long before
    Unicode.
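
    The difference between the three named forms can be seen with
    Python's built-in codecs, used here purely as an illustration. The
    wrapper function is hypothetical. Note that Python's "utf-16" codec
    writes a BOM followed by data in the byte order of the machine it
    runs on, which is an example of software following the hardware.

```python
import sys


def encode_three_ways(text: str) -> dict[str, bytes]:
    """Encode text as UTF-16BE, UTF-16LE, and BOM-prefixed UTF-16."""
    return {
        "UTF-16BE": text.encode("utf-16-be"),  # no BOM, big-endian
        "UTF-16LE": text.encode("utf-16-le"),  # no BOM, little-endian
        "UTF-16": text.encode("utf-16"),       # BOM, then native byte order
    }


if __name__ == "__main__":
    for name, raw in encode_three_ways("A").items():
        print(name, raw.hex(" "))
    print("native byte order:", sys.byteorder)
```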

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


