Microsoft Unicode Text File Byte Order Mark (BOM) FAQ

From: Shlomi Tal (shlompi@hotmail.com)
Date: Mon Apr 08 2002 - 13:23:19 EDT


This is a FAQ answer-list of mine regarding the interchange of Unicode text
files between operating systems, particularly the issue of the BOM. It has
already been posted in comp.std.internat, reviewed (by Markus Kuhn, among
others) and amended. I hope this will be of use to someone.

--- BEGIN ---

by Shlomi Tal (shlompi@hotmail.com)

Contents

1. What is a BOM?
2. Why does it matter?
3. Is the BOM mandatory or optional?
---------------------------------------------------------------------

1. What is a BOM?
^^^^^^^^^^^^^^^^^

BOM, or Byte-Order Mark, is a signature at the beginning of a Unicode
text file. Since processors differ in the order in which they store
the bytes of a multi-byte value, the BOM is used to mark which byte
order the text file was written in.

Processors are either big-endian or little-endian. The former put the
most significant byte first, and the latter put the least significant
byte first. Thus the 16-bit number 0x071F is serialized as:

Big-endian 07 1F
Little-endian 1F 07

Obviously a code with the value 0x071F will be interpreted as 0x1F07
if it passes to a processor of a different byte order without
information about its original order. This is what the Unicode BOM
seeks to avoid.
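
The byte-swapping described above can be reproduced in a few lines of
Python. This is only an illustrative sketch; the variable names are
made up for the example.

```python
# Serialize the 16-bit number 0x071F in both byte orders.
value = 0x071F

big = value.to_bytes(2, byteorder="big")
little = value.to_bytes(2, byteorder="little")

print(big.hex(" "))     # 07 1f
print(little.hex(" "))  # 1f 07

# Reading big-endian bytes as if they were little-endian swaps the value.
misread = int.from_bytes(big, byteorder="little")
print(hex(misread))     # 0x1f07
```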

The Unicode standard permits the character U+FEFF (Zero Width
No-Break Space) at the beginning of the file as a mark for the byte
order of the file. A Unicode text file beginning with the bytes FE FF
is big-endian, and a file beginning with FF FE (which would read as
U+FFFE, not a legal Unicode character) is little-endian.

All this is relevant to the 16-bit and 32-bit encodings of Unicode
characters - UTF-16 and UTF-32 respectively. Thus:

FE FF is UTF-16 Big-Endian
FF FE is UTF-16 Little-Endian
00 00 FE FF is UTF-32 Big-Endian
FF FE 00 00 is UTF-32 Little-Endian
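
As a rough sketch, the table above can be turned into a small
detection routine. The function name sniff_bom and the BOMS list are
invented for this example; the byte constants come from Python's
standard codecs module. The UTF-32 marks are tested first because
FF FE 00 00 begins with the UTF-16 Little-Endian mark FF FE.

```python
import codecs

# Longer signatures first: the UTF-32 LE mark (FF FE 00 00) starts
# with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 Big-Endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 Little-Endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 Big-Endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 Little-Endian"),
]

def sniff_bom(data: bytes):
    """Return the encoding name signalled by a leading BOM, or None."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None

print(sniff_bom(b"\xfe\xff\x00A"))          # UTF-16 Big-Endian
print(sniff_bom(b"\xff\xfe\x00\x00A\x00\x00\x00"))  # UTF-32 Little-Endian
```

Note that this guess is not watertight: a UTF-16 Little-Endian file
whose first character after the BOM is U+0000 would also begin with
FF FE 00 00.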

There is another, very common Unicode encoding scheme called UTF-8,
which maps the Unicode repertoire into sequences of bytes. Since the
order of bytes (as opposed to words of more than one byte) is the same
for all processors, UTF-8 does not require a BOM. It can have one,
though.

In addition, a Unicode encoding scheme named UTF-7, which was meant as
a mail-safe encoding but is now nearly obsolete, can have a BOM as
well. Here too the BOM is not mandatory.

2. Why does it matter?
^^^^^^^^^^^^^^^^^^^^^^

It matters because Microsoft tools (most prominently Windows Notepad)
prefix the BOM to Unicode text files regularly, whereas other systems
and environments (Unix, Linux, web pages) are better off without the
BOM, especially in the case of UTF-8 text files.

Unix systems, for example, look for an initial #! in a script file in
order to determine the interpreter for it. A BOM placed before the #!
breaks this convention, because the file no longer begins with those
two bytes. Also, and this applies particularly to databases, and not
only in Unix, the BOM can cause disorder when files are merged, since
the BOM of each later file ends up in the middle of the merged text.
Web pages usually use UTF-8, and although they can handle the BOM, it
may appear as a strange character (a blank square or a question mark)
on a browser that doesn't recognize it, and may also cause the above
troubles when the file is saved to the local disk.
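
The #! problem can be seen at the byte level. The following sketch
uses a made-up two-line shell script and simply checks what the file
starts with; it does not run anything.

```python
import codecs

# A hypothetical script, once without and once with a UTF-8 BOM.
script = b"#!/bin/sh\necho hello\n"
with_bom = codecs.BOM_UTF8 + script

print(script.startswith(b"#!"))    # True
print(with_bom.startswith(b"#!"))  # False
print(with_bom[:5])                # b'\xef\xbb\xbf#!'
```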

Most of the Unicode text meant for open transfer between various
systems (and the Web) is encoded in UTF-8. Unix systems normally write
UTF-8 text files without the BOM, but Windows systems prefix the BOM
as a matter of course. Here follows an explanation of when the Unicode
BOM can or cannot be removed from text files on Microsoft Windows
systems.

3. Is the BOM mandatory or optional?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Microsoft Windows, beginning with the Unicode-supporting operating
systems Windows 2000 and Windows XP, can handle UTF-16 Little-Endian,
UTF-16 Big-Endian, UTF-8 and old 8-bit "ANSI" (Microsoft's
non-standard name for its 8-bit Windows codepages, consisting of the
ASCII repertoire for the first 128 characters and varying characters
for the other 128). The native encoding for these systems is UTF-16
Little-Endian, which when saving under Notepad is called "Unicode".
UTF-16 Big-Endian is called "Unicode Big-Endian", and UTF-8 keeps its
name.

Upon saving a Unicode text file in Notepad, the BOM is always
prefixed. Thus, if you open such a file with a text editor which is
not Unicode-aware (such as edit.com) or take a hexdump of it, you will
see UTF-16 Little-Endian ("Unicode") starting with FF FE, UTF-16
Big-Endian ("Unicode Big-Endian") starting with FE FF, and UTF-8
starting with the UTF-8 encoding of the BOM: EF BB BF.
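
For readers without a hexdump tool at hand, the three signatures can
be modelled in Python. This sketch does not involve Notepad itself; it
only builds the byte sequences described above for the one-letter text
"A", using the standard codecs.

```python
import codecs

text = "A"

# "Unicode" in Notepad's terms: BOM followed by UTF-16 Little-Endian.
as_unicode = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
# "Unicode Big-Endian": BOM followed by UTF-16 Big-Endian.
as_big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
# UTF-8 with signature: Python's "utf-8-sig" codec prefixes EF BB BF.
as_utf8 = text.encode("utf-8-sig")

print(as_unicode.hex(" "))     # ff fe 41 00
print(as_big_endian.hex(" "))  # fe ff 00 41
print(as_utf8.hex(" "))        # ef bb bf 41
```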

For the first two encoding schemes (UTF-16), the user MUST NOT remove
the BOM manually. Removing the BOM using an external tool (such as
edit.com) and then opening the file with Notepad will reveal a pile of
gibberish. Then, saving the file will corrupt it beyond recovery. This
is because the BOM is what tells the system to read the file as 16-bit
values rather than as a sequence of 8-bit characters. Without the BOM,
each byte forming part of a 16-bit Unicode character is interpreted on
its own as an 8-bit character, which may be a control character. Many
of these are transcoded into graphic ASCII characters when the file is
saved again, and thus the original text is lost. Since UTF-16 text
files are not meant for open transfer anyway, this is not an important
issue. As for database applications and other situations where text
files are merged, a Unicode-aware application should be able to
discard every U+FEFF character after the first.
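
Both effects can be sketched in Python. The helper with_bom is
invented for this example, and Windows-1252 is assumed as the 8-bit
codepage. The first part shows UTF-16 bytes read as 8-bit characters;
the second shows a stray U+FEFF appearing in the middle of two merged
files and being discarded. Dropping every U+FEFF this way also removes
any that were meant as real Zero Width No-Break Spaces, so treat it as
an illustration rather than a general-purpose tool.

```python
import codecs

def with_bom(text: str) -> bytes:
    """Hypothetical helper: UTF-16 Little-Endian bytes with a BOM."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# 1) UTF-16 bytes read as 8-bit text: every other byte is a NUL control.
payload = "Hi".encode("utf-16-le")
print(payload.hex(" "))             # 48 00 69 00
print(repr(payload.decode("cp1252")))  # 'H\x00i\x00'

# 2) Merging two files byte-for-byte leaves the second BOM in the text.
merged = with_bom("ab") + with_bom("cd")
decoded = merged.decode("utf-16")   # the leading BOM is consumed here
print(repr(decoded))                # 'ab\ufeffcd'

cleaned = decoded.replace("\ufeff", "")
print(cleaned)                      # abcd
```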

For UTF-8, Windows Notepad prefixes the sequence EF BB BF, but it is
not mandatory. The sequence does not signal byte-order, but just that
the file is in UTF-8 encoding, and strictly speaking is not necessary
at all. In fact, Notepad can identify a text file as UTF-8 if it
contains no illegal UTF-8 sequences. One Latin-1 accented European
vowel standing alone in the text already prevents the text from being
recognized as UTF-8. See for yourself: type ALT+0206 ALT+0177 (that
is, those numbers with the ALT key held) on an empty text file, save
and close it. The two bytes written, CE B1, happen to be the UTF-8
encoding of U+03B1, so the next time you open the file you will see a
Greek small letter alpha in it - the file has been taken for UTF-8,
though the BOM has not yet been added. Writing more and saving the
file a second time will cause the BOM to be prefixed.
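
The experiment can be mirrored in Python. This is a sketch under two
assumptions: that the "ANSI" codepage is Windows-1252, and that a
plain strict decode is a fair stand-in for the validity check (the
function looks_like_utf8 is invented here and is not Notepad's actual
algorithm).

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical check: True if the bytes form valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# ALT+0206 ALT+0177 produce the bytes CE B1.
data = bytes([206, 177])
print(data.decode("cp1252"))  # Î±
print(data.decode("utf-8"))   # α
print(looks_like_utf8(data))  # True

# A lone Latin-1 accented vowel (E9, e-acute) is not valid UTF-8.
print(looks_like_utf8(b"caf\xe9"))  # False
```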

Thus, when writing UTF-8 files for open transfer, it is best to keep
the BOM until the text file is complete, and then the BOM can be
safely removed (the author does so for all his HTML files: writing
with the BOM until completion, then removing it using the Vim editor,
which since version 6.0 can handle UTF-8). Upon making further changes
to the file, remember to remove the BOM again.

So the rules are:

1) Do not remove the BOM (FF FE or FE FF) from UTF-16 files.
2) Removing the BOM (EF BB BF) from UTF-8 is allowed.
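
Rule 2 can also be applied without an editor. The following is a
minimal sketch; the function name strip_utf8_bom is made up for the
example, and it deliberately touches only the UTF-8 signature, leaving
the UTF-16 marks of rule 1 alone.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading EF BB BF, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

print(strip_utf8_bom(b"\xef\xbb\xbf<html>"))  # b'<html>'
print(strip_utf8_bom(b"<html>"))              # b'<html>'
print(strip_utf8_bom(b"\xff\xfeA\x00"))       # b'\xff\xfeA\x00'
```

To use it on a file, read the file in binary mode, pass the bytes
through the function, and write the result back in binary mode.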

Finally, as a side note, and not of any importance, UTF-7 files can
have a BOM too: 2B 2F 76 38 2D (ASCII +/v8-). UTF-7 files are no
special type under Windows; they are saved as "ANSI", as if they were
regular ASCII or Latin-1 text. The UTF-7 BOM is useful only for
testing a UTF-7 encoded text file when dragging it into Internet
Explorer (5 and upwards), which recognizes the BOM and promptly sets
its encoding to UTF-7. However, given that the UTF-7 encoding has so
little use (in our day of 8-bit clean systems, which let data with the
high bit on pass uncorrupted), this can only serve as a piece of
trivia.
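
For the curious, the UTF-7 signature can be checked with Python's
utf-7 codec. This is just a sketch confirming that the five bytes
quoted above decode to U+FEFF; the sample text "Hello" is made up.

```python
# The five bytes 2B 2F 76 38 2D are the ASCII characters +/v8-
utf7_bom = bytes.fromhex("2B 2F 76 38 2D")
print(utf7_bom)  # b'+/v8-'

# Decoded as UTF-7 they yield the single character U+FEFF.
decoded = (utf7_bom + b"Hello").decode("utf-7")
print(repr(decoded))  # '\ufeffHello'
```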

--- END ---


This archive was generated by hypermail 2.1.2 : Mon Apr 08 2002 - 14:27:27 EDT