Re: Unicode and Kermit

From: Mark Davis (
Date: Tue Aug 10 1999 - 22:25:27 EDT

> Always writing a BOM is a safe choice, because a BOM is semantically
> zero-width no-break space, which is essentially a no-op.

This is not quite true: the BOM is not a pure no-op; it does need to be removed
from a file. For example, if I split a file into two, then concatenate, the
result should be identical to the original--it isn't unless I remove the BOM.
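
To make the split/concatenate point concrete, here is a minimal Python sketch.
It assumes big-endian UTF-16 and a hypothetical writer (`write_with_bom`) that
always prepends a BOM; neither the helper names nor the sample text come from
Kermit or from any real tool.

```python
import codecs

BOM = codecs.BOM_UTF16_BE  # bytes FE FF


def write_with_bom(text: str) -> bytes:
    """Hypothetical writer: big-endian UTF-16, always prefixed with a BOM."""
    return BOM + text.encode("utf-16-be")


def strip_bom(data: bytes) -> bytes:
    """Drop a leading BOM if one is present."""
    return data[len(BOM):] if data.startswith(BOM) else data


text = "Hello, Kermit"
original = write_with_bom(text)

# Split the text in two; a writer that always emits a BOM marks both parts.
part1 = write_with_bom(text[:6])
part2 = write_with_bom(text[6:])

naive = part1 + part2               # second BOM is left in the middle
careful = part1 + strip_bom(part2)  # second BOM removed before joining

print(naive == original)    # False: a stray U+FEFF sits inside the data
print(careful == original)  # True
```

The naive result decodes to text with an extra U+FEFF in the middle, which is
why the BOM cannot be treated as something that is always safe to write.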

Frank, you should also look at the draft Unicode FAQ, since it discusses some of
these issues. Look at There is
also some information there on shortest sequences: you should always generate
them, but you may accept longer ones.
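
As a rough illustration of what "shortest sequence" means in UTF-8 terms, the
Python sketch below hand-encodes a code point in its shortest form and then
shows a longer two-byte sequence that carries the same value. The function
names are invented for this example, surrogate code points are not checked,
and the lenient decoder handles two-byte sequences only. Note that Python's
own strict decoder refuses the longer form rather than accepting it.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the shortest UTF-8 sequence.

    Sketch only: does not reject surrogates or values above U+10FFFF.
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def decode_two_byte_lenient(seq: bytes) -> int:
    """Decode a two-byte sequence without checking that it is the shortest."""
    if len(seq) != 2 or (seq[0] & 0xE0) != 0xC0 or (seq[1] & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    return ((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F)


shortest = utf8_encode(0x2F)   # "/" as the single byte 2F
longer = bytes([0xC0, 0xAF])   # the same value spread over two bytes

same_value = decode_two_byte_lenient(longer) == 0x2F

# Python's built-in decoder is strict and rejects the longer form.
try:
    longer.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```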


Frank da Cruz wrote:

> John Cowan wrote:
> > The Unicode Standard says to assume big-endian unless instructed
> > otherwise. AFAIK the commonest little-endian case is Microsoft, and a
> > BOM will be present then.
> >
> How about when receiving data to be stored as UCS-2? Which byte order
> should be used BY DEFAULT? If I am receiving the file on (say) Windows 98
> (which is Intel only), should I store it with little-endian byte order?
> Whereas on a Sparc (with any OS) I should write big-endian?
>
> When you say "a BOM will be present" does this mean BOMs are mandatory for
> UCS-2/UTF-16 files on Windows 95/98/NT/2000? Is there a reference for this?
>
> What are the real-world conventions for storing UCS-2/UTF-16 data?...
>
> . Store it according to the endianness of the hardware platform?
> . Store it in little-endian format if Windows, otherwise big-endian?
> . Every platform and/or application has its own rules?
>
> For example, do Windows NT on Intel and MIPS (before it was canceled) use
> the same or opposite byte order (MIPS is big-endian)? If they use opposite
> byte order, what would that mean for file sharing? Ditto for (say) NFS
> mounts between platforms of opposite endianness.
>
> In the real world, do UCS-2 files always start with a BOM? Do all
> applications that handle UCS-2 handle the BOM and swap bytes if necessary?
> When UCS-2 files do not have a BOM, how do applications handle byte order?
>
> > ... The swapped BOM is a non-character, so it can't appear in a
> > well-formed UTF-16 file. But you shouldn't have to byte swap a UTF-8
> > file that appears to begin with a U+FFFE.
> >
> You shouldn't swap UTF-8 anyway, right? In fact, what's the point of the
> UTF-8 BOM since byte order is not an issue with UTF-8? (And since, after
> all, any file can begin with EF BB BF, or FFFE for that matter...)
>
> I suppose on a file system that contains only Unicode text, the BOMs serve
> to identify the transformation format, but on mixed file systems they are
> not a good indicator of anything unless we already know that it's Unicode
> text.
>
> - Frank
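
For reference, the rule John Cowan cites above (honor a BOM if there is one,
otherwise assume big-endian) can be sketched in Python as follows. The
function name `decode_16bit` is hypothetical, the input is assumed to be
already known to be 16-bit Unicode text, and nothing here describes what any
particular platform or application actually does.

```python
import codecs


def decode_16bit(data: bytes) -> str:
    """Decode 16-bit Unicode text, honoring a leading BOM if present.

    Assumption: the caller already knows the data is UCS-2/UTF-16.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF: big-endian
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE: little-endian
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to big-endian, per the rule quoted above.
    return data.decode("utf-16-be")


print(decode_16bit(b"\xfe\xff\x00A"))  # A (big-endian with BOM)
print(decode_16bit(b"\xff\xfeA\x00"))  # A (little-endian with BOM)
print(decode_16bit(b"\x00A"))          # A (no BOM, assumed big-endian)
```

This only works when the data is known in advance to be 16-bit Unicode, which
is exactly Frank's caveat about mixed file systems.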
