Re: Unicode and Kermit

From: John Cowan
Date: Mon Aug 09 1999 - 13:05:25 EDT

Frank da Cruz scripsit:

> How about when receiving data to be stored as UCS-2? Which byte order
> should be used BY DEFAULT? If I am receiving the file on (say) Windows 98
> (which is Intel only), should I store it with little-endian byte order?
> Whereas on a Sparc (with any OS) I should write big-endian?

In that case I think the native order should win.
But never write a little-endian file without a BOM.
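
As a rough illustration of that rule, here is a minimal Python sketch that
encodes text in the machine's native byte order and always puts a BOM first.
The function name encode_utf16_native is hypothetical, and the sketch treats
UCS-2 and UTF-16 as interchangeable, which holds only for characters in the
Basic Multilingual Plane.

```python
import codecs
import sys

def encode_utf16_native(text):
    """Encode text as UTF-16 in this machine's byte order, BOM first.

    Hypothetical helper for illustration; not part of Kermit or any library.
    """
    if sys.byteorder == "little":
        # Intel and similar: a little-endian file must carry its BOM.
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    # Sparc and similar: the BOM is not strictly required but is harmless.
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
```

Python's own "utf-16" codec behaves much the same way when encoding; the
explicit version just makes the byte-order decision visible.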

> When you say "a BOM will be present" does this mean BOMs are mandatory for
> UCS-2/UTF-16 files on Windows 95/98/NT/2000? Is there a reference for this?

What is "mandatory"? Windows NT Notepad, with which Win32 users should
expect to interoperate, always writes LE with a BOM. Bogusly, it does
not detect a BE BOM and swap.

> For example, do Windows NT on Intel and MIPS (before it was canceled) use
> the same or opposite byte order (MIPS is big-endian)? If they use opposite
> byte order, what would that mean for file sharing? Ditto for (say) NFS
> mounts between platforms of opposite endianness.

NT on MIPS put the MIPS chip into LE mode, so both use the same byte order.

> In the real world, do UCS-2 files always start with a BOM? Do all
> applications that handle UCS-2 handle the BOM and swap bytes if necessary?

Alas, no (see above). But they should.
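
What "handle the BOM and swap bytes if necessary" could look like on the
reading side, as a minimal Python sketch. The function name decode_utf16 and
the big-endian default for BOM-less data are assumptions made for this
example, not something any particular application is known to do.

```python
import codecs

def decode_utf16(data, default="utf-16-be"):
    """Decode UTF-16 bytes, honoring a BOM of either byte order.

    Hypothetical helper; `default` is an assumed choice for BOM-less input.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        # FF FE: little-endian, drop the BOM and decode accordingly.
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        # FE FF: big-endian, drop the BOM and decode accordingly.
        return data[2:].decode("utf-16-be")
    # No BOM at all: fall back to the caller's assumed byte order.
    return data.decode(default)
```

The same bytes for "A" decode correctly whichever order they arrive in:
decode_utf16(b"\xff\xfeA\x00") and decode_utf16(b"\xfe\xff\x00A") both
return "A".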

> > ... The swapped BOM is a non-character, so it can't appear in a
> > well-formed UTF-16 file. But you shouldn't have to byte swap a UTF-8
> > file that appears to begin with a U+FFFE.
> >
> You shouldn't swap UTF-8 anyway, right? In fact, what's the point of the
> UTF-8 BOM since byte order is not an issue with UTF-8? (And since, after
> all, any file can begin with EF BB BF, or FFFE for that matter...)

In principle, yes. But neither of these is a *probable* sequence.
Some people want to see a UTF-8 BOM so they have more assurance that
a file really is UTF-8, but this is not standardized.

> I suppose on a file system that contains only Unicode text, the BOMs serve
> to identify the transformation format, but on mixed file systems they are
> not a good indicator of anything unless we already know that it's Unicode
> text.

They are a probabilistic indicator of Unicode text, which as such is
very helpful. Windows NT Notepad assumes LE UTF-16 if it sees a LE BOM,
otherwise the current 8-bit code page.
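
A minimal Python sketch of BOM sniffing along those lines, extended to
recognize the BE and UTF-8 BOMs as well. The function name sniff_encoding is
hypothetical, and cp1252 merely stands in for "the current 8-bit code page".
The result is a guess, not a guarantee, since any file can begin with these
bytes.

```python
import codecs

def sniff_encoding(data, fallback="cp1252"):
    """Guess a Python codec name from the leading bytes of a file.

    Hypothetical helper; `fallback` is an assumed stand-in for the
    current 8-bit code page.
    """
    if data.startswith(codecs.BOM_UTF8):
        # EF BB BF: "utf-8-sig" strips the signature when decoding.
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM and picks the byte order itself.
        return "utf-16"
    # No BOM: assume 8-bit text in the fallback code page.
    return fallback
```

Typical use would be data.decode(sniff_encoding(data)) on the raw bytes of
the file.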

John Cowan                         
       I am a member of a civilization. --David Brin
