Re: Unicode and Kermit

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Mon Aug 09 1999 - 12:26:42 EDT


John Cowan wrote:

> The Unicode Standard says to assume big-endian unless instructed
> otherwise. AFAIK the commonest little-endian case is Microsoft, and a
> BOM will be present then.
>
How about when receiving data to be stored as UCS-2? Which byte order
should be used BY DEFAULT? If I am receiving the file on (say) Windows 98
(which is Intel only), should I store it with little-endian byte order?
And on a SPARC (with any OS), should I write big-endian?

When you say "a BOM will be present" does this mean BOMs are mandatory for
UCS-2/UTF-16 files on Windows 95/98/NT/2000? Is there a reference for this?

What are the real-world conventions for storing UCS-2/UTF-16 data?

 . Store it according to the endianness of the hardware platform?
 . Store it in little-endian format if Windows, otherwise big-endian?
 . Every platform and/or application has its own rules?
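
To make the first option concrete, here is a minimal Python sketch of
"store it according to the endianness of the hardware platform", with a BOM
written first so that a reader on the other kind of machine can tell which
order was used. The name encode_utf16_native and its with_bom parameter are
hypothetical, made up for illustration; this shows one possible convention,
not an established rule.

```python
import codecs
import sys


def encode_utf16_native(text, with_bom=True):
    """Encode text as UTF-16 in the byte order of the current hardware.

    Hypothetical helper: the "native order plus BOM" policy is an
    assumption for illustration, not something the Standard mandates.
    """
    if sys.byteorder == "little":
        bom, codec = codecs.BOM_UTF16_LE, "utf-16-le"   # FF FE
    else:
        bom, codec = codecs.BOM_UTF16_BE, "utf-16-be"   # FE FF
    body = text.encode(codec)
    return bom + body if with_bom else body
```

Without the BOM (with_bom=False) the same text produces different bytes on
Intel and on SPARC, and nothing in the file says which is which.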

For example, do Windows NT on Intel and MIPS (before it was canceled) use
the same or opposite byte order (MIPS is big-endian)? If they use opposite
byte order, what would that mean for file sharing? Ditto for (say) NFS
mounts between platforms of opposite endianness.

In the real world, do UCS-2 files always start with a BOM? Do all
applications that handle UCS-2 handle the BOM and swap bytes if necessary?
When UCS-2 files do not have a BOM, how do applications handle byte order?
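
One way a receiving application might handle these cases, sketched in
Python under the assumption (taken from the advice quoted at the top) that
big-endian is the default when no BOM is present. decode_ucs2 is a
hypothetical name. Python's utf-16 codecs also accept surrogate pairs, so
strictly speaking this decodes UTF-16 rather than pure UCS-2.

```python
import codecs


def decode_ucs2(data, default="utf-16-be"):
    """Decode 16-bit Unicode bytes, honoring a leading BOM if present.

    Hypothetical helper. With no BOM, falls back to `default`
    (big-endian here, which is an assumption, not a guarantee).
    """
    if data.startswith(codecs.BOM_UTF16_BE):       # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):       # FF FE: swapped order
        return data[2:].decode("utf-16-le")
    return data.decode(default)                    # no BOM: assume default
```

For example, decode_ucs2(b"\xff\xfeA\x00") and decode_ucs2(b"\x00A") both
give "A", but a little-endian file with no BOM would be misread unless the
caller passes default="utf-16-le".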

> ... The swapped BOM is a non-character, so it can't appear in a
> well-formed UTF-16 file. But you shouldn't have to byte swap a UTF-8
> file that appears to begin with a U+FFFE.
>
You shouldn't swap UTF-8 anyway, right? In fact, what's the point of the
UTF-8 BOM since byte order is not an issue with UTF-8? (And since, after
all, any file can begin with EF BB BF, or FFFE for that matter...)

I suppose on a file system that contains only Unicode text, the BOMs serve
to identify the transformation format, but on mixed file systems they are
not a good indicator of anything unless we already know that it's Unicode
text.
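
A small Python sketch of using the BOM as a format signature, and of why
it is only a hint. sniff_bom is a hypothetical name; the byte sequences
come from Python's standard codecs module. It checks only the UTF-8 and
UTF-16 signatures discussed here.

```python
import codecs

# Signatures to look for, paired with the encoding form each suggests.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data):
    """Return (guessed_encoding, bom_length), or (None, 0) if no BOM.

    Hypothetical helper. The result is a guess: it is only meaningful
    if the data is already known to be Unicode text.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A Latin-1 file that happens to start with the two characters y-diaeresis
and thorn (bytes FF FE) is reported as "utf-16-le", which is exactly the
mixed-file-system problem.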

- Frank


