Re: Unicode and Kermit

From: Mark Davis (mark@macchiato.com)
Date: Tue Aug 10 1999 - 22:25:27 EDT


> Always writing a BOM is a safe choice, because a BOM is semantically
> zero-width no-break space, which is essentially a no-op.
>

This is not quite true: a BOM is not quite a no-op; it does need to be removed
from a file. For example, if I split a file into two, then concatenate the
pieces, the result should be identical to the original--it isn't unless I
remove the BOM that was written at the start of the second piece.
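
To make the split-and-concatenate point concrete, here is a minimal Python sketch. It assumes big-endian UTF-16 and a writer that always emits a BOM; the helper name write_with_bom and the sample text are made up for illustration.

```python
import codecs

BOM = codecs.BOM_UTF16_BE  # the two bytes FE FF

def write_with_bom(text: str) -> bytes:
    """Hypothetical writer that always prefixes a BOM."""
    return BOM + text.encode("utf-16-be")

original = write_with_bom("Hello, Kermit")

# Split the text in two and write each piece as its own file.
part1 = write_with_bom("Hello, ")
part2 = write_with_bom("Kermit")

# Plain concatenation leaves a stray U+FEFF in the middle of the data.
naive = part1 + part2

# Dropping the second piece's BOM restores the original bytes.
careful = part1 + part2[len(BOM):]

print(naive == original)    # False
print(careful == original)  # True
```

The stray U+FEFF in the naive result is invisible when displayed, but the bytes no longer match the original file.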

Frank, you should also look at the draft Unicode FAQ, since it discusses some of
these issues. Look at http://www.unicode.org/unicode/faq/#Encoding. There is
also some information there on shortest sequences: you should always generate
them, but you may accept longer ones.
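
As an illustration of "shortest sequences", the following Python sketch encodes a single code point in the fewest bytes UTF-8 allows. The function name utf8_encode_shortest is hypothetical, the sketch ignores surrogate code points, and it only covers the generating side; it is not taken from the FAQ.

```python
def utf8_encode_shortest(cp: int) -> bytes:
    """Encode one code point using the fewest bytes UTF-8 allows.

    Sketch only: no check for surrogates (U+D800..U+DFFF).
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# U+002F '/' fits in one byte, so that is what a generator should emit.
shortest_slash = utf8_encode_shortest(0x2F)

# The same code point padded out to two bytes: a "longer" sequence.
OVERLONG_SLASH = b"\xc0\xaf"

print(shortest_slash)
print(shortest_slash == OVERLONG_SLASH)  # False
```

Note that Python's own UTF-8 decoder is stricter than the advice above: it refuses the two-byte form C0 AF rather than accepting it.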

Mark

Frank da Cruz wrote:

> John Cowan wrote:
>
> > The Unicode Standard says to assume big-endian unless instructed
> > otherwise. AFAIK the commonest little-endian case is Microsoft, and a
> > BOM will be present then.
> >
> How about when receiving data to be stored as UCS-2? Which byte order
> should be used BY DEFAULT? If I am receiving the file on (say) Windows 98
> (which is Intel only), should I store it with little-endian byte order?
> Whereas on a Sparc (with any OS) I should write big-endian?
>
> When you say "a BOM will be present" does this mean BOMs are mandatory for
> UCS-2/UTF-16 files on Windows 95/98/NT/2000? Is there a reference for this?
>
> What are the real-world conventions for storing UCS-2/UTF-16 data?...
>
> . Store it according to the endianness of the hardware platform?
> . Store it in little-endian format if Windows, otherwise big-endian?
> . Every platform and/or application has its own rules?
>
> For example, do Windows NT on Intel and MIPS (before it was canceled) use
> the same or opposite byte order (MIPS is big-endian)? If they use opposite
> byte order, what would that mean for file sharing? Ditto for (say) NFS
> mounts between platforms of opposite endianness.
>
> In the real world, do UCS-2 files always start with a BOM? Do all
> applications that handle UCS-2 handle the BOM and swap bytes if necessary?
> When UCS-2 files do not have a BOM, how do applications handle byte order?
>
> > ... The swapped BOM is a non-character, so it can't appear in a
> > well-formed UTF-16 file. But you shouldn't have to byte swap a UTF-8
> > file that appears to begin with a U+FFFE.
> >
> You shouldn't swap UTF-8 anyway, right? In fact, what's the point of the
> UTF-8 BOM since byte order is not an issue with UTF-8? (And since, after
> all, any file can begin with EF BB BF, or FFFE for that matter...)
>
> I suppose on a file system that contains only Unicode text, the BOMs serve
> to identify the transformation format, but on mixed file systems they are
> not a good indicator of anything unless we already know that it's Unicode
> text.
>
> - Frank
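
One way a receiver might handle the byte-order questions above, sketched in Python. It assumes the caller already knows the data is UCS-2/UTF-16 (as Frank notes, a BOM alone proves nothing on a mixed file system) and falls back to big-endian when no BOM is present, per the rule John cites. The function name decode_utf16 is made up for illustration.

```python
import codecs

def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode data already known to be UCS-2/UTF-16.

    A leading BOM selects the byte order and is stripped; with no
    BOM, the assumed default is big-endian.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE: swapped order
        return data[2:].decode("utf-16-le")
    return data.decode(default)

text = "Kermit"
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
bare = text.encode("utf-16-be")  # no BOM at all

print(decode_utf16(big) == decode_utf16(little) == decode_utf16(bare))  # True
```

The default parameter is there so that a platform with a different convention can override the big-endian assumption; which conventions exist in practice is exactly the open question in this thread.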


