Re: Unicode and Kermit

From: John Cowan (cowan@locke.ccil.org)
Date: Sun Aug 08 1999 - 22:54:45 EDT


Frank da Cruz scripsit:

> I am assuming that if a BOM is present, I should believe it (is that a good
> idea?).

Yes.

> Furthermore I should strip the BOM prior to sending, since it serves
> no purpose on the wire. The file sender's main task is to ensure the bytes
> go out in the right order.

A reasonable choice.

> So when reading a UCS-2 file, the file sender should:

You should talk of UTF-16, not UCS-2.

> 1. Use the BOM if found.
>
> 2. If there is no BOM, assume the local machine's endianness unless
> instructed otherwise.

The Unicode Standard says to assume big-endian unless instructed
otherwise. AFAIK the commonest little-endian case is Microsoft, and a
BOM will be present then.
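
A minimal Python sketch of the sender-side rules discussed so far, not
Kermit's actual code. The function names, the "big"/"little" labels, and
the choice of big-endian as the order sent on the wire are assumptions
made for illustration only.

```python
import codecs


def detect_utf16_order(data, default="big"):
    """Return (byte_order, payload) for a UTF-16 byte string.

    A leading BOM decides the byte order and is stripped. Without a BOM,
    `default` applies (big-endian unless the caller says otherwise).
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little", data[2:]
    return default, data


def to_wire(data, default="big"):
    """Return BOM-less big-endian bytes (assumed wire order)."""
    order, payload = detect_utf16_order(data, default)
    if len(payload) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    if order == "little":
        # Swap each pair of bytes: low byte first becomes high byte first.
        swapped = bytearray(payload)
        swapped[0::2], swapped[1::2] = payload[1::2], payload[0::2]
        return bytes(swapped)
    return payload
```

For example, to_wire(b"\xff\xfeA\x00") strips the little-endian BOM and
returns b"\x00A". A user override for BOM-less files would simply be a
different `default` argument.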

> So looks like any data transfer system involving UCS-2 needs controls to
> force byte swapping at either end, and to write or not write a BOM to the
> destination file.

Always writing a BOM is a safe choice, because a BOM is semantically a
zero-width no-break space (U+FEFF), which is essentially a no-op.

> What about defaults and precedence? When reading a UCS-2 or UTF-8 file,
> should the BOM always override any global settings or preferences?

Yes. The byte-swapped BOM, U+FFFE, is a noncharacter, so it can't appear
in a well-formed UTF-16 file. But you shouldn't have to byte swap a UTF-8
file that appears to begin with a U+FFFE: UTF-8 is a sequence of single
bytes, so it has no byte order to swap.
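
The precedence rule can be sketched in Python as follows. This is an
illustration under stated assumptions: `choose_decoder` and its
`preference` parameter are hypothetical names standing in for whatever
global setting the transfer program keeps.

```python
import codecs


def choose_decoder(data, preference="utf-16-be"):
    """Return (codec_name, bom_length) for the start of a file.

    A BOM, when present, always wins over `preference`.
    """
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF: nothing to swap
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE, i.e. a swapped BOM
        return "utf-16-le", 2
    return preference, 0                       # no BOM: use the setting


def decode_file_bytes(data, preference="utf-16-be"):
    """Decode `data`, dropping any BOM and honouring it over `preference`."""
    codec, skip = choose_decoder(data, preference)
    return data[skip:].decode(codec)
```

With a big-endian preference in force, decode_file_bytes(b"\xff\xfeA\x00")
still returns "A", because the BOM overrides the setting.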

> When writing out a UCS-2 file, should we write a BOM by default or only on
> request? What about UTF-8?

For UCS-2, yes by default (IMHO always). For UTF-8, no by default
(IMHO never).
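
One way to express those output defaults, as a hedged Python sketch. The
table and the `encode_for_file` function are hypothetical; passing
`write_bom` explicitly models an "only on request" override.

```python
import codecs

# Default BOM policy per encoding: yes for UTF-16, no for UTF-8.
DEFAULT_WRITE_BOM = {"utf-16-be": True, "utf-16-le": True, "utf-8": False}

BOMS = {
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-8": codecs.BOM_UTF8,
}


def encode_for_file(text, encoding="utf-16-be", write_bom=None):
    """Encode `text`, prefixing a BOM according to the policy above.

    write_bom=None means "use the default for this encoding".
    """
    if encoding not in BOMS:
        raise ValueError("unsupported encoding: %r" % (encoding,))
    if write_bom is None:
        write_bom = DEFAULT_WRITE_BOM[encoding]
    body = text.encode(encoding)
    return BOMS[encoding] + body if write_bom else body
```

So encode_for_file("A") yields b"\xfe\xff\x00A", while
encode_for_file("A", "utf-8") yields plain b"A".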

> Finally, about UTF-8 -- there has been some talk recently about "shortest
> sequences". Do the words at the top of page A-8 of The Unicode Standard 2.0
> still apply? It would seem they are consonant with the well-known dictum
> "Be conservative in what you send, liberal in what you accept".

Yes. Always use the shortest sequence; nothing else is conformant.
It is up to you whether you accept long sequences or not; they are
invalid, so it's a matter of your philosophy of error recovery.
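
To make the "philosophy of error recovery" concrete, here is a small
Python sketch that decodes one UTF-8 sequence and either rejects or
tolerates a non-shortest form. The names `decode_sequence` and
`accept_overlong` are hypothetical. As a simplifying assumption it
handles only one- to four-byte sequences and does not check for
surrogates or values above U+10FFFF.

```python
# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_sequence(seq, accept_overlong=False):
    """Decode one UTF-8 sequence (a bytes object) to a code point.

    Raises ValueError on malformed input, and on non-shortest forms
    unless accept_overlong is true.
    """
    if not seq:
        raise ValueError("empty sequence")
    first = seq[0]
    if first < 0x80:
        length, value = 1, first
    elif first & 0xE0 == 0xC0:
        length, value = 2, first & 0x1F
    elif first & 0xF0 == 0xE0:
        length, value = 3, first & 0x0F
    elif first & 0xF8 == 0xF0:
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    # The shortest-form check: the value must need this many bytes.
    if value < MIN_FOR_LENGTH[length] and not accept_overlong:
        raise ValueError("non-shortest (overlong) sequence")
    return value
```

The two-byte form C0 AF is a non-shortest encoding of "/" (U+002F):
decode_sequence(b"\xc0\xaf") raises ValueError, while passing
accept_overlong=True returns 0x2F. On the sending side no such choice
arises, since a conformant encoder emits only shortest forms.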

-- 
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin


