Frank da Cruz scripsit:
> I am assuming that if a BOM is present, I should believe it (is that a good
> assumption?).
> Furthermore I should strip the BOM prior to sending, since it serves
> no purpose on the wire. The file sender's main task is to ensure the bytes
> go out in the right order.
A reasonable choice.
> So when reading a UCS-2 file, the file sender should:
You should talk of UTF-16, not UCS-2.
> 1. Use the BOM if found.
> 2. If there is no BOM, assume the local machine's endianness unless
> instructed otherwise.
The Unicode Standard says to assume big-endian unless instructed
otherwise. AFAIK the commonest little-endian case is Microsoft, and a
BOM will be present then.
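The reading rule can be sketched in Python roughly as follows. This is only an illustration, not anyone's actual file-transfer code: the names `split_utf16_bom`, `decode_utf16`, and the `default_order` parameter are hypothetical, and the fallback to big-endian follows the Unicode Standard's default rather than the local machine's endianness.

```python
import codecs

def split_utf16_bom(data: bytes, default_order: str = "big"):
    """Return (byte_order, payload) for a UTF-16 byte string.

    A BOM, if present, decides the byte order and is stripped.
    Otherwise default_order applies (hypothetical parameter; "big"
    is the Unicode Standard's default).
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little", data[2:]
    return default_order, data

def decode_utf16(data: bytes, default_order: str = "big") -> str:
    """Decode UTF-16 text, honouring a BOM before any default."""
    order, payload = split_utf16_bom(data, default_order)
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return payload.decode(codec)
```

Passing `default_order="little"` stands in for the "unless instructed otherwise" case; it only takes effect when no BOM is found.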
> So looks like any data transfer system involving UCS-2 needs controls to
> force byte swapping at either end, and to write or not write a BOM to the
> destination file.
Always writing a BOM is a safe choice, because a BOM is semantically
a zero-width no-break space (U+FEFF), which is essentially a no-op.
> What about defaults and precedence? When reading a UCS-2 or UTF-8 file,
> should the BOM always override any global settings or preferences?
Yes. The swapped BOM is a non-character, so it can't appear in a
well-formed UTF-16 file. But you shouldn't have to byte swap a UTF-8
file that appears to begin with a U+FFFE.
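To illustrate why the BOM can safely take precedence, here is a rough Python sketch that normalizes UTF-16 bytes to big-endian. The function names and the `assume_big_endian` flag are hypothetical, and choosing big-endian as the target order is an assumption made for the example. The point is that the first 16-bit unit read as 0xFFFE can only be a swapped BOM, whereas a UTF-8 signature has no byte order to correct.

```python
import codecs

def swap_bytes(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit (length assumed even)."""
    swapped = bytearray(data)
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)

def to_big_endian(data: bytes, assume_big_endian: bool = True) -> bytes:
    """Return BOM-less big-endian UTF-16; a BOM overrides the flag."""
    first = int.from_bytes(data[:2], "big")
    if first == 0xFEFF:            # BOM already in big-endian order
        return data[2:]
    if first == 0xFFFE:            # swapped BOM: a noncharacter, never text
        return swap_bytes(data[2:])
    # No BOM at all: only now does the caller's setting matter.
    return data if assume_big_endian else swap_bytes(data)

def strip_utf8_signature(data: bytes) -> bytes:
    """UTF-8 has no byte order; just drop EF BB BF if present."""
    return data[3:] if data.startswith(codecs.BOM_UTF8) else data
```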
> When writing out a UCS-2 file, should we write a BOM by default or only on
> request? What about UTF-8?
UCS-2 yes by default (IMHO always). UTF-8, no by default (IMHO always).
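Those writing defaults might look like this in Python. It is a minimal sketch under stated assumptions: `encode_for_file` and its `write_bom` parameter are hypothetical names, and allowing an explicit override is a design choice for the example (the "IMHO always" position would simply drop that parameter).

```python
import codecs

# BOM byte sequences for the encodings this sketch supports.
_BOMS = {
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-8": codecs.BOM_UTF8,
}

def encode_for_file(text: str, encoding: str = "utf-16-be", write_bom=None) -> bytes:
    """Encode text; by default add a BOM for UTF-16 but not for UTF-8.

    write_bom=None means "use the default for this encoding";
    True or False is an explicit request (hypothetical parameter).
    """
    if encoding not in _BOMS:
        raise ValueError(f"unsupported encoding: {encoding}")
    if write_bom is None:
        write_bom = encoding.startswith("utf-16")
    body = text.encode(encoding)
    return _BOMS[encoding] + body if write_bom else body
```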
> Finally, about UTF-8 -- there has been some talk recently about "shortest
> sequences". Do the words at the top of page A-8 of The Unicode Standard 2.0
> still apply? It would seem they are consonant with the well-known dictum
> "Be conservative in what you send, liberal in what you accept".
Yes. Always use the shortest sequence; nothing else is conformant.
It is up to you whether you accept long sequences or not; they are
invalid, so it's a matter of your philosophy of error recovery.
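As a concrete case, the bytes C0 AF are a non-shortest (overlong) two-byte encoding of "/" (U+002F), whose only conformant encoding is the single byte 2F. The Python sketch below shows the two recovery philosophies using the standard `bytes.decode` error handlers; the function names are hypothetical.

```python
# Overlong two-byte form of "/" (U+002F): 110_00000 10_101111
OVERLONG_SLASH = b"\xc0\xaf"

def decode_strict(data: bytes) -> str:
    """Reject ill-formed UTF-8, including overlong sequences.

    Raises UnicodeDecodeError on invalid input.
    """
    return data.decode("utf-8")

def decode_lenient(data: bytes) -> str:
    """Recover by substituting U+FFFD for ill-formed bytes."""
    return data.decode("utf-8", errors="replace")
```

Neither variant ever turns the overlong form back into "/"; the only difference is whether the error is fatal or papered over with replacement characters.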
-- 
John Cowan email@example.com
I am a member of a civilization. --David Brin