Re: UTF-8N?

From: Peter_Constable@sil.org
Date: Tue Jun 20 2000 - 14:32:25 EDT


MD>In XML, this situation does not arise, since it specifies the exact
usage of BOM, but it can arise in other circumstances.

Another recent thread suggests that the situation with BOM and XML is, in
fact, *not* clear.

>AL> I understand there is no way to know whether you SHALL/SHOULD/MAY
>AL> delete it or not, but I fail to see the danger: BOM (well, ZWNBSP)
>AL> cannot carry any useful meaning when it appears at the beginning
>AL> of a text, can it? So what can be the problem?
>
>You have a large plain-text Unicode file. It doesn't fit on a single floppy,
>so you split it into two parts. You put the file onto two copies with an MD5
>checksum to ensure you know if the file gets corrupted.
>
>Later on, you merge the two files, and compute the checksum of the concatenated
>file. If the program used for splitting inserted a BOM, but the program used
>for merging didn't remove it, the checksum comparison is going to fail.

Doesn't that simply indicate that, in a protocol that dissects a long file
into parts to be transmitted separately, it is inappropriate to add a BOM
to the beginnings of the parts, whether they use UTF-8 or UTF-16? (In John
Cowan's example of U+0020 followed by U+FEFF, the problem applies to both
UTF-8 and UTF-16.) The parts are not individual plain text files; they are
some other type of object which, when assembled in accordance with that
protocol, can produce a plain text file. We shouldn't mix up the use of the
BOM and protocols that are not directly related to Unicode.
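
To make the scenario concrete, here is a rough sketch in Python of the
split-and-merge case quoted above. The function names (split_with_bom,
merge_naive, merge_stripping, check) are hypothetical and do not correspond
to any real tool; the splitter is assumed to prepend a BOM to every part,
which is exactly the behaviour being argued against. It only aims to show
that the checksum mismatch appears in UTF-8 and UTF-16 alike.

```python
import codecs
import hashlib


def split_with_bom(data: bytes, bom: bytes) -> list:
    """Hypothetical splitter that (wrongly) prepends a BOM to each part."""
    mid = len(data) // 2
    return [bom + data[:mid], bom + data[mid:]]


def merge_naive(parts: list) -> bytes:
    """Concatenate the parts as-is, leaving any added BOMs in place."""
    return b"".join(parts)


def merge_stripping(parts: list, bom: bytes) -> bytes:
    """Concatenate after removing one leading BOM from each part.

    Assumption: the merger knows the splitter added exactly one BOM per part.
    """
    return b"".join(p[len(bom):] if p.startswith(bom) else p for p in parts)


def md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def check(encoding: str, bom: bytes) -> tuple:
    """Return (naive merge matches original, stripping merge matches original)."""
    original = ("plain text " * 4).encode(encoding)
    parts = split_with_bom(original, bom)
    return (
        md5(merge_naive(parts)) == md5(original),
        md5(merge_stripping(parts, bom)) == md5(original),
    )


if __name__ == "__main__":
    for name, enc, bom in [
        ("UTF-8", "utf-8", codecs.BOM_UTF8),
        ("UTF-16BE", "utf-16-be", codecs.BOM_UTF16_BE),
    ]:
        naive_ok, stripped_ok = check(enc, bom)
        print(name, "naive merge matches:", naive_ok,
              "| stripping merge matches:", stripped_ok)
```

The naive merge fails the comparison in both encodings, and the stripping
merge only recovers the original because it relies on knowing what the
splitter did. That knowledge belongs to the splitting protocol, not to
Unicode.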

Peter Constable


