RE: Byte Order Marks

From: Yves Arrouye (yves@realnames.com)
Date: Fri Apr 20 2001 - 04:15:24 EDT


> On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote:
> > On the other hand, if you get a file from your platform and
> > it is in 16-bit Unicode, then you would appreciate the
> > convenience of the auto-endian alias.
>
> But nothing should be spitting out platform-endian UTF-16! In the
> case that there's a lot of unmarked big-endian UTF-16 around (as I
> understand the ISO-10646 standard recommends), then that assumption
> that everything emits unmarked platform-dependent UTF-16 will be
> wrong.

And for reference, on Windows, Unicode files are recognized because they
have a BOM. Write plain UTF-16LE without a BOM, and your file won't be
recognized properly. Manipulating these files with ICU today is a bit
painful, since one needs to strip the BOM on input (if I understand Markus
correctly) and write a BOM on output. So these files cannot be manipulated
using applications like uconv, which blindly use the raw converters.
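
To make the strip-on-input, write-on-output step concrete, here is a rough
Python sketch (not ICU code). The helper names strip_bom and add_bom are
made up for illustration, and the sketch assumes the data is UTF-16 only;
it does not try to guess the byte order of unmarked input.

```python
import codecs


def strip_bom(data: bytes):
    """Return (encoding, payload) with a leading UTF-16 BOM removed.

    Hypothetical helper: encoding is None when no BOM is present,
    because this sketch does not guess the byte order.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", data[len(codecs.BOM_UTF16_LE):]
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", data[len(codecs.BOM_UTF16_BE):]
    return None, data


def add_bom(payload: bytes, encoding: str) -> bytes:
    """Prepend the BOM matching a BOM-less UTF-16 encoding name."""
    if encoding == "utf-16-le":
        return codecs.BOM_UTF16_LE + payload
    if encoding == "utf-16-be":
        return codecs.BOM_UTF16_BE + payload
    raise ValueError("expected 'utf-16-le' or 'utf-16-be'")


# Round trip: read a BOM-marked little-endian buffer, transform the
# text with a raw (BOM-unaware) codec, then mark the output again.
raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
enc, body = strip_bom(raw)
text = body.decode(enc)
out = add_bom(text.upper().encode(enc), enc)
```

The point is only that the BOM handling has to live outside the raw
converter; a tool that just pipes bytes through "utf-16-le" would keep or
drop the BOM by accident rather than by design.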

YA


