RE: Variations of UTF-16 (was: Re: "UNICODE BOMBER STRIKES AGAIN")

From: jarkko.hietaniemi@nokia.com
Date: Wed Apr 24 2002 - 13:37:39 EDT


> Why? The problems with a BOM in UTF-8 have to do with it being an
> ASCII-compatible encoding.

Err, no. That's not the point, AFAIK. The point is that traditionally
in UNIX there hasn't been any sort of "marker" or "tag" at the beginning
of a file, UNIX files being flat streams of bytes. The UNIX toolset has
been built with this principle in mind. No metadata in the files. A BOM
breaks this.

  cat file1 file2 file3 > file4

will have three BOMs, two of them in the middle of file4.
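
To make the cat case concrete, here is a minimal Python sketch. It assumes
each input file starts with the UTF-8 BOM (bytes EF BB BF); the file
contents and the cat() helper are made up for illustration and only mimic
what cat does, which is to copy bytes without interpreting them.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

def cat(*parts: bytes) -> bytes:
    """Concatenate byte streams the way cat does: no interpretation at all."""
    return b"".join(parts)

# Hypothetical file contents, each saved with a leading BOM.
file1 = BOM + "one\n".encode("utf-8")
file2 = BOM + "two\n".encode("utf-8")
file3 = BOM + "three\n".encode("utf-8")

file4 = cat(file1, file2, file3)

# Total number of BOMs in the result, and how many are not at offset 0.
bom_count = file4.count(BOM)
stray_boms = file4[len(BOM):].count(BOM)
```

Only the first BOM sits where a BOM-aware reader would look for it; the
other two end up as stray bytes in the middle of the data.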

  wc -c file1

would have to skip the BOM in order not to report a wrong byte count.
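
A rough Python sketch of the difference, assuming a UTF-8 BOM. Both
function names are hypothetical: byte_count() is what a plain byte counter
like wc -c reports, payload_byte_count() is what a BOM-aware variant would
have to do instead.

```python
import codecs

def byte_count(data: bytes) -> int:
    """Plain byte count: every byte is counted, BOM included."""
    return len(data)

def payload_byte_count(data: bytes) -> int:
    """Hypothetical BOM-aware count: ignore one leading UTF-8 BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return len(data) - len(codecs.BOM_UTF8)
    return len(data)

# "hello\n" is 6 bytes; the BOM adds 3 more.
data = codecs.BOM_UTF8 + b"hello\n"
```

Which of the two numbers is "right" depends on whether the BOM counts as
part of the file, which is exactly the question a flat byte stream never
had to answer.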

  sort -o file5 file1

would have to strip the BOM from file1 (but put it back into file5?)
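
A small Python sketch of the sort case, assuming a UTF-8 BOM and a
byte-wise comparison of lines (roughly what sort does in the C locale).
The helper names are made up, and putting the BOM back on the output is
just one possible answer to the question above.

```python
import codecs

BOM = codecs.BOM_UTF8

def sort_lines(data: bytes) -> bytes:
    """Byte-wise line sort that knows nothing about BOMs."""
    return b"".join(sorted(data.splitlines(keepends=True)))

def sort_lines_bom_aware(data: bytes) -> bytes:
    """Hypothetical variant: strip a leading BOM, sort, then put it back."""
    had_bom = data.startswith(BOM)
    body = data[len(BOM):] if had_bom else data
    out = sort_lines(body)
    return BOM + out if had_bom else out

data = BOM + b"banana\napple\ncherry\n"
naive = sort_lines(data)            # BOM stays glued to the "banana" line
aware = sort_lines_bom_aware(data)  # BOM handled separately from the lines
```

In the naive version the BOM travels with the first input line, and since
its first byte (EF) compares higher than any ASCII letter, that line sorts
last with the BOM stuck in the middle of the output.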

And so forth.

If you have a "multifork" filesystem, you can do tagging like this easily
since the "real payload" doesn't get mixed with the metadata. But traditional
UNIX filesystems are not multifork.


