Re: Variations of UTF-16 (was: Re: "UNICODE BOMBER STRIKES AGAIN")

From: David Starner (starner@okstate.edu)
Date: Wed Apr 24 2002 - 14:13:41 EDT


On Wed, Apr 24, 2002 at 01:37:39PM -0400, jarkko.hietaniemi@nokia.com wrote:
> Err, no. That's not the point, AFAIK. The point is that traditionally
> in UNIX there hasn't been any sort of "marker" or "tag" in the beginning,
> UNIX files being flat streams of bytes. The UNIX toolset has been built
> with this principle in mind. No metadata in the files. BOM breaks this.

Not at all true. Look at the head of a PNM file, a quintessentially Unix
file format. PNM, MP3, and PNG files all have metadata identifying them,
and they don't break under Unix systems.
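
As a rough illustration of that kind of in-file metadata, here is a small
Python sketch that guesses a format from the leading bytes. The
sniff_format helper and its lookup table are hypothetical names made up
for this example; the byte values are the 8-byte PNG signature and the
PNM magic numbers "P1" through "P6".

```python
# Hypothetical helper: guess a file format from its leading bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG signature
PNM_MAGICS = {b"P1", b"P2", b"P3", b"P4", b"P5", b"P6"}  # PBM/PGM/PPM, plain and raw


def sniff_format(head: bytes) -> str:
    """Return a coarse format name based on the first bytes of a file."""
    if head.startswith(PNG_SIGNATURE):
        return "png"
    if head[:2] in PNM_MAGICS:
        return "pnm"
    return "unknown"


# A tiny raw PPM header followed by 2x2 RGB pixel data.
ppm = b"P6\n2 2\n255\n" + bytes(12)
print(sniff_format(ppm))
```

The identifying bytes are just ordinary file content, so byte-oriented
tools copy and count them like anything else.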
 
> wc -c file1
>
> would have to skip the BOM to not get a wrong byte count.
>
> sort -o file5 file1
>
> would have to strip the BOM from file1 (but put it back into file5?)

The wrong byte count? wc -c file1 is basically meaningless on a Unicode
file, but at least you can assume it gives the _byte count_ (including
extraneous things like BOMs).
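
To make the byte-count point concrete, here is a minimal Python sketch.
It assumes a little-endian UTF-16 file that starts with a BOM; the
byte_count function is a hypothetical stand-in for what wc -c reports,
not its actual implementation.

```python
import codecs

text = "hello\n"

# The same six characters in two encodings.
utf8 = text.encode("utf-8")
utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # BOM + 2 bytes per character


def byte_count(data: bytes) -> int:
    """Count every byte in the data, BOM included."""
    return len(data)


print(byte_count(utf8), byte_count(utf16))
```

Neither number is the character count in general; each is just an honest
count of the bytes in the file.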

More importantly, how do these programs handle newlines? wc -l counts the
number of \x0A's in the file; sort splits the file based on \x0A. This
will produce nothing of value on a UTF-16 file. They could be changed to
work with UTF-16, but they won't be, as UTF-8 works just fine.
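
A small Python sketch of the newline problem, assuming (as described
above) that the tools simply look for 0x0A bytes. The function name
count_newline_bytes is hypothetical, and the sample text is chosen so
that one character, U+010A, happens to contain a 0x0A byte in UTF-16.

```python
def count_newline_bytes(data: bytes) -> int:
    """Count raw 0x0A bytes, the way a byte-oriented line counter would."""
    return data.count(b"\n")


# Two real line breaks, plus U+010A (a letter, not a line break).
text = "a\n\u010a\n"

utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")

print(text.count("\n"))               # real line breaks in the text
print(count_newline_bytes(utf8))      # byte-level count on UTF-8
print(count_newline_bytes(utf16le))   # byte-level count on UTF-16

# Splitting on 0x0A cuts UTF-16 code units in half, so the pieces
# are no longer valid UTF-16 on their own.
pieces = utf16le.split(b"\n")
print(pieces)
```

In UTF-8 a 0x0A byte is always a line feed, which is why the byte-level
count matches there and not in UTF-16.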

The point about file calling it data, not text, was just this: you can't
expect to throw UTF-16 through text tools and get a meaningful result.
That's why UTF-8 was created. The only sane thing to do with a UTF-16
file on Unix is to treat it as binary data, just like you would a
word-processor file. (Which are stunningly non-Unix, but coming
nonetheless. Probably for the best, though.)

-- 
David Starner - starner@okstate.edu
"It's not a habit; it's cool; I feel alive. 
If you don't have it you're on the other side." 
- K's Choice (probably referring to the Internet)


