Re: Variations of UTF-16 (was: Re: "UNICODE BOMBER STRIKES AGAIN")

From: Jungshik Shin (jshin@mailaps.org)
Date: Wed Apr 24 2002 - 14:38:28 EDT


On Wed, 24 Apr 2002, David Starner wrote:

> On Wed, Apr 24, 2002 at 09:00:17AM -0700, Doug Ewell wrote:
> > The Unix and Linux world is very
> > opposed to the use of BOM in plain-text files, and if they feel that way
> > about UTF-8 they probably feel the same about UTF-16.

 The reason we're not so fond of UTF-8 with BOM is that it 'breaks' a
lot of time-honored Unix command-line text-processing tools. The simplest
example is concatenating multiple files with 'cat'. If each file starts
with a BOM, 'cat' copies every one of them into the output, so all but the
first end up in the middle of the stream and the following doesn't work
as intended.

  $ cat f1 f2 f3 f4 | sort | uniq | sed '....' > f5

Sure, by typing a couple more commands (wrapping 'cat' in a
'for' loop, for instance), we can work around that,
but ....
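
A minimal sketch of both the problem and a for-loop workaround, assuming
two made-up files f1 and f2 that each begin with a UTF-8 BOM. The file
contents, the temporary directory, and the sed-based stripping are
illustrative assumptions only, not taken from any real setup:

```shell
#!/bin/sh
# Sketch only: f1 and f2 are hypothetical demo files, each starting with a BOM.
LC_ALL=C; export LC_ALL               # byte-wise sorting and matching
bom=$(printf '\357\273\277')          # the three UTF-8 BOM bytes EF BB BF
dir=$(mktemp -d)
printf '%sbanana\napple\n' "$bom" > "$dir/f1"
printf '%sapple\n' "$bom" > "$dir/f2"

# Naive pipeline: the BOM of f2 lands in mid-stream glued to "apple", and the
# BOM of f1 stays glued to "banana", so uniq sees two different "apple" lines.
naive=$(cat "$dir/f1" "$dir/f2" | sort | uniq | wc -l | tr -d ' ')

# Workaround: strip a leading BOM from each file before the rest of the pipe.
fixed=$(for f in "$dir/f1" "$dir/f2"; do
          sed "1s/^$bom//" "$f"
        done | sort | uniq | wc -l | tr -d ' ')

echo "naive: $naive lines, fixed: $fixed lines"
rm -rf "$dir"
```

With these two files the naive pipeline should report 3 distinct lines and
the for-loop version 2, which is the kind of extra typing referred to above.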

> Why? The problems with a BOM in UTF-8 have to do with it being an
> ASCII-compatible encoding. (I'd guess that if there are any Unixes that
> use EBCDIC, the same problems would apply to UTF-EBCDIC.) Pretty much
> the only reason one would use UTF-16 is to be compatible with a foreign
> system, and then you use the conventions of that system.

 I totally agree with you. We don't expect text tools
to work on files in UTF-16 the same way as we would expect them to work
on files in UTF-8 or other ASCII-compatible encodings.

  Jungshik Shin


