Re: Subject: Re: 32'nd bit & UTF-8

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Jan 20 2005 - 11:14:15 CST

Next message: Rick McGowan: "Public Review Issue update"

Previous message: Peter Constable: "RE: UTF-8 'BOM'"
In reply to: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"

On 20/01/2005 15:22, Lars Kristan wrote:

> Mark E. Shoulson wrote:
> > I don't see why UNIX can't bend a little.
> > Just check for '#!' *or* 'BOM#!' when you open a file for execution.
>
> It is not as simple as that. If you use cat to concat files, you get
> lots of BOMs in the middle of the files. Next thing you know, you get
> BOMs in the filenames. More BOMs when you ls the files. Then you get
> BOMs following one another, meaning stripping the first one doesn't
> work any more. Should I go on?
>
> But, yes, perhaps UNIX will indeed need to bend. The question is, at
> which level in the architecture is this supposed to happen.
> Filesystem? Network? Run time library? Each program independently?
>
>
I wonder if this is all a bit of a storm in a teacup. When will the
problem actually occur? It seems to be restricted to UTF-8 files
generated by Windows and perhaps some other systems and read by Unix and
perhaps some other systems. I really don't see how BOMs will end up in
filenames - or does Windows put BOMs in filenames?

If these files received from Windows are actual Unicode text, intended
to be rendered and read by humans, there is almost no ill effect if a
file-initial BOM is misinterpreted as an actual character U+FEFF, zero
width no-break space, or vice versa. I don't think a file-initial ZWNBSP
ever affects rendering, at least in any real practical situation. There
is a slight possibility of an effect if U+FEFF is incorrectly inserted
or dropped when two text files are concatenated, but this can affect
rendering only if the concatenation takes place in the middle of a line
of text, which is rather unlikely to be intended.
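
The concatenation case can be illustrated with a small Python sketch. The two byte strings are made-up sample data standing in for files saved with a UTF-8 BOM, and plain byte concatenation stands in for what cat does. Python's "utf-8-sig" codec removes one leading BOM only, so the second BOM survives as a U+FEFF character inside the text:

```python
import codecs

# Two hypothetical files as a Windows editor might save them:
# each one starts with the three-byte UTF-8 BOM (EF BB BF).
file_a = codecs.BOM_UTF8 + "first line\n".encode("utf-8")
file_b = codecs.BOM_UTF8 + "second line\n".encode("utf-8")

# Byte-wise concatenation, which is what `cat a b` amounts to.
joined = file_a + file_b

# "utf-8-sig" drops a single leading BOM; the BOM that began
# file_b is decoded as an ordinary U+FEFF character.
text = joined.decode("utf-8-sig")
print(repr(text))
```

In this sample the stray U+FEFF lands at the start of a line, which is the case described above as having no practical effect on rendering.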

If these files are written in a programming language, including a Unix
shell script language, which, like most current programming languages,
accepts ASCII characters only, then the three BOM bytes (EF BB BF) are
of course likely to confuse the parsing engine. If such a file is
prepared on Windows for compilation or execution on Unix, it should be
saved in ASCII or ANSI mode and so will not have a BOM. The only
real-life problem here is if there is a need to include UTF-8 literal
strings or comments within the otherwise ASCII-only program.
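
A rough sketch of why the BOM gets in the way, and of the relaxed check Mark suggested ('#!' or 'BOM#!'), might look like the following. The function name has_shebang and its allow_bom parameter are hypothetical and chosen only for illustration; this is not how any particular kernel or shell implements the test.

```python
import codecs

def has_shebang(head: bytes, allow_bom: bool = False) -> bool:
    """Return True if the first bytes of a file look like a '#!' line.

    With allow_bom=True a single leading UTF-8 BOM is skipped first,
    along the lines of the "'#!' or 'BOM#!'" suggestion.
    """
    if allow_bom and head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    return head.startswith(b"#!")

# A made-up script as it might arrive from a Windows editor.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

strict = has_shebang(script)                   # BOM hides the '#!'
relaxed = has_shebang(script, allow_bom=True)  # BOM skipped first
print(strict, relaxed)
```

The strict check fails on the sample because the file begins with EF BB BF rather than with '#!'. The relaxed check accepts it, but it handles only one BOM at the very start of the file.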

Any programming language which is designed to accept UTF-8 tokens etc.
should also be designed to ignore U+FEFF wherever it occurs in the
source file, even in the middle, to allow for concatenations. This would
include Mark's suggestion above in a Unicode-extended shell. I don't see
any practical problems in doing that.
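
As a minimal sketch of that idea, a preprocessing step could simply delete every U+FEFF from the decoded source before tokenizing. The name strip_feff and the sample source text are assumptions made for illustration only. A real implementation would have to decide whether U+FEFF inside string literals should be kept, which this sketch does not attempt.

```python
def strip_feff(source: str) -> str:
    """Remove every U+FEFF from decoded source text, wherever it occurs."""
    return source.replace("\ufeff", "")

# Hypothetical source made by concatenating two BOM-prefixed files.
concatenated = "\ufeffx = 1\n\ufeffy = 2\n"

cleaned = strip_feff(concatenated)
print(repr(cleaned))
```

Because the removal is not limited to the start of the file, the result is the same however many BOM-prefixed pieces were joined together.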

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 11:55:54 CST