Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 17 Jul 2012 20:37:07 +0200

2012/7/17 Julian Bradfield <jcb+unicode_at_inf.ed.ac.uk>:
> On 2012-07-16, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>> I am also convinced that even Shell interpreters on Linux/Unix should
>> recognize and accept the leading BOM before the hash/bang starting
>> line (which is commonly used for filetype identification and runtime
> The kernel doesn't know or care about character sets. It has a little
> knowledge of ASCII (or possibly EBCDIC) hardwired, but otherwise it deals
> with 8-bit bytes. It has no concept of "text file".

Yes, I know. But most tools and scripts should know which type of
file they are operating on. Unfortunately the tools are agnostic as
well, and rely only on things that do not survive transport
protocols, such as filename conventions.

Content signatures are a well-established practice; even the
hash-bang is just one of these many signatures, and I don't see why
the tools that inspect these signatures to determine their behavior
cannot support more of them. The UTF-8 BOM is generic enough, and is
now used in so many contexts or inserted on the fly, that I don't see
the rationale for not accepting it, when the volume of content
carrying it now certainly exceeds the volume of content tagged
internally with a hash-bang for Linux/Unix shells.
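
To make the idea concrete, here is a minimal Python sketch of a tool
that checks for both signatures at once. It is only an illustration:
the function name sniff_signature is hypothetical, and nothing here
reflects how any existing kernel or shell actually behaves.

```python
import codecs

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def sniff_signature(head: bytes) -> dict:
    """Report which leading signatures appear in the first bytes of a file.

    Hypothetical helper: a UTF-8 BOM, when present, is skipped before
    looking for the "#!" hash-bang signature.
    """
    has_bom = head.startswith(UTF8_BOM)
    rest = head[len(UTF8_BOM):] if has_bom else head
    return {"utf8_bom": has_bom, "hashbang": rest.startswith(b"#!")}
```

A caller would pass the first few bytes of the file, for instance
sniff_signature(open(path, "rb").read(5)), and decide from the result
how to treat the remainder.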

> A file to be interpreted by a hashbang could in principle contain
> arbitrary binary stuff, be that text in multiple encodings or just
> binary data. That stuff belongs to the input to the interpreter, not
> to the hashbang line: that line contains a filename which is not
> interpreted in any extended charset.

And why not? You could still use UTF-8 encoded text in the command
line given in this hash-bang line, to supply text parameters or
information, while leaving the rest uninterpreted by the shell and
left to the tool that will be run with the supplied command line. If
the rest of the file is a text script, it can continue to be
interpreted using the same detected UTF-8 encoding, independently of
the user's locale or console settings. Of course this also requires
collaboration from the tool executed from the supplied command line,
but I see no reason why these scripts (and the underlying
filesystems, when running in the newly supplied locale) cannot run
with UTF-8 internally and natively (notably shell interpreters).
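
As a rough sketch of what such a cooperating launcher could do, the
Python below skips an optional BOM, decodes the hash-bang line as
UTF-8, and hands back the interpreter, its argument, and the decoded
script body. The name parse_hashbang is hypothetical, and treating
everything after the first space as a single argument is an
assumption made for this example, not a description of any real
kernel or shell.

```python
import codecs


def parse_hashbang(data: bytes):
    """Return (interpreter, argument, body), or None without a hash-bang.

    Hypothetical helper: a leading UTF-8 BOM is accepted and skipped,
    and both the hash-bang line and the body are decoded as UTF-8.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    if not data.startswith(b"#!"):
        return None
    line, _, body = data.partition(b"\n")
    command = line[2:].decode("utf-8").strip()
    # Assumption: everything after the first space is one argument.
    interpreter, _, argument = command.partition(" ")
    return interpreter, argument.strip(), body.decode("utf-8")
```

With this, a script starting with a BOM and a line such as
"#!/usr/bin/env café" yields the interpreter path and a non-ASCII
argument regardless of the caller's locale.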
Received on Tue Jul 17 2012 - 13:39:19 CDT
