Re: Subject: Re: 32'nd bit & UTF-8

From: Peter Kirk
Date: Thu Jan 20 2005 - 11:14:15 CST

    On 20/01/2005 15:22, Lars Kristan wrote:

    > Mark E. Shoulson wrote:
    > > I don't see why UNIX can't bend a little.
    > > Just check for '#!' *or* 'BOM#!' when you open a file for execution.
    > It is not as simple as that. If you use cat to concat files, you get
    > lots of BOMs in the middle of the files. Next thing you know, you get
    > BOMs in the filenames. More BOMs when you ls the files. Then you get
    > BOMs following one another, meaning stripping the first one doesn't
    > work any more. Should I go on?
    > But, yes, perhaps UNIX will indeed need to bend. The question is, at
    > which level in the architecture is this supposed to happen.
    > Filesystem? Network? Run time library? Each program independently?

    I wonder if this is all a bit of a storm in a teacup. When will the
    problem actually occur? It seems to be restricted to UTF-8 files
    generated by Windows and perhaps some other systems and read by Unix and
    perhaps some other systems. I really don't see how BOMs will end up in
    filenames - or does Windows put BOMs in filenames?

    If these files received from Windows are actual Unicode text, intended
    to be rendered and read by humans, there is almost no ill effect if a
    file-initial BOM is misinterpreted as an actual character U+FEFF, zero
    width no-break space, or vice versa. I don't think a file-initial ZWNBSP
    ever affects rendering, at least in any real practical situation. There
    is a slight possibility of an effect if U+FEFF is incorrectly inserted
    or dropped when two text files are concatenated, but this can affect
    rendering only if the concatenation takes place in the middle of a line
    of text, which is rather unlikely to be intended.
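
The concatenation case can be sketched as follows. This is only an illustration, assuming two small in-memory "files" that each begin with a UTF-8 BOM; the helper names `concat_raw` and `concat_bom_aware` are hypothetical and not part of any tool discussed in this thread.

```python
import codecs

def concat_raw(*chunks: bytes) -> bytes:
    """Byte-wise concatenation, as `cat` does: every BOM is copied verbatim."""
    return b"".join(chunks)

def concat_bom_aware(*chunks: bytes) -> bytes:
    """Drop one leading UTF-8 BOM from each chunk before joining."""
    out = []
    for chunk in chunks:
        if chunk.startswith(codecs.BOM_UTF8):
            chunk = chunk[len(codecs.BOM_UTF8):]
        out.append(chunk)
    return b"".join(out)

# Assumed sample data: two files saved with a leading BOM.
a = codecs.BOM_UTF8 + "first line\n".encode("utf-8")
b = codecs.BOM_UTF8 + "second line\n".encode("utf-8")

# The plain "utf-8" codec keeps U+FEFF as an ordinary character.
raw = concat_raw(a, b).decode("utf-8")
clean = concat_bom_aware(a, b).decode("utf-8")

print(raw.count("\ufeff"))    # 2: one at the start, one at the join
print(clean.count("\ufeff"))  # 0
```

In this sample the second U+FEFF lands at the start of a line, which is the case argued above to be harmless for rendering.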

    If these files are written in a programming language, including a Unix
    shell script language, which, like most current programming languages,
    accepts ASCII characters only, then the three BOM bytes (EF BB BF) are
    of course likely to confuse the parsing engine. If such a file is
    prepared on Windows for compilation or execution on Unix, it should be
    saved in ASCII or ANSI mode and so will not have a BOM. The only
    real-life problem here arises when there is a need to include UTF-8
    literal strings or comments within the otherwise ASCII-only program.
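
A small sketch of why the three BOM bytes confuse an ASCII-only reader. The function `looks_like_script` is a hypothetical stand-in for a loader that only tests for the two bytes `#!`; it is not the actual kernel or shell logic, and the sample script is invented for illustration.

```python
import codecs

script = "#!/bin/sh\necho hello\n"
plain = script.encode("ascii")
# Assumption: the same script as an editor might save it in "UTF-8 with BOM" mode.
with_bom = codecs.BOM_UTF8 + plain

def looks_like_script(data: bytes) -> bool:
    """Hypothetical check that only recognises '#!' as the first two bytes."""
    return data[:2] == b"#!"

print(codecs.BOM_UTF8.hex())        # efbbbf
print(looks_like_script(plain))     # True
print(looks_like_script(with_bom))  # False: the file now starts with EF BB BF
```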

    Any programming language which is designed to accept UTF-8 tokens etc.
    should also be designed to ignore U+FEFF wherever it occurs in the
    source file, even in the middle, to allow for concatenations. This would
    include Mark's suggestion above in a Unicode-extended shell. I don't see
    any practical problems in doing that.
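
The two ideas in this paragraph might look roughly like this. It is a minimal sketch under stated assumptions: `has_shebang` and `strip_feff` are hypothetical names, the first implements only the "'#!' or 'BOM#!'" test quoted above (a single leading BOM), and the second simply drops every U+FEFF before tokenizing.

```python
import codecs

BOM_CHAR = "\ufeff"

def has_shebang(data: bytes) -> bool:
    """Accept '#!' or 'BOM#!' at the start of the file (one leading BOM only)."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.startswith(b"#!")

def strip_feff(source: str) -> str:
    """Ignore U+FEFF wherever it occurs, including in the middle of the text."""
    return source.replace(BOM_CHAR, "")

print(has_shebang(b"#!/bin/sh\n"))                    # True
print(has_shebang(codecs.BOM_UTF8 + b"#!/bin/sh\n"))  # True
print(has_shebang(b"echo hi\n"))                      # False
print(strip_feff("\ufeffx = 1\n\ufeffy = 2\n") == "x = 1\ny = 2\n")  # True
```

One design point to weigh: removing U+FEFF everywhere also removes it from inside string literals, so a real implementation would need to decide whether that is acceptable.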

    Peter Kirk