Re: Subject: Re: 32'nd bit & UTF-8

From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Jan 19 2005 - 19:13:28 CST

    On 19/01/2005 23:51, Hans Aberg wrote:

    > ...
    >
    >Well, isn't that a problem for MS then? BOM's screw up the UNIX platforms,
    >so it is not going to be honored there anyway.
    >
    >

    I don't usually leap to the defence of Microsoft, but I don't see why
    you are insisting here, and repeating yourself in other messages, that
    this is Microsoft's problem and not Unix's. True, use of a BOM with
    UTF-8 is not generally recommended, but it is permitted to disambiguate
    an unmarked character set, which is precisely how Microsoft is using it.
    See the following from the Unicode standard section 15.9, p.401:

    > Although there are never any questions of byte order with UTF-8 text,
    > this sequence [the BOM in UTF-8] can serve as signature for UTF-8
    > encoded text where the character set is unmarked.

    The implication of this is that the BOM signature at the beginning of a
    UTF-8 text stream must be interpreted as a BOM, rather than as the
    character U+FEFF, whenever the stream is not explicitly marked as UTF-8.
    And this of course includes plain text files which may have been
    generated by Notepad or a similar program.

    And that further implies that Unix systems ought to recognise and
    discard the BOM sequence at the start of plain text files. If Unix does
    not do so, it is Unix which is failing to implement Unicode properly,
    not Windows.
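
    As a rough illustration of what "recognise and discard" could mean in
    practice, here is a minimal Python sketch. It assumes the input is
    already known or presumed to be UTF-8; the helper name strip_utf8_bom is
    made up for this example and is not part of any Unix tool or library.
    Only a single leading EF BB BF sequence is removed.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without one leading UTF-8 signature (EF BB BF), if present.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A file as Notepad might save it: signature followed by the text.
with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

print(strip_utf8_bom(with_bom))        # b'hello'
print(strip_utf8_bom(b"hello"))        # b'hello' (unchanged, no signature)
print(with_bom.decode("utf-8-sig"))    # hello
```

    Python's standard "utf-8-sig" codec, used in the last line, drops the
    signature while decoding, so a hand-written helper like the one above is
    only needed when working on raw bytes.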

    > ...
    >
    >>I thought everyone was required to ignore BOM's, as soon as the encoding
    >>has been determined.
    >>
    >>
    >
    >The problem is that UNIX software looks at the first bytes to determine if
    >it is a shell script. This relies on the special property of the original
    >UTF-8 that it is the identity on ASCII data. By requiring a BOM, it no
    >longer has this ASCII compatibility property. ...
    >

    This is a very significant point. Because a BOM may be used with UTF-8,
    UTF-8 is in fact not quite as compatible with ASCII as has been
    presumed. It seems that certain Unix libraries and utilities need to be
    enhanced to ignore an initial BOM as specified by Unicode, and recognise
    as "the first bytes" those immediately following the BOM. You may reply
    that this is not going to happen, but it may have to happen if Unix is
    to support Unicode properly.
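
    To make the idea of treating the bytes after the BOM as "the first
    bytes" concrete, here is a small Python sketch of a script check that
    skips one leading UTF-8 signature before looking for "#!". The function
    name looks_like_script is hypothetical, and this is an assumption about
    how such an enhancement might look, not a description of how any
    existing Unix kernel or shell behaves.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8


def looks_like_script(head: bytes) -> bool:
    """Report whether the data starts with '#!', ignoring one leading BOM.

    Hypothetical helper: 'head' is the first few bytes read from a file.
    """
    if head.startswith(UTF8_SIGNATURE):
        head = head[len(UTF8_SIGNATURE):]
    return head.startswith(b"#!")


print(looks_like_script(b"#!/bin/sh\necho hi\n"))                 # True
print(looks_like_script(b"\xef\xbb\xbf#!/bin/sh\necho hi\n"))     # True
print(looks_like_script(b"echo hi\n"))                            # False
```

    A check that compares only the literal first two bytes against "#!"
    would return False for the second case, which is the incompatibility
    under discussion.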

    >... And lexers that are made for ASCII
    >data will most likely treat a BOM as an error.
    >
    >
    >
    Well, maybe, or maybe as something like "the sequence <i diaeresis,
    guillemet, inverted question mark> “ï » ¿”", to quote the same page of
    the Unicode standard. If so, I'm sorry to say, so much for your old
    program: you need to upgrade to the world of Unicode.
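
    Both outcomes are easy to reproduce. The Python sketch below decodes the
    three signature bytes first as Latin-1 (ISO 8859-1), taken here as a
    stand-in for whatever 8-bit character set an old program might assume,
    and then as strict ASCII. The variable names are only illustrative.

```python
import codecs

signature = codecs.BOM_UTF8  # the bytes EF BB BF

# An 8-bit reader maps each byte to one character.
as_latin1 = signature.decode("latin-1")
print(as_latin1)  # ï»¿

# A strict ASCII reader rejects the very first byte.
try:
    signature.decode("ascii")
    ascii_rejected = False
except UnicodeDecodeError:
    ascii_rejected = True
print(ascii_rejected)  # True
```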

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    


    This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 20:10:19 CST