Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 19:55:44 CST


    At 01:13 +0000 2005/01/20, Peter Kirk wrote:
    >>Well, isn't that a problem for MS then? BOMs screw up the UNIX platforms,
    >>so it is not going to be honored there anyway.

    >I don't usually leap to the defence of Microsoft, but I don't see why
    >you are insisting here, and repeating yourself in other messages, that
    >this is Microsoft's problem and not Unix's. True, use of a BOM with
    >UTF-8 is not generally recommended, but it is permitted to disambiguate
    >an unmarked character set, which is precisely how Microsoft is using it.
    >See the following from the Unicode standard section 15.9, p.401:

    It is just that the UTF-8 BOM is in effect a file encoding format, not a
    character encoding format, and was originally tied to the MS OS. Unicode
    should not promote any specific OS over another. Plain text files do not
    have a BOM, period.

    >> Although there are never any questions of byte order with UTF-8 text,
    >> this sequence [the BOM in UTF-8] can serve as signature for UTF-8
    >> encoded text where the character set is unmarked.
    >
    >The implication of this is that the BOM signature at the beginning of a
    >UTF-8 text stream must be interpreted as a BOM, rather than as the
    >character U+FEFF, whenever the stream is not explicitly marked as UTF-8.
    >And this of course includes plain text files which may have been
    >generated by Notepad or a similar program.
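The rule quoted above can be illustrated with a minimal Python sketch. It assumes the input is an unmarked byte stream that may start with the three signature bytes EF BB BF; the helper name decode_with_signature is hypothetical and used only for illustration.

```python
import codecs

def decode_with_signature(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading signature (EF BB BF) if present.

    Hypothetical helper, for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

# The kind of file a Notepad-like editor might produce: signature, then text.
marked = codecs.BOM_UTF8 + "hello\n".encode("utf-8")

# A plain UTF-8 decode keeps the signature as the character U+FEFF.
plain_decode = marked.decode("utf-8")

# Treating the leading bytes as a signature removes them from the text.
stripped = decode_with_signature(marked)
```

Python's standard "utf-8-sig" codec applies the same rule when decoding, so marked.decode("utf-8-sig") should give the same text as the helper.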

    So then one in effect has to rewrite the whole UNIX operating system in
    order to ensure that, along with UTF-8 compliance. Without the BOM, few
    changes need to be made. There is no gain in using a BOM on a UNIX platform.
    The system is not built up around streams, so in general there is no way to
    know what the marker is. See the problems discussed in
    <http://www.cl.cam.ac.uk/~mgk25/unicode.html> and in other posts here (by
    Marcin 'Qrczak' Kowalczyk).

    >And that further implies that UNIX systems ought to recognize and
    >discard the BOM sequence at the start of plain text files. If UNIX does
    >not do so, it is UNIX which is failing to implement Unicode properly,
    >not Windows.

    Right now, this is so. But clearly implementors of UNIX will not rewrite the
    whole OS just to accommodate a single in-house file format on another
    platform, just as MS would not have rewritten its OS if Unicode had dictated
    that the \r\n combination be illegal in UTF-8 files.

    >>The problem is that UNIX software looks at the first bytes to determine if
    >>it is a shell script. This relies on the special property of the original
    >>UTF-8 that it is the identity on ASCII data. By requiring a BOM, it no
    >>longer has this ASCII compatibility property. ...
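The point about the first bytes can be sketched in Python. This is a simplified model, not actual kernel or shell code: it assumes the check is a byte comparison against the two characters "#!" at offset zero, and the function name looks_like_script is hypothetical.

```python
BOM_UTF8 = b"\xef\xbb\xbf"  # the UTF-8 encoding of U+FEFF

def looks_like_script(head: bytes) -> bool:
    """Model of a byte-level check for '#!' at the very start of a file.

    Hypothetical helper, for illustration only.
    """
    return head[:2] == b"#!"

script = "#!/bin/sh\necho hi\n"

# For pure ASCII text, the ASCII and UTF-8 encodings are the same bytes.
ascii_bytes = script.encode("ascii")
utf8_bytes = script.encode("utf-8")

# The same script with a signature prepended no longer starts with '#!'.
with_bom = BOM_UTF8 + utf8_bytes

ok_without_bom = looks_like_script(utf8_bytes)
ok_with_bom = looks_like_script(with_bom)
```

Under these assumptions the unmarked file passes the check and the marked one does not, although both contain the same characters after decoding.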

    >This is a very significant point. Because a BOM may be used with UTF-8,
    >UTF-8 is in fact not quite as compatible with ASCII as has been
    >presumed.

    Right. If one does not make UTF-8 fully compatible with ASCII this way, one
    can just as well scrap the compatibility with ASCII altogether, and make a
    wholly new, perhaps better, encoding.

    > It seems that certain UNIX libraries and utilities need to be
    >enhanced to ignore an initial BOM as specified by Unicode, and recognize
    >as "the first bytes" those immediately following the BOM. You may reply
    >that this is not going to happen, but it may have to happen if UNIX is
    >to support Unicode properly.

    The catch is that the problem is much deeper than just rewriting some pieces
    of software: one has to go in and alter the well-established behavior of the
    OS itself.

    >>... And lexers that are made for ASCII
    >>data will most likely treat a BOM as an error.

    >Well, maybe, or maybe as something like "the sequence <i diaeresis,
    >guillemet, inverted question mark> ", to quote the same page of
    >the Unicode standard. If so, I'm sorry to say, so much for your old
    >program, you need to upgrade to the world of Unicode.
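Both outcomes mentioned here can be reproduced with a short Python sketch. It assumes the old program reads the bytes either as strict ASCII or as ISO 8859-1 (Latin-1); which of the two applies depends on the program in question.

```python
import unicodedata

bom = b"\xef\xbb\xbf"  # the UTF-8 signature bytes

# Case 1: a strict ASCII reader rejects the bytes outright.
try:
    bom.decode("ascii")
    ascii_rejects = False
except UnicodeDecodeError:
    ascii_rejects = True

# Case 2: a Latin-1 reader accepts them as three ordinary characters.
as_latin1 = bom.decode("latin-1")
names = [unicodedata.name(ch) for ch in as_latin1]
```

Here names should hold the Unicode character names for i with diaeresis, the right-pointing guillemet, and the inverted question mark, matching the sequence quoted from the standard.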

    But UNIX programs should not need to be updated because of an MS in-house
    file format.

      Hans Aberg


