From: Hans Aberg (email@example.com)
Date: Wed Jan 19 2005 - 19:55:44 CST
At 01:13 +0000 2005/01/20, Peter Kirk wrote:
>>Well, isn't that a problem for MS then? BOMs screw up the UNIX platforms,
>>so it is not going to be honored there anyway.
>I don't usually leap to the defence of Microsoft, but I don't see why
>you are insisting here, and repeating yourself in other messages, that
>this is Microsoft's problem and not Unix's. True, use of a BOM with
>UTF-8 is not generally recommended, but it is permitted to disambiguate
>an unmarked character set, which is precisely how Microsoft is using it.
>See the following from the Unicode standard section 15.9, p.401:
It is just that it is in effect a file encoding format, not a character
encoding format, originally tied to the MS OS. Unicode should not promote
any specific OS over another. Plain text files do not have a BOM, period.
>> Although there are never any questions of byte order with UTF-8 text,
>> this sequence [the BOM in UTF-8] can serve as signature for UTF-8
>> encoded text where the character set is unmarked.
>The implication of this is that the BOM signature at the beginning of a
>UTF-8 text stream must be interpreted as a BOM, rather than as the
>character U+FEFF, whenever the stream is not explicitly marked as UTF-8.
>And this of course includes plain text files which may have been
>generated by Notepad or a similar program.
So then one in effect has to rewrite the whole UNIX operating system in
order to ensure that, along with UTF-8 compliance. Without the BOM, few changes
need to be made. There is no gain in using a BOM on a UNIX platform. The
system is not built up around streams, so in general there is no way to know
what the marker is. See the problems discussed in
<http://www.cl.cam.ac.uk/~mgk25/unicode.html> and in other posts here (by
Marcin 'Qrczak' Kowalczyk).
>And that further implies that UNIX systems ought to recognize and
>discard the BOM sequence at the start of plain text files. If UNIX does
>not do so, it is UNIX which is failing to implement Unicode properly,
Right now, this is so. But clearly implementors of UNIX will not rewrite the
whole OS just to accommodate a single in-house file format on another
platform, just as MS would not have rewritten its OS if Unicode had dictated
that the \r\n combination be illegal in UTF-8 files.
>>The problem is that UNIX software looks at the first bytes to determine if
>>it is a shell script. This relies on the special property of the original
>>UTF-8 that it is the identity on ASCII data. By requiring a BOM, it no
>>longer has this ASCII compatibility property. ...
>This is a very significant point. Because a BOM may be used with UTF-8,
>UTF-8 is in fact not quite as compatible with ASCII as has been
Right. If one does not keep UTF-8 fully compatible with ASCII this way, one
can just as well scrap the compatibility with ASCII altogether, and make a
wholly new, perhaps better, encoding.
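
To make the compatibility point concrete, here is a minimal Python sketch. It assumes a simplified model in which a file counts as a script only if its first two bytes are exactly "#!"; the helper looks_like_script is hypothetical and is not any real system interface. Python's "utf-8-sig" codec is used only as a convenient way to produce UTF-8 with a leading BOM.

```python
import codecs


def looks_like_script(first_bytes: bytes) -> bool:
    # Hypothetical stand-in for the "look at the first bytes" check:
    # the file counts as a script only if it begins with "#!".
    return first_bytes[:2] == b"#!"


script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")       # UTF-8 without a BOM
signed = script.encode("utf-8-sig")  # UTF-8 with a leading BOM (EF BB BF)

# Without a BOM, UTF-8 is the identity on ASCII data.
ascii_identical = plain == script.encode("ascii")

# With a BOM, the same text gains three non-ASCII bytes at the front.
bom_prefixed = signed == codecs.BOM_UTF8 + plain

print(looks_like_script(plain))   # the "#!" is found at offset 0
print(looks_like_script(signed))  # the "#!" now sits at offset 3
```

Under these assumptions the BOM-free file passes the check and the BOM-prefixed one does not, even though both contain the same text.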
> It seems that certain UNIX libraries and utilities need to be
>enhanced to ignore an initial BOM as specified by Unicode, and recognize
>as "the first bytes" those immediately following the BOM. You may reply
>that this is not going to happen, but it may have to happen if UNIX is
>to support Unicode properly.
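
For illustration, the kind of enhancement suggested here could look roughly like the following Python sketch. The helper strip_utf8_bom is a hypothetical name, and the sketch assumes the whole input is available as a byte string, which says nothing about how hard it would be to retrofit such a step into existing utilities.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    # Drop one leading UTF-8 signature (EF BB BF), if present.
    # Bytes that look like a BOM anywhere else are left untouched.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"
without_bom = strip_utf8_bom(with_bom)

# After stripping, "the first bytes" are those immediately following the BOM.
first_two = without_bom[:2]

# Python's own "utf-8-sig" codec applies the same rule when decoding.
decoded = with_bom.decode("utf-8-sig")
```

Every program that inspects the start of a file would need an equivalent step for this approach to work.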
The catch is that the problem is much deeper than just rewriting some pieces
of software: one has to go in and alter the well-established behavior of the
>>... And lexers that are made for ASCII
>>data will most likely treat a BOM as an error.
>Well, maybe, or maybe as something like "the sequence <i diaeresis,
>guillemet, inverted question mark> “ï » ¿” ", to quote the same page of
>the Unicode standard. If so, I'm sorry to say, so much for your old
>program, you need to upgrade to the world of Unicode.
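
Both outcomes mentioned here can be reproduced with a small Python sketch. It assumes a hypothetical ASCII-only lexer, modeled simply as "reject any input that is not pure ASCII", and an old program that reads the bytes as Latin-1. Neither is taken from any particular tool.

```python
import codecs

bom = codecs.BOM_UTF8  # the three bytes EF BB BF

# A Latin-1 program sees the BOM as three ordinary characters:
# i diaeresis, right guillemet, inverted question mark.
as_latin1 = bom.decode("latin-1")


def ascii_lexer_accepts(data: bytes) -> bool:
    # Hypothetical ASCII-only lexer: any byte above 0x7F is an error.
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


source = b"x = 1\n"
accepts_plain = ascii_lexer_accepts(source)
accepts_with_bom = ascii_lexer_accepts(bom + source)
```

In this model the plain source is accepted and the BOM-prefixed source is rejected, while the Latin-1 reading yields the three visible characters quoted from the standard.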
But UNIX programs should not need to be updated because of an MS in-house