Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 19:55:44 CST

Next message: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: 32'nd bit & UTF-8"
Maybe in reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"
Next in thread: Peter Kirk: "Re: Subject: Re: 32'nd bit & UTF-8"
Reply: Peter Kirk: "Re: Subject: Re: 32'nd bit & UTF-8"

At 01:13 +0000 2005/01/20, Peter Kirk wrote:
>>Well, isn't that a problem for MS then? BOMs screw up the UNIX platforms,
>>so it is not going to be honored there anyway.

>I don't usually leap to the defence of Microsoft, but I don't see why
>you are insisting here, and repeating yourself in other messages, that
>this is Microsoft's problem and not Unix's. True, use of a BOM with
>UTF-8 is not generally recommended, but it is permitted to disambiguate
>an unmarked character set, which is precisely how Microsoft is using it.
>See the following from the Unicode standard section 15.9, p.401:

It is just that it is in effect a file encoding format, not a character
encoding format, originally tied to the MS OS. Unicode should not promote
any specific OS over another. Plain text files do not have a BOM, period.

>> Although there are never any questions of byte order with UTF-8 text,
>> this sequence [the BOM in UTF-8] can serve as signature for UTF-8
>> encoded text where the character set is unmarked.
>
>The implication of this is that the BOM signature at the beginning of a
>UTF-8 text stream must be interpreted as a BOM, rather than as the
>character U+FEFF, whenever the stream is not explicitly marked as UTF-8.
>And this of course includes plain text files which may have been
>generated by Notepad or a similar program.
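
To make the quoted reading concrete, here is a minimal sketch in Python of treating a leading EF BB BF as a signature rather than as the character U+FEFF. The helper name strip_utf8_signature is hypothetical, chosen for illustration; codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the Python standard library.

```python
import codecs


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A file as Notepad might save it: signature first, then the text.
with_bom = codecs.BOM_UTF8 + "hello\n".encode("utf-8")

# Treated as a signature, the three bytes disappear.
stripped = strip_utf8_signature(with_bom)

# The "utf-8-sig" codec applies the same interpretation while decoding.
decoded_as_signature = with_bom.decode("utf-8-sig")

# The plain "utf-8" codec keeps the bytes as the character U+FEFF.
decoded_as_character = with_bom.decode("utf-8")
```

Under these assumptions, stripped is b"hello\n", decoded_as_signature is "hello\n", and decoded_as_character begins with U+FEFF. Which of the two readings a given tool applies is exactly what is in dispute in this thread.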

So then one in effect has to rewrite the whole UNIX operating system in
order to ensure that, along with UTF-8 compliance. Without the BOM, few
changes need to be made. There is no gain in using a BOM on a UNIX platform. The
system is not built up around streams, so in general there is no way to know
what the marker is. See the problems discussed in
<http://www.cl.cam.ac.uk/~mgk25/unicode.html> and in other posts here (by
Marcin 'Qrczak' Kowalczyk).

>And that further implies that UNIX systems ought to recognize and
>discard the BOM sequence at the start of plain text files. If UNIX does
>not do so, it is UNIX which is failing to implement Unicode properly,
>not Windows.

Right now, this is so. But clearly implementors of UNIX will not rewrite the
whole OS just to accommodate a single in-house file format on another
platform, just as MS would not have rewritten its OS if Unicode had dictated
that the \r\n combination be illegal in UTF-8 files.

>>The problem is that UNIX software looks at the first bytes to determine if
>>it is a shell script. This relies on the special property of the original
>>UTF-8 that it is the identity on ASCII data. By requiring a BOM, it no
>>longer has this ASCII compatibility property. ...

>This is a very significant point. Because a BOM may be used with UTF-8,
>UTF-8 is in fact not quite as compatible with ASCII as has been
>presumed.

Right. If one does not make UTF-8 fully compatible with ASCII this way, one
can just as well scrap the compatibility with ASCII on the whole, and make a
wholly new, perhaps better, encoding.
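
The shell-script point quoted above can be illustrated with a small Python sketch. The function looks_like_script is a hypothetical, simplified stand-in for the "#!" check on the first bytes of a file; it is not how any particular kernel or shell implements it.

```python
import codecs


def looks_like_script(data: bytes) -> bool:
    """Simplified stand-in for the check: a script starts with the bytes '#!'."""
    return data[:2] == b"#!"


script = "#!/bin/sh\necho hello\n"

# UTF-8 without a BOM is the identity on ASCII data.
plain = script.encode("utf-8")
is_identity = plain == script.encode("ascii")

# The same text with a UTF-8 signature in front of it.
signed = codecs.BOM_UTF8 + plain

plain_detected = looks_like_script(plain)
signed_detected = looks_like_script(signed)
```

Here is_identity and plain_detected are True, while signed_detected is False: the first two bytes of the signed file are EF BB, not "#!".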

> It seems that certain UNIX libraries and utilities need to be
>enhanced to ignore an initial BOM as specified by Unicode, and recognize
>as "the first bytes" those immediately following the BOM. You may reply
>that this is not going to happen, but it may have to happen if UNIX is
>to support Unicode properly.

The catch is that the problem is much deeper than just rewriting some pieces
of software: one has to go in and alter the well-established behavior of the
OS itself.

>>... And lexers that are made for ASCII
>>data will most likely treat a BOM as an error.

>Well, maybe, or maybe as something like "the sequence <i diaeresis,
>guillemet, inverted question mark> “ï » ¿” ", to quote the same page of
>the Unicode standard. If so, I'm sorry to say, so much for your old
>program, you need to upgrade to the world of Unicode.
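
Both outcomes mentioned in this exchange can be reproduced with a short Python sketch. The toy lexer ascii_tokens is hypothetical and far simpler than a real lexer; it only shows how an ASCII-only tool might reject the signature bytes, and how a Latin-1 reading turns them into the three characters named in the quotation.

```python
import codecs


def ascii_tokens(data: bytes) -> list:
    """Hypothetical toy lexer: whitespace-separated tokens, ASCII input only."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            raise ValueError(f"non-ASCII byte 0x{byte:02X} at offset {offset}")
    return data.decode("ascii").split()


source = b"x = 1\n"
signed_source = codecs.BOM_UTF8 + source

# Without the signature the lexer works as before.
tokens = ascii_tokens(source)

# With the signature it fails on the very first byte.
try:
    ascii_tokens(signed_source)
    error = None
except ValueError as exc:
    error = str(exc)

# Read as Latin-1, the signature is i diaeresis, guillemet, inverted question mark.
as_latin1 = signed_source.decode("latin-1")[:3]
```

In this sketch tokens is ["x", "=", "1"], error reports byte 0xEF at offset 0, and as_latin1 consists of U+00EF, U+00BB and U+00BF.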

But UNIX programs should not need to be updated because of an MS in-house
file format.

Hans Aberg


This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 19:57:40 CST