Re: Subject: Re: 32'nd bit & UTF-8

From: Peter Kirk (peterkirk@qaya.org)
Date: Wed Jan 19 2005 - 19:13:28 CST

Next message: Peter Constable: "RE: Subject: Re: 32'nd bit & UTF-8"

Previous message: Kenneth Whistler: "Re: 32'nd bit & UTF-8"
In reply to: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Subject: Re: 32'nd bit & UTF-8"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Subject: Re: 32'nd bit & UTF-8"

On 19/01/2005 23:51, Hans Aberg wrote:

> ...
>
>Well, isn't that a problem for MS then? BOMs screw up the UNIX platforms,
>so it is not going to be honored there anyway.
>
>

I don't usually leap to the defence of Microsoft, but I don't see why
you are insisting here, and repeating yourself in other messages, that
this is Microsoft's problem and not Unix's. True, use of a BOM with
UTF-8 is not generally recommended, but it is permitted to disambiguate
an unmarked character set, which is precisely how Microsoft is using it.
See the following from the Unicode standard section 15.9, p.401:

> Although there are never any questions of byte order with UTF-8 text,
> this sequence [the BOM in UTF-8] can serve as signature for UTF-8
> encoded text where the character set is unmarked.

The implication of this is that the BOM signature at the beginning of a
UTF-8 text stream must be interpreted as a BOM, rather than as the
character U+FEFF, whenever the stream is not explicitly marked as UTF-8.
And this of course includes plain text files which may have been
generated by Notepad or a similar program.

And that further implies that Unix systems ought to recognise and
discard the BOM sequence at the start of plain text files. If Unix does
not do so, it is Unix which is failing to implement Unicode properly,
not Windows.
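
To illustrate what "recognise and discard" could look like in practice,
here is a minimal sketch in Python. The function name strip_utf8_bom is
hypothetical, chosen only for this example; codecs.BOM_UTF8 is the
standard library constant for the three bytes EF BB BF. This is one
possible approach under those assumptions, not a description of how any
particular Unix tool behaves.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present.

    Only a signature at the very start is dropped; the same byte sequence
    elsewhere in the data is left alone.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A file as Notepad might save it: signature followed by ASCII text.
with_signature = codecs.BOM_UTF8 + b"hello\n"
plain = strip_utf8_bom(with_signature)
```

Python's "utf-8-sig" codec does the same thing at decode time, so
with_signature.decode("utf-8-sig") gives the text without U+FEFF.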

> ...
>
>>I thought everyone was required to ignore BOM's, as soon as the encoding
>>has been determined.
>>
>>
>
>The problem is that UNIX software looks at the first bytes to determine if
>it is a shell script. This relies on the special property of the original
>UTF-8 that it is the identity on ASCII data. By requiring a BOM, UTF-8 no
>longer has this ASCII compatibility property. ...
>
>

This is a very significant point. Because a BOM may be used with UTF-8,
UTF-8 is in fact not quite as compatible with ASCII as has been
presumed. It seems that certain Unix libraries and utilities need to be
enhanced to ignore an initial BOM as specified by Unicode, and recognise
as "the first bytes" those immediately following the BOM. You may reply
that this is not going to happen, but it may have to happen if Unix is
to support Unicode properly.
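
A rough sketch of the kind of enhancement meant here, in Python: treat
"the first bytes" as those following an optional UTF-8 signature, and
look for the "#!" script marker there. The name looks_like_script is
hypothetical and exists only for this illustration; as far as I know,
actual Unix kernels test the first two bytes literally, so this shows
the proposed behaviour, not the existing one.

```python
import codecs

def looks_like_script(head: bytes) -> bool:
    """Report whether the data starts with '#!' once a UTF-8 BOM is skipped.

    `head` is assumed to hold the first few bytes read from a file.
    """
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    return head.startswith(b"#!")

# A shell script saved with a UTF-8 signature in front of the '#!' line.
saved_with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\necho hi\n"

literal_check = saved_with_bom.startswith(b"#!")    # byte-for-byte test
tolerant_check = looks_like_script(saved_with_bom)  # test after the BOM
```

Here literal_check is False and tolerant_check is True, which is
exactly the difference under discussion.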

>... And lexers that are made for ASCII
>data will most likely treat a BOM as an error.
>
>
>
Well, maybe, or maybe as something like "the sequence <i diaeresis,
guillemet, inverted question mark> “ï » ¿” ", to quote the same page of
the Unicode standard. If so, I'm sorry to say, so much for your old
program: you need to upgrade to the world of Unicode.
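
For anyone who wants to see where that three-character sequence comes
from, a small Python sketch: the UTF-8 signature bytes EF BB BF, read
by a program that assumes a single-byte encoding such as Latin-1, come
out as U+00EF, U+00BB, U+00BF. The variable names are just for
illustration.

```python
import codecs

# The three signature bytes, interpreted one byte per character.
bom_as_latin1 = codecs.BOM_UTF8.decode("latin-1")

# The same three bytes, interpreted as UTF-8, are the single
# character U+FEFF.
bom_as_utf8 = codecs.BOM_UTF8.decode("utf-8")

# Code points of the mis-decoded form, for comparison with the
# names quoted from the standard.
code_points = [f"U+{ord(ch):04X}" for ch in bom_as_latin1]
```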

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 20:10:19 CST