Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 17:51:29 CST

Next message: Hans Aberg: "Re: 32'nd bit & UTF-8"

Previous message: Christopher Fynn: "Re: 32'nd bit & UTF-8"
In reply to: Peter Kirk: "Re: Subject: Re: 32'nd bit & UTF-8"
Next in thread: Peter Kirk: "Re: Subject: Re: 32'nd bit & UTF-8"
Reply: Peter Kirk: "Re: Subject: Re: 32'nd bit & UTF-8"

On 2005/01/19 21:37, Peter Kirk at peterkirk@qaya.org wrote:

>>> Maybe. Nevertheless, they exist, not only as a result of unintelligent
>>> conversion from UTF-16 or UTF-32 to UTF-8, but also because at least one
>>> UTF-8 editor, Notepad on Windows 2000 (and XP?), always emits a BOM at
>>> the start of a UTF-8 file.

>> Well, it seems easier to change that single editor, then. ...

> It's not easy to change a program with an installed base in the hundreds
> of millions worldwide! But I suppose it could be done as part of a
> Windows service pack etc.

It would be strange if MS couldn't provide an upgrade for such a small
software change, especially since it updates all its other software.

> But that assumes that everyone would agree that this change would be a
> good idea. Oliver doesn't, and he makes a good point.

Well, isn't that a problem for MS then? BOMs cause problems on UNIX
platforms, so they are not going to be honored there anyway.

>> ... Or write a program
>> that removes it at need. Note however that most tools will just act on byte
>> streams. If there is a generated lexer involved, if correctly written, it
>> will generate an error for anything that is not correct. On the BOM
>> question, some fellows simply want the BOM's to be ignored.

> I thought everyone was required to ignore BOM's, as soon as the encoding
> has been determined.

The problem is that UNIX software looks at the first bytes of a file to
determine if it is a shell script: the file must begin with the two bytes
"#!". This relies on the special property of the original UTF-8 that it is
the identity on ASCII data. If a BOM is required, UTF-8 no longer has this
ASCII compatibility property. And lexers that are made for ASCII data will
most likely treat a BOM as an error.
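
To make the point concrete, here is a minimal sketch in Python. The helper
names (has_shebang, strip_utf8_bom, is_ascii) are made up for illustration
and are not taken from any existing tool; the only library fact relied on is
that codecs.BOM_UTF8 is the three-byte sequence EF BB BF. It shows that a
leading BOM hides the "#!" from a byte-level check and makes otherwise pure
ASCII data non-ASCII, and that removing it restores both properties.

```python
import codecs

# The UTF-8 encoding of U+FEFF: the bytes EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def has_shebang(data: bytes) -> bool:
    """Byte-level check for a leading '#!' (hypothetical helper)."""
    return data.startswith(b"#!")


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


def is_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


# A plain ASCII shell script, and the same script with a BOM in front.
script = b"#!/bin/sh\necho hello\n"
with_bom = UTF8_BOM + script

print(has_shebang(script), is_ascii(script))
print(has_shebang(with_bom), is_ascii(with_bom))
print(strip_utf8_bom(with_bom) == script)
```

A filter of this kind is what "a program that removes it at need" could look
like; whether tools should strip the BOM silently or report it as an error is
a separate policy question.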

Hans Aberg

This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 17:52:44 CST