Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 14:47:01 CST

Next message: Hans Aberg: "Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: 32'nd bit & UTF-8"
In reply to: Peter Kirk: "Re: Subject: Re: 32'nd bit & UTF-8"
Next in thread: Peter Constable: "RE: Subject: Re: 32'nd bit & UTF-8"

On 2005/01/20 16:30, Peter Kirk at peterkirk@qaya.org wrote:

> Unicode refers
> only to "a stream of coded characters (such as a file)" and "UTF-8
> encoded text".

There is at least an informal notion of a plain text file. And that is UTF-8
without a BOM, I feel sure.

>> It appears that Unicode takes old concepts, alters their definitions,
>> but retains the old names. This is going to create a mess. When introducing
>> new concepts, Unicode should choose new names.

> This is not a new concept. If a file consisting of characters without
> markup etc can be referred to as a "plain text file", and these
> characters are Unicode characters, it is not a new concept to refer to a
> "Unicode plain text file", but a sensible extension of existing terminology.

Yes, but then one cannot have a BOM included in that concept.

>> This would just be one of the problems. The WWW-page quoted other problems.

> Most of which are fixed by my approach. But, as Lars has pointed out,
> the issue is a bit more complex than I realised at first because it
> requires text mode to be specified.

It looks like requiring BOMs will cause a lot of problems on the UNIX
platform. It would be OK to define a Unicode text file format requiring a
BOM. The BOM would then not be required in plain text files or UTF-8 processes.
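
As an illustration of the kind of UNIX problem meant here, the following is a minimal Python sketch, not taken from this thread. It assumes one commonly cited case: tools that identify a script by its first two bytes being "#!" no longer recognise it once the three BOM bytes EF BB BF are put in front. The helper names strip_utf8_bom and starts_with_shebang are hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def starts_with_shebang(data: bytes) -> bool:
    """Check the first two bytes, as tools that look for '#!' do."""
    return data[:2] == b"#!"


# A small shell script, with and without a BOM stamped on the front.
script = b"#!/bin/sh\necho hello\n"
with_bom = codecs.BOM_UTF8 + script

print(starts_with_shebang(script))            # plain file
print(starts_with_shebang(with_bom))          # same file with a BOM
print(strip_utf8_bom(with_bom) == script)     # BOM removed by hand
# Python's "utf-8-sig" codec drops a leading BOM while decoding.
print(with_bom.decode("utf-8-sig") == script.decode("utf-8"))
```

Decoding the BOM-prefixed bytes with the plain "utf-8" codec instead leaves U+FEFF as the first character of the text, which is why BOM-unaware tools see it as content.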

>>> We are not talking about any inhouse file format. We are talking about a
>>> standard file format specified in the Unicode standard.

>> The motivation for introducing it into Unicode was that a single MS text
>> processor used this in order to identify file contents, ...

> It would seem to me extremely surprising that Microsoft products were
> using this file format before it was standardised by Unicode. Is this in
> fact true? Do you have any evidence for it?

I do not know more than what other posters mentioned at the beginning of
this thread. MS uses a little-endian approach, so BOMs were
important to them in the context of UTF-16 or early 16-bit Unicode. Then the
idea was carried over to UTF-8; that seems to be at least a logical sequence
of events.
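
To make the byte order point concrete, here is a small Python sketch, offered as an illustration rather than as anything from the thread. It shows how the BOM character U+FEFF serialises in each encoding: the two UTF-16 forms differ, so the BOM reveals the byte order, while UTF-8 has only one form and the BOM can act only as a signature. The function name guess_utf16_order is hypothetical and assumes its input is already known to be UTF-16.

```python
import codecs

BOM = "\ufeff"  # the byte order mark character

# The same character gives different bytes depending on the encoding.
for name in ("utf-16-le", "utf-16-be", "utf-8"):
    print(name, BOM.encode(name).hex())


def guess_utf16_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from a leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    return "unknown"


print(guess_utf16_order((BOM + "A").encode("utf-16-le")))
print(guess_utf16_order((BOM + "A").encode("utf-16-be")))
print(guess_utf16_order("A".encode("utf-16-be")))  # no BOM present
```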

>> ... something not used
>> on other platforms. So whereas this is clearly a part of the Unicode
>> standard, it is a de facto MS inhouse file format until others start to use
>> it.

> Is it in fact true that this format is not used on any other platforms?
> That would also greatly surprise me!

Posters said originally that it came from an MS text editor that always
stamps BOMs onto files.

> You may be right that the BOM will not help Unicode recognition, but
> there is no doubt that Unicode is being widely recognised in the Unix
> community as well as all other communities. It will very probably become
> dominant within a very few years whether or not there is a BOM - and
> whether or not the text file compatibility problem is fixed.

UTF-8 without BOMs is already taking off. But formally, in the eyes of
Unicode, that is a corrupted UTF-8.

>> Just as I, and others will, oppose the UTF-8 BOM requirement for good
>> reasons.

> If you wish to propose an amendment to the Unicode standard, there are
> proper ways to do that. You might even succeed, at least in having use
> of the BOM with UTF-8 deprecated - and if you do Microsoft might
> reconsider their strategy although I guess they would first oppose your
> proposed change.

Well, who wants to waste time fighting MS? :-)

> But for the moment UTF-8 with BOM is part of the
> standard whether you like it or not.

But whether you like it or not, UTF-8 with BOMs will not be used in the
UNIX world.

>> This is not a quote from me. My mail should be in ASCII, as is a usual
>> requirement of technical lists.

> Who says? This is nowhere specified as a requirement for this list.

I just mention my practice, so others may be informed.

> In another message you wrote:
>
>> Somehow that practice of that
>> particular piece of software has slipped into the Unicode UTF-8 standard.

> Please stop repeating this allegation, at least unless you have some
> proof that Microsoft was using this format before Unicode standardised
> it. Anyway, nothing slips into the Unicode standard; everything has to
> be accepted by the UTC, and Unix experts have had ample opportunity to
> object.

As I mentioned before, this is what other posters said. Go to them for
proof.


This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 14:50:19 CST