From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 14:47:01 CST
On 2005/01/20 16:30, Peter Kirk at peterkirk@qaya.org wrote:
> Unicode refers
> only to "a stream of coded characters (such as a file)" and "UTF-8
> encoded text".
There is at least an informal notion of a plain text file. And that is UTF-8
without a BOM, I feel sure.
>> It appears that Unicode takes old concepts, alters their definitions,
>> but retains the old name. This is going to create mess. When introducing new
>> concepts, Unicode should choose new names.
> This is not a new concept. If a file consisting of characters without
> markup etc can be referred to as a "plain text file", and these
> characters are Unicode characters, it is not a new concept to refer to a
> "Unicode plain text file", but a sensible extension of existing terminology.
Yes, but then one cannot have a BOM included in that concept.
>> This would just be one of the problems. The WWW-page quoted other problems.
> Most of which are fixed by my approach. But, as Lars has pointed out,
> the issue is a bit more complex than I realised at first because it
> requires text mode to be specified.
It looks like requiring BOMs will cause a lot of problems on the UNIX
platform. It would be OK to define a Unicode text file as requiring a BOM.
The BOM would then not be required in plain text files or UTF-8 processes.
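
A minimal sketch of two problems of this kind, written in Python. The sample
contents and variable names are hypothetical, and these two cases are only
assumed to be representative; they are not taken from the page quoted earlier.

```python
# The UTF-8 encoding of U+FEFF, the character used as a BOM.
BOM = b"\xef\xbb\xbf"

# Case 1: a script with a leading BOM no longer begins with the "#!" bytes
# that UNIX kernels look for at the very start of an executable file.
script = b"#!/bin/sh\necho hello\n"
with_bom = BOM + script
starts_with_shebang = with_bom.startswith(b"#!")

# Case 2: byte-level concatenation (what `cat a b` does) leaves the second
# file's BOM in the middle of the stream.
file_a = BOM + b"first\n"
file_b = BOM + b"second\n"
joined = file_a + file_b

# "utf-8-sig" strips one leading BOM only; the second one survives as U+FEFF.
text = joined.decode("utf-8-sig")
has_stray_bom = "\ufeff" in text

print(starts_with_shebang, has_stray_bom)
```

Neither case arises when the files carry no BOM, since the byte streams can
then be passed between tools unchanged.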
>>> We are not talking about any inhouse file format. We are talking about a
>>> standard file format specified in the Unicode standard.
>> The motivation for introducing it into Unicode was that a single MS text
>> processor used this in order to identify file contents, ...
> It would seem to me extremely surprising that Microsoft products were
> using this file format before it was standardised by Unicode. Is this in
> fact true? Do you have any evidence for it?
I do not know more than what other posters mentioned at the beginning of
this thread. MS uses a little-endian approach, so BOMs were important to
them in the context of UTF-16 or early 16-bit Unicode. The idea was then
carried over to UTF-8; that seems to be at least a logical sequence of
events.
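
To illustrate why the mark matters for 16-bit forms but not for UTF-8, here
is a small Python sketch. The helper name sniff_bom is hypothetical, and it
deliberately ignores UTF-32, whose little-endian BOM begins with the same two
bytes as the UTF-16 one.

```python
import codecs
from typing import Optional

# U+FEFF serialized in each form. Only the 16-bit forms differ by byte order.
bom_utf16_le = "\ufeff".encode("utf-16-le")  # bytes FF FE
bom_utf16_be = "\ufeff".encode("utf-16-be")  # bytes FE FF
bom_utf8 = "\ufeff".encode("utf-8")          # bytes EF BB BF, a fixed order

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none.

    Simplification: UTF-32 is not considered.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None

# The same letter "h" in both 16-bit byte orders, each preceded by its BOM.
print(sniff_bom(b"\xff\xfeh\x00"))  # little-endian
print(sniff_bom(b"\xfe\xff\x00h"))  # big-endian
print(sniff_bom(b"hello"))          # no BOM present
```

In UTF-16 the two byte orders give different byte sequences for the same
text, so the mark carries real information. UTF-8 has only one byte order, so
there the mark can serve only as an encoding signature.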
>> ... something not used
>> on other platforms. So whereas this is clearly a part of the Unicode
>> standard, it is a de facto MS inhouse file format until others start to use
>> it.
> Is it in fact true that this format is not used on any other platforms?
> That would also greatly surprise me!
Posters said originally that it came from an MS text editor that always
stamps BOMs onto files.
> You may be right that the BOM will not help Unicode recognition, but
> there is no doubt that Unicode is being widely recognised in the Unix
> community as well as all other communities. It will very probably become
> dominant within a very few years whether or not there is a BOM - and
> whether or not the text file compatibility problem is fixed.
UTF-8 without BOMs is already taking off. But formally, in the eyes of
Unicode, that is corrupted UTF-8.
>> Just as I, and others will, oppose the UTF-8 BOM requirement for good
>> reasons.
> If you wish to propose an amendment to the Unicode standard, there are
> proper ways to do that. You might even succeed, at least in having use
> of the BOM with UTF-8 deprecated - and if you do Microsoft might
> reconsider their strategy although I guess they would first oppose your
> proposed change.
Well, who wants to waste time fighting MS? :-)
> But for the moment UTF-8 with BOM is part of the
> standard whether you like it or not.
But whether you like it or not, UTF-8 with BOMs will not be used in the
UNIX world.
>> This is not a quote from me. My mail should be in ASCII, as is a usual
>> requirement of technical lists.
> Who says? This is nowhere specified as a requirement for this list.
I just mention my practice, so that others may be informed.
> In another message you wrote:
>
>> Somehow that practice of that
>> particular piece of software has slipped into the Unicode UTF-8 standard.
> Please stop repeating this allegation, at least unless you have some
> proof that Microsoft was using this format before Unicode standardised
> it. Anyway, nothing slips into the Unicode standard; everything has to
> be accepted by the UTC, and Unix experts have had ample opportunity to
> object.
As I mentioned before, this is what other posters said. Go to them for
proof.