Re: Subject: Re: 32'nd bit & UTF-8

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Jan 20 2005 - 09:30:11 CST

Next message: Jon Hanna: "RE: Subject: Re: 32'nd bit & UTF-8"

Previous message: Rick McGowan: "Re: UTF-8 'BOM'"
In reply to: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Next in thread: Jon Hanna: "RE: Subject: Re: 32'nd bit & UTF-8"
Reply: Jon Hanna: "RE: Subject: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"

On 20/01/2005 14:14, Hans Aberg wrote:

> ...
>
>>On the contrary, the Unicode standard defines that a BOM should be used
>>at the start of a plain text file under certain circumstances.
>>
>>
>
>This is then a Unicode text file, not a plain text file. Calling it a plain
>text file by Unicode just adds to the confusion.
>
>

The precise terminology here was mine, not Unicode's. Unicode refers
only to "a stream of coded characters (such as a file)" and "UTF-8
encoded text".

>It appears that Unicode takes old concepts, alters the definitions of them,
>but retains the old name. This is going to create a mess. When introducing new
>concepts, Unicode should choose new names.
>
>

This is not a new concept. If a file consisting of characters without
markup etc. can be referred to as a "plain text file", and these
characters are Unicode characters, then referring to a "Unicode plain
text file" is not a new concept but a sensible extension of existing
terminology.

>
>
>>>So then one in effect has to rewrite the whole UNIX operative system, in
>>>order to ensure that and UTF-8 compliance. ...
>>>
>>>
>
>
>
>>Well, hardly the whole OS. One approach, if a system locale is UTF-8, is
>>to rewrite the file handling only so that any file opened in text mode
>>starting with the BOM signature in any of the standard UTFs is converted
>>to BOM-less UTF-8 before being presented to higher levels. The
>>implication of this is that any text data within the system is UTF-8.
>>
>>
>
>This would just be one of the problems. The WWW-page quoted other problems.
>
>

Most of which are fixed by my approach. But, as Lars has pointed out,
the issue is a bit more complex than I realised at first because it
requires text mode to be specified.
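
To make the conversion step concrete, here is a minimal sketch in Python,
assuming the whole file is available as a byte string. The helper name
to_bomless_utf8 is hypothetical, and the sketch does not touch the real
difficulty noted above, namely knowing when a file is opened in text mode.

```python
import codecs

# Check the UTF-32 signatures before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE). A UTF-16 LE file whose first
# character is U+0000 would therefore be misread as UTF-32 LE; this sketch
# accepts that ambiguity.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_bomless_utf8(raw: bytes) -> bytes:
    """Return raw as BOM-less UTF-8 if it starts with a known BOM.

    Data without a BOM is passed through unchanged, since nothing
    identifies its encoding.
    """
    for bom, encoding in _SIGNATURES:
        if raw.startswith(bom):
            # Drop the signature, decode the rest, re-encode as UTF-8.
            return raw[len(bom):].decode(encoding).encode("utf-8")
    return raw
```

With something like this in the file-handling layer, higher levels would
only ever see UTF-8 without a signature, whatever UTF the file was stored in.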

>
>
> ...
>
>>We are not talking about any inhouse file format. We are talking about a
>>standard file format specified in the Unicode standard.
>>
>>
>
>The motivation for introducing it into Unicode was that a single MS text
>processor used this in order to identify file contents, ...
>

It would seem to me extremely surprising that Microsoft products were
using this file format before it was standardised by Unicode. Is this in
fact true? Do you have any evidence for it?

>... something not used
>on other platforms. So whereas this is clearly a part of the Unicode
>standard, it is a de facto MS inhouse file format until others start to use
>it.
>
>

Is it in fact true that this format is not used on any other platforms?
That would also greatly surprise me!

> ...
>
>As standard, Unicode will have to fight for recognition. Introducing things
>like the UTF-8 BOM requirement makes it more difficult for Unicode to earn
>that recognition.
>
>

You may be right that the BOM will not help Unicode recognition, but
there is no doubt that Unicode is being widely recognised in the Unix
community as well as all other communities. It will very probably become
dominant within a very few years whether or not there is a BOM - and
whether or not the text file compatibility problem is fixed.

> ...
>
>Just as I, and others will, oppose the UTF-8 BOM requirement for good
>reasons.
>
>

If you wish to propose an amendment to the Unicode standard, there are
proper ways to do that. You might even succeed, at least in having use
of the BOM with UTF-8 deprecated - and if you do, Microsoft might
reconsider their strategy, although I guess they would first oppose your
proposed change. But for the moment UTF-8 with BOM is part of the
standard, whether you like it or not.

> ...
>
>>>>Well, maybe, or maybe as something like "the sequence <i diaeresis,
>>>>guillemet, inverted question mark> ¹Äú–Ø ¬ª ¬ø¹Äù ", ...
>>>>
>>>>
>>>>
>>I note from this mojibake that your system does not support UTF-8
>>properly even without a BOM.
>>
>>
>
>This is not a quote from me. My mail should be in ASCII, as is a usual
>requirement of technical lists.
>
>

Who says? This is nowhere specified as a requirement for this list.

My posting was correctly encoded and labelled as UTF-8 (not with a BOM
but with "Content-Type: text/plain; charset=UTF-8; format=flowed"), and
this list accepts such postings, although I am aware that some people
cannot read UTF-8 mail. Your response was labelled as "Content-Type:
text/plain; charset=ISO-8859-1". But your mail client, apparently
"User-Agent: Microsoft-Outlook-Express-Macintosh-Edition/5.0.6", which
is, I think, rather old, failed to convert the quoted text from the
incoming encoding to the outgoing encoding; instead it copied the UTF-8
bytes as if they were already ISO-8859-1 bytes. That appears to be a bug
in your mail client, although it just might be a configuration error.
Well, maybe this is evidence for your assertion that Microsoft does not
always implement Unicode correctly!
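
The failure I am describing can be reproduced in a few lines. The following
is only a sketch in Python, not a reconstruction of what Outlook Express
actually does; the function names requote_buggy and requote_correct are
hypothetical.

```python
import codecs

# The UTF-8 signature is the three bytes EF BB BF. Read as ISO-8859-1 they
# become i diaeresis, right guillemet, inverted question mark.
bom_misread = codecs.BOM_UTF8.decode("iso-8859-1")
assert bom_misread == "\u00ef\u00bb\u00bf"


def requote_buggy(incoming: bytes) -> bytes:
    """Copy UTF-8 bytes unchanged into a message labelled ISO-8859-1."""
    return incoming


def requote_correct(incoming: bytes,
                    src: str = "utf-8",
                    dst: str = "iso-8859-1") -> bytes:
    """Decode from the incoming charset, then encode to the outgoing one.

    Characters that the outgoing charset cannot represent are replaced
    with '?' rather than raising an error.
    """
    return incoming.decode(src).encode(dst, errors="replace")
```

A reader who trusts the ISO-8859-1 label sees each multi-byte UTF-8
sequence from requote_buggy as two or three unrelated Latin-1 characters,
whereas the output of requote_correct reads as intended wherever the
character exists in ISO-8859-1.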

...

>>Indeed. But they do need to be updated because of a Unicode standard
>>file format.
>>
>>
>
>As the Unicode standard stands now, yes, in view of that Unicode has adopted
>an MS inhouse file format as a part of its standard. But Unicode should not
>favor a particular platform this way.
>
>
>
In another message you wrote:

>Somehow that practise of that
>particular piece of software has slipped into the Unicode UTF-8 standard.
>

Please stop repeating this allegation, at least unless you have some
proof that Microsoft was using this format before Unicode standardised
it. Anyway, nothing slips into the Unicode standard; everything has to
be accepted by the UTC, and Unix experts have had ample opportunity to
object.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 10:52:12 CST