Re: Subject: Re: 32'nd bit & UTF-8

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Jan 20 2005 - 09:30:11 CST

    On 20/01/2005 14:14, Hans Aberg wrote:

    > ...
    >
    >>On the contrary, the Unicode standard defines that a BOM should be used
    >>at the start of a plain text file under certain circumstances.
    >>
    >>
    >
    >This is then a Unicode text file, not a plain text file. Calling it a plain
    >text file by Unicode just adds to the confusion.
    >
    >

    The precise terminology here was mine, not Unicode's. Unicode refers
    only to "a stream of coded characters (such as a file)" and "UTF-8
    encoded text".

    >It appears that Unicode takes old concepts, alters their definitions,
    >but retains the old names. This is going to create a mess. When introducing new
    >concepts, Unicode should choose new names.
    >
    >

    This is not a new concept. If a file consisting of characters without
    markup etc can be referred to as a "plain text file", and these
    characters are Unicode characters, it is not a new concept to refer to a
    "Unicode plain text file", but a sensible extension of existing terminology.

    >
    >
    >>>So then one in effect has to rewrite the whole UNIX operating system, in
    >>>order to ensure that and UTF-8 compliance. ...
    >>>
    >>>
    >
    >
    >
    >>Well, hardly the whole OS. One approach, if a system locale is UTF-8, is
    >>to rewrite the file handling only so that any file opened in text mode
    >>starting with the BOM signature in any of the standard UTFs is converted
    >>to BOM-less UTF-8 before being presented to higher levels. The
    >>implication of this is that any text data within the system is UTF-8.
    >>
    >>
    >
    >This would just be one of the problems. The WWW-page quoted other problems.
    >
    >

    Most of which are fixed by my approach. But, as Lars has pointed out,
    the issue is a bit more complex than I realised at first because it
    requires text mode to be specified.
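
    The conversion step in the approach quoted above can be sketched
    roughly as follows. This is only an illustration in Python, not
    anything specified by Unicode or by any Unix system: the function name
    to_bomless_utf8 and the idea of working on a whole in-memory byte
    string are assumptions made for the example. A real implementation
    would sit in the text-mode file layer and work on streams.

```python
import codecs

# Signatures to look for, paired with the codec used to decode the rest.
# The 4-byte UTF-32 signatures are checked before the 2-byte UTF-16 ones,
# because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_bomless_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: return the text as UTF-8 without a BOM.

    Data that does not start with a recognised signature is returned
    unchanged, on the assumption that it is already in the locale's
    encoding (UTF-8 in the scenario discussed here).
    """
    for bom, encoding in _SIGNATURES:
        if raw.startswith(bom):
            text = raw[len(bom):].decode(encoding)
            return text.encode("utf-8")
    return raw


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "BOM test".encode("utf-16-le")
    print(to_bomless_utf8(sample))
```

    One known limitation of this sketch: a UTF-16 LE file whose first
    character after the BOM is U+0000 would be mistaken for UTF-32 LE.
    And, as noted, nothing here decides whether a file is text at all;
    the caller still has to ask for text mode.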

    >
    >
    > ...
    >
    >>We are not talking about any inhouse file format. We are talking about a
    >>standard file format specified in the Unicode standard.
    >>
    >>
    >
    >The motivation for introducing it into Unicode was that a single MS text
    >processor used this in order to identify file contents, ...
    >

    It would seem to me extremely surprising that Microsoft products were
    using this file format before it was standardised by Unicode. Is this in
    fact true? Do you have any evidence for it?

    >... something not used
    >on other platforms. So whereas this is clearly a part of the Unicode
    >standard, it is a de facto MS inhouse file format until others start to use
    >it.
    >
    >

    Is it in fact true that this format is not used on any other platforms?
    That would also greatly surprise me!

    > ...
    >
    >As standard, Unicode will have to fight for recognition. Introducing things
    >like the UTF-8 BOM requirement makes it more difficult for Unicode to earn
    >that recognition.
    >
    >

    You may be right that the BOM will not help Unicode recognition, but
    there is no doubt that Unicode is being widely recognised in the Unix
    community as well as all other communities. It will very probably become
    dominant within a very few years whether or not there is a BOM - and
    whether or not the text file compatibility problem is fixed.

    > ...
    >
    >Just as I, and others will, oppose the UTF-8 BOM requirement for good
    >reasons.
    >
    >

    If you wish to propose an amendment to the Unicode standard, there are
    proper ways to do that. You might even succeed, at least in having use
    of the BOM with UTF-8 deprecated. If you do, Microsoft might reconsider
    their strategy, although I guess they would first oppose your proposed
    change. But for the moment UTF-8 with BOM is part of the standard
    whether you like it or not.

    > ...
    >
    >>>>Well, maybe, or maybe as something like "the sequence <i diaeresis,
    >>>>guillemet, inverted question mark> ", ...
    >>>>
    >>>>
    >>>>
    >>I note from this mojibake that your system does not support UTF-8
    >>properly even without a BOM.
    >>
    >>
    >
    >This is not a quote from me. My mail should be in ASCII, as is a usual
    >requirement of technical lists.
    >
    >

    Who says? This is nowhere specified as a requirement for this list.

    My posting was correctly encoded and labelled as UTF-8 (not with a BOM
    but with "Content-Type: text/plain; charset=UTF-8; format=flowed"), and
    this list accepts such postings, although I am aware that some people
    cannot read UTF-8 mail. Your response was labelled as "Content-Type:
    text/plain; charset=ISO-8859-1". Your mail client, apparently
    "User-Agent: Microsoft-Outlook-Express-Macintosh-Edition/5.0.6", which
    I think is rather old, failed to convert the quoted text from the
    incoming encoding to the outgoing encoding; instead it copied the UTF-8
    bytes as if they were already ISO-8859-1 bytes. That appears to be a
    bug in your mail client, although it just might be a configuration
    error. Well, maybe this is evidence for your assertion that Microsoft
    does not always implement Unicode correctly!
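
    The mojibake described above is easy to reproduce. The short Python
    sketch below is purely illustrative (the variable names are made up
    for the example): it reads the three bytes of the UTF-8 BOM as if
    they were ISO-8859-1, then contrasts copying UTF-8 bytes unchanged
    with actually converting them to the outgoing encoding.

```python
import codecs
import unicodedata

# The UTF-8 encoding of U+FEFF (the BOM) is the byte sequence EF BB BF.
bom_bytes = codecs.BOM_UTF8

# Misreading those bytes as ISO-8859-1 yields three separate characters.
misread = bom_bytes.decode("iso-8859-1")
for ch in misread:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# prints:
# U+00EF LATIN SMALL LETTER I WITH DIAERESIS
# U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
# U+00BF INVERTED QUESTION MARK

# The same thing happens to any non-ASCII text in a quoted message.
incoming = "\u00e9".encode("utf-8")  # e with acute accent, bytes C3 A9

# What the mail client should do: decode the incoming encoding, then
# encode to the outgoing one.
converted = incoming.decode("utf-8").encode("iso-8859-1")

# What it apparently did: copy the UTF-8 bytes unchanged into a message
# labelled ISO-8859-1, so a reader sees two characters instead of one.
copied = incoming
seen_by_reader = copied.decode("iso-8859-1")
```

    Here converted is the single byte E9, while seen_by_reader is the
    two-character sequence U+00C3 U+00A9.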

    ...

    >>Indeed. But they do need to be updated because of a Unicode standard
    >>file format.
    >>
    >>
    >
    >As the Unicode standard stands now, yes, given that Unicode has adopted
    >an MS inhouse file format as a part of its standard. But Unicode should not
    >favor a particular platform this way.
    >
    >
    >
    In another message you wrote:

    >Somehow the practice of that
    >particular piece of software has slipped into the Unicode UTF-8 standard.
    >

    Please stop repeating this allegation, at least unless you have some
    proof that Microsoft was using this format before Unicode standardised
    it. Anyway, nothing slips into the Unicode standard; everything has to
    be accepted by the UTC, and Unix experts have had ample opportunity to
    object.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/


    This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 10:52:12 CST