Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (
Date: Thu Jan 20 2005 - 14:47:01 CST

    On 2005/01/20 16:30, Peter Kirk wrote:

    >Unicode refers
    > only to "a stream of coded characters (such as a file)" and "UTF-8
    > encoded text".

    There is at least an informal notion of a plain text file. And that is UTF-8
    without a BOM, I feel sure.

    >> It appears that Unicode takes old concepts, alters their definitions,
    >> but retains the old names. This is going to create a mess. When introducing new
    >> concepts, Unicode should choose new names.

    > This is not a new concept. If a file consisting of characters without
    > markup etc can be referred to as a "plain text file", and these
    > characters are Unicode characters, it is not a new concept to refer to a
    > "Unicode plain text file", but a sensible extension of existing terminology.

    Yes, but then one cannot have a BOM included in that concept.

    >> This would just be one of the problems. The WWW-page quoted other problems.

    > Most of which are fixed by my approach. But, as Lars has pointed out,
    > the issue is a bit more complex than I realised at first because it
    > requires text mode to be specified.

    It looks like requiring BOM's will cause a lot of problems on the UNIX
    platform. It would be OK to define a Unicode text file requiring a BOM. The
    BOM would then not be required in plain text files or UTF-8 processes.
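
    To illustrate the kind of problem a leading BOM can cause for UNIX tools,
    here is a minimal Python sketch. The helper names has_utf8_bom and
    strip_utf8_bom are made up for this illustration; codecs.BOM_UTF8 is the
    standard library constant for the bytes EF BB BF. The sketch assumes the
    usual UNIX convention that a script is recognised by "#!" being its
    first two bytes.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data

script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")        # no BOM
stamped = script.encode("utf-8-sig")  # "utf-8-sig" prepends a BOM on encode

# A BOM pushes "#!" away from the start of the file.
print(plain[:2] == b"#!")
print(stamped[:2] == b"#!")

# Stripping the BOM recovers the plain UTF-8 bytes.
print(strip_utf8_bom(stamped) == plain)
```

    A tool that reads such files could call strip_utf8_bom on input and
    never write a BOM on output, which is roughly the "plain text file"
    behaviour described above.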

    >>> We are not talking about any inhouse file format. We are talking about a
    >>> standard file format specified in the Unicode standard.

    >> The motivation for introducing it into Unicode was that a single MS text
    >> processor used this in order to identify file contents, ...

    > It would seem to me extremely surprising that Microsoft products were
    > using this file format before it was standardised by Unicode. Is this in
    > fact true? Do you have any evidence for it?

    I do not know more than what other posters mentioned at the beginning of
    this thread. MS uses a little-endian approach, so BOM's were
    important to them in the context of UTF-16 or early 16-bit Unicode. Then the
    idea was carried over to UTF-8; that seems to be at least a logical sequence
    of events.
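
    As a rough illustration of why byte order matters for UTF-16 but not for
    UTF-8, consider the following Python sketch. The function name
    decode_utf16_with_bom is hypothetical, and the sketch assumes the input
    is already known to be UTF-16; the codecs.BOM_* constants are from the
    standard library.

```python
import codecs

# The same character has two possible byte layouts in UTF-16 ...
le = "A".encode("utf-16-le")   # little endian: 41 00
be = "A".encode("utf-16-be")   # big endian:    00 41

# ... but only one in UTF-8, where the BOM carries no byte-order information.
utf8 = "A".encode("utf-8")     # 41
bom_as_utf8 = "\ufeff".encode("utf-8")  # U+FEFF encoded in UTF-8: EF BB BF

def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 data, using a leading BOM to choose the byte order.

    Assumes the data is UTF-16; raises ValueError if no BOM is present.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return data[2:].decode("utf-16-be")
    raise ValueError("no BOM: byte order cannot be determined from the data")

print(decode_utf16_with_bom(codecs.BOM_UTF16_LE + le))
print(decode_utf16_with_bom(codecs.BOM_UTF16_BE + be))
```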

    >> ... something not used
    >> on other platforms. So whereas this is clearly a part of the Unicode
    >> standard, it is a de facto MS inhouse file format until others start to use
    >> it.

    > Is it in fact true that this format is not used on any other platforms?
    > That would also greatly surprise me!

    Posters said originally that it came from a MS text editor that always
    stamps BOM's onto files.

    > You may be right that the BOM will not help Unicode recognition, but
    > there is no doubt that Unicode is being widely recognised in the Unix
    > community as well as all other communities. It will very probably become
    > dominant within a very few years whether or not there is a BOM - and
    > whether or not the text file compatibility problem is fixed.

    The UTF-8 without BOM's is already taking off. But formally, in the eyes of
    Unicode, that is a corrupted UTF-8.

    >> Just as I, and others will, oppose the UTF-8 BOM requirement for good
    >> reasons.

    > If you wish to propose an amendment to the Unicode standard, there are
    > proper ways to do that. You might even succeed, at least in having use
    > of the BOM with UTF-8 deprecated - and if you do Microsoft might
    > reconsider their strategy although I guess they would first oppose your
    > proposed change.

    Well, who wants to waste time fighting MS? :-)

    >But for the moment UTF-8 with BOM is part of the
    > standard whether you like it or not.

    But whether you like it or not, UTF-8 with BOM's will not be used in the
    UNIX world.

    >> This is not a quote from me. My mail should be in ASCII, as is a usual
    >> requirement of technical lists.

    > Who says? This is nowhere specified as a requirement for this list.

    I just mention my practice, so others may be informed.

    > In another message you wrote:
    >> Somehow that practise of that
    >> particular piece of software has slipped into the Unicode UTF-8 standard.

    > Please stop repeating this allegation, at least unless you have some
    > proof that Microsoft was using this format before Unicode standardised
    > it. Anyway, nothing slips into the Unicode standard; everything has to
    > be accepted by the UTC, and Unix experts have had ample opportunity to
    > object.

    As I mentioned before, this is what other posters said. Go to them for
