Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 20:56:15 CST

    On 2005/01/21 02:15, Peter Kirk at peterkirk@qaya.org wrote:

    > On 20/01/2005 20:46, Hans Aberg wrote:

    >> If you know the correct answer of these things, why don't you enlighten
    >> these other posters so that this discussion terminates? After all, requiring
    >> BOM's in UTF-8 data is really stupid, so it must be interesting to get to
    >> know what moron introduced it.

    > I agree, Hans. This requirement would be really stupid.

    I formulated that poorly: the UTF-8 BOM "requirement" is evidently the
    requirement that a process drop a BOM when it encounters one at the
    beginning of a file, not that files must have a BOM. But this, too, will
    screw up UNIXes, so the distinction does not matter for the subject at hand.

    >But you are the
    > person who introduced it.

    No, I have not introduced it. My discussion was always conditional on the
    accuracy of what others claim.

    >In the light of this perhaps you might like to
    > reconsider the word "moron".

    So that formulation seems appropriate, though strong.

    > One minute later, Hans Aberg wrote:
    >
    >> There is at least an informal notion of a plain text file. And that is UTF-8
    >> without a BOM, I feel sure.

    > Well, however sure you are, you are wrong. The Unicode standard
    > specifies that a BOM may optionally appear at the start of a string of
    > UTF-8 characters whose encoding is not otherwise specified.

    So this is hardly the informal notion of a plain text file.

    > (Note
    > carefully the "optionally".) This must include the start of a plain text
    > file, at least where as in the Unix world there is no out-of-band
    > information about its encoding.

    It does not help that it is optional: if it appears, the program must know
    how to ignore it and look further down the stream. This is really what
    causes the problem.
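
To make "ignore it, and look further down the stream" concrete, here is a minimal sketch in Python of what a BOM-tolerant reader has to do before it sees the real content. The helper name `strip_utf8_bom` is hypothetical and not taken from any post in this thread; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + "hello\n".encode("utf-8")
without_bom = "hello\n".encode("utf-8")

# Both inputs yield the same payload once the optional BOM is skipped.
print(strip_utf8_bom(with_bom) == strip_utf8_bom(without_bom))

# The standard-library "utf-8-sig" codec performs the same skip while decoding.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8"))
```

The point of the sketch is only that every consumer of the byte stream needs such a step; a tool that treats its input as opaque bytes has no natural place for it.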

    >> Posters said originally that it came from a MS text editor that always
    >> stamps BOM's onto files.

    > I think you are again misunderstanding something I wrote. I was I think
    > the first to mention that a MS text editor emits BOMs at the start of
    > UTF-8 files. But neither I nor I think anyone else except you has said
    > that this format was originated by MS. I suggested the opposite, that MS
    > took this format from the standard as it already existed.

    OK. Let's leave this issue to the interested historian.

    >> The UTF-8 without BOM's is already taking off. But formally, in the eyes of
    >> Unicode, that is a corrupted UTF-8.

    > Not true. Read the standard. Or just read the extract I quoted to you.
    > Or the extracts which Ken has just posted.

    If UNIX can run all its files without BOMs and call them UTF-8 text files,
    then that part is a non-issue. And if UNIX programs additionally do not have
    to recognize a BOM at the beginning of a file and ignore it, then there is no
    BOM issue at all. But this part is somewhat unclear right now.

    > Please tell me who wrote such lies and where. Look in the archives of
    > this list. I think what has really happened is that in your enthusiasm
    > to reply to every posting on this list you haven't bothered to read
    > properly and understand what you are replying to. You have made
    > something like 40 postings in 24 hours. Is this a record?
    >
    > In another message, Hans Aberg wrote, replying to Rick McGowan:
    >
    >>> Hmmm... I don't recall that the Unicode Standard ever specifies that the
    >>> Byte Order Mark is *required* to be used anywhere for any purpose. Can you
    >>> point me to the place in the standard where this is stated?

    >> Several posters have claimed that, most recently Arcane Jill.

    > No, she did not, she wrote precisely the opposite with special emphasis:
    >
    >> Unicode does NOT require that all UTF-8 text files must begin with a BOM

    I have been sloppy in my posts. Her full quote is:

    >Unicode does NOT require that all UTF-8 text files must begin with a BOM; it
    >only requires that conformant processes can recognize and handle the BOM
    >character /if/ it should be found.

    My statement quoted above should have read:

    The UTF-8 requirement that processes ignore the BOM.

    The problem is that UNIX processes cannot handle this, and trying to make
    them handle it would screw up the way they work.
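
Two commonly cited cases of this kind can be sketched in Python. This is an illustration under stated assumptions, not a description of any particular UNIX tool: `looks_like_script` mimics a check of the first two bytes for `#!`, and `concatenate` mimics byte-wise joining in the manner of `cat`. Both function names are hypothetical.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

def looks_like_script(data: bytes) -> bool:
    """Mimic a loader that checks whether the file starts with '#!'."""
    return data[:2] == b"#!"

def concatenate(*parts: bytes) -> bytes:
    """Mimic byte-wise concatenation, which knows nothing about encodings."""
    return b"".join(parts)

script = b"#!/bin/sh\necho hello\n"

# Without a BOM the '#!' line is found; with a BOM it is hidden behind EF BB BF.
print(looks_like_script(script))
print(looks_like_script(BOM + script))

# Joining two BOM-prefixed files leaves a second BOM in the middle of the
# result, where "drop the BOM at the beginning of the file" no longer applies.
joined = concatenate(BOM + b"a\n", BOM + b"b\n")
print(joined.count(BOM))
```

In both cases the tool works on raw bytes, so teaching it to skip a BOM means teaching it about one specific text encoding.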

    So the UNIX processes are not UTF-8 conformant, and cannot easily be made
    so. Do you agree now?

      Hans Aberg


