Re: Subject: Re: 32'nd bit & UTF-8

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Jan 20 2005 - 19:15:33 CST

    On 20/01/2005 20:46, Hans Aberg wrote:

    > ...
    >
    >It is not my claim, but some posters originally said that the reason for
    >requiring the BOM in UTF-8 processes was a MS text editor that always stamped
    >BOM's onto UTF-8 files.
    >
    >If you know the correct answer of these things, why don't you enlighten
    >these other posters so that this discussion terminates? After all, requiring
    >BOM's in UTF-8 data is really stupid, so it must be interesting to get to
    >know what moron introduced it.
    >

    I agree, Hans. This requirement would be really stupid. But you are the
    person who introduced it. In the light of this perhaps you might like to
    reconsider the word "moron".

    Possibly you imagined this requirement because you misunderstood what I
    wrote. I wrote something to the effect that a process reading a UTF-8
    stream was obliged to recognise a BOM as such, and not as an ordinary
    U+FEFF character, because that is what the Unicode standard seems to
    say, although it leaves some room for interpretation. I never suggested
    that a UTF-8 stream was required to start with a BOM, as this is clearly
    untrue. In fact the Unicode standard explicitly recommends against this
    Microsoft practice in most circumstances.
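
    To make the distinction concrete, here is a minimal sketch of a reader
    that follows the interpretation described above: a BOM at the very
    start of the stream is treated as a signature and dropped, while U+FEFF
    anywhere else is left in the text. The function name
    decode_utf8_stream is only an illustration, not something taken from
    the standard.

```python
import codecs


def decode_utf8_stream(data: bytes) -> str:
    """Decode UTF-8 bytes, treating a leading BOM as a signature, not as text."""
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Any later U+FEFF is ordinary content and survives decoding.
    return data.decode("utf-8")


with_bom = codecs.BOM_UTF8 + "abc".encode("utf-8")
without_bom = "abc".encode("utf-8")
interior = "a\ufeffb".encode("utf-8")

print(repr(decode_utf8_stream(with_bom)))
print(repr(decode_utf8_stream(without_bom)))
print(repr(decode_utf8_stream(interior)))
```

    Note that the reader accepts input with or without the BOM; nothing
    here requires the stream to start with one.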

    One minute later, Hans Aberg wrote:

    >There is at least an informal notion of a plain text file. And that is UTF-8
    >without a BOM, I feel sure.
    >

    Well, however sure you are, you are wrong. The Unicode standard
    specifies that a BOM may optionally appear at the start of a string of
    UTF-8 characters whose encoding is not otherwise specified. (Note
    carefully the "optionally".) This must include the start of a plain text
    file, at least where, as in the Unix world, there is no out-of-band
    information about its encoding.
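
    As an illustration of "optionally", the sketch below builds the same
    plain text file contents with and without a BOM and shows that both
    decode to the same text. It relies on Python's "utf-8-sig" codec, which
    writes a BOM when encoding and skips one, if present, when decoding.
    The helper has_utf8_bom is a made-up name for this example.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Report whether the bytes begin with the UTF-8 encoded BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


text = "plain text\n"
plain = text.encode("utf-8")        # no BOM
signed = text.encode("utf-8-sig")   # BOM followed by the same bytes

print(has_utf8_bom(plain), has_utf8_bom(signed))

# "utf-8-sig" accepts both forms when decoding.
decoded_plain = plain.decode("utf-8-sig")
decoded_signed = signed.decode("utf-8-sig")
print(decoded_plain == decoded_signed == text)
```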

    ...

    >Posters said originally that it came from a MS text editor that always
    >stamps BOM's onto files.
    >

    I think you are again misunderstanding something I wrote. I was, I think,
    the first to mention that a MS text editor emits BOMs at the start of
    UTF-8 files. But neither I nor, I think, anyone else except you has said
    that this format originated with MS. I suggested the opposite: that MS
    took this format from the standard as it already existed.

    ...

    >The UTF-8 without BOM's is already taking off. But formally, in the eyes of
    >Unicode, that is a corrupted UTF-8.
    >

    Not true. Read the standard. Or just read the extract I quoted to you.
    Or the extracts which Ken has just posted.

    ...

    >As I mentioned before, this is what other posters said. Go to them for
    >proof.
    >

    Please tell me who wrote such lies and where. Look in the archives of
    this list. I think what has really happened is that in your enthusiasm
    to reply to every posting on this list you haven't bothered to read
    properly and understand what you are replying to. You have made
    something like 40 postings in 24 hours. Is this a record?

    In another message, Hans Aberg wrote, replying to Rick McGowan:

    >> Hmmm... I don't recall that the Unicode Standard ever specifies that the
    >> Byte Order Mark is *required* to be used anywhere for any purpose. Can you
    >> point me to the place in the standard where this is stated?
    >
    >Several posters have claimed that, most recently Arcane Jill.
    >

    No, she did not, she wrote precisely the opposite with special emphasis:

    > Unicode does NOT require that all UTF-8 text files must begin with a BOM

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/


    This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 20:12:02 CST