RE: Subject: Re: 32'nd bit & UTF-8

From: Oliver Christ (
Date: Wed Jan 19 2005 - 15:33:48 CST


    Marcin Kowalczyk wrote:

    > The problem with BOM in UTF8 is that it must be specially
    > handled by all applications. It effectively turns UTF-8 into
    > a stateful encoding where the beginning of a "text stream"
    > must be treated specially.

    The same is true of any other BOM, or of an encoding declaration in
    HTML's META element (which is much worse, since you have to read a fair
    amount of content before you know which encoding to read it in).

    I don't see a big difference between the UTF-16 BOMs and the UTF-8 one.
    They all signal that the file's encoding is Unicode, and specify which
    "variant" is actually used.
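
To illustrate the point, here is a small Python sketch showing that each BOM is the same character, U+FEFF, serialized in a different encoding form. The helper name `bom_bytes` is made up for this example.

```python
def bom_bytes(encoding: str) -> bytes:
    """Return the byte sequence U+FEFF takes in the given encoding.

    Hypothetical helper for illustration; the encoding names are
    Python codec names with explicit byte order, so no extra BOM
    is added by the codec itself.
    """
    return "\ufeff".encode(encoding)


if __name__ == "__main__":
    for enc in ("utf-8", "utf-16-be", "utf-16-le"):
        # Print the encoding name followed by the BOM bytes in hex.
        print(enc, bom_bytes(enc).hex(" "))
```

For UTF-16 the two byte orders give FE FF and FF FE; for UTF-8 the same character comes out as EF BB BF.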

    It should also be relatively simple to pipe any input through e.g. GNU
    recode to normalize the encoding to UTF-16 or whatever, so that only one
    module (the recoder) needs to be aware of BOMs (and/or "sniffing"
    heuristics). The stream models in Java and .NET implement exactly that.
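
A rough sketch of what such a single BOM-aware module might look like, in Python. The function name `decode_with_bom` and the UTF-8 fallback for BOM-less input are assumptions for illustration; this is not how recode, Java or .NET actually implement it.

```python
import codecs

# BOM signatures, with the UTF-32 ones first: the UTF-32 LE BOM
# (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE), so the longer
# signature has to be tested before the shorter one.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: strip a leading BOM, if any, and decode.

    Everything downstream receives plain text and never sees the BOM.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # No BOM found: fall back to an assumed default encoding.
    # A real recoder would apply its "sniffing" heuristics here.
    return data.decode(default)
```

One caveat with this ordering: UTF-16 LE text whose first character after the BOM is U+0000 would be taken for UTF-32 LE. That is rare in practice, but it shows why the detection is best kept in one place.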

    Hans Aberg added:

    > It is clear that the use of a BOM in UTF-8 should properly be
    > viewed as a file format, and not a character encoding format.

    That's not clear to me. I find UTF-8 BOMs at the beginning of e.g. an
    .html or .csv file pretty useful, just as useful as { 0xFE 0xFF } or
    { 0xFF 0xFE } at the beginning of a file. I don't think it would help if
    'file' reported such files as "UTF-8 encoded text written by Notepad
    or .NET". But maybe I misunderstood your comment.

    Cheers, Oli
