PRODUCING and DESCRIBING UTF-8 with and without BOM

From: Joseph Boyle (Boyle@siebel.com)
Date: Mon Nov 04 2002 - 09:46:32 EST

    Thanks for the dozens of responses discussing consumers' behavior on UTF-8
    BOM. This is actually not what I'm concerned with, as I have to take it as a
    given that there is both software that wants UTF-8 BOM and software that
    doesn't want it.

    Could we evaluate the need for separate identifiers for producing or
    describing UTF-8 with and without BOM, or viable alternatives, for use as
    control input to a file encoding converter or encoding checker program?
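
To make the request concrete, here is a minimal sketch of how two separate identifiers might drive a converter (producing) and a checker (describing). The label strings "UTF-8-withBOM" and "UTF-8-withoutBOM" and the function names produce and describe are hypothetical, chosen only for illustration; they are not registered charset names. The sketch assumes Python, whose built-in codecs "utf-8" and "utf-8-sig" happen to draw the same distinction.

```python
import codecs

# Hypothetical converter labels mapped onto Python's built-in codec names.
# The labels themselves are an assumption for illustration only.
LABELS = {
    "UTF-8-withoutBOM": "utf-8",    # encoder never writes a signature
    "UTF-8-withBOM": "utf-8-sig",   # encoder writes EF BB BF before the text
}


def produce(text: str, label: str) -> bytes:
    """Encode text as requested by one of the hypothetical labels."""
    return text.encode(LABELS[label])


def describe(data: bytes) -> str:
    """Return the label matching the data, judged only by its first three bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-withBOM"
    return "UTF-8-withoutBOM"
```

Note that describe can only report whether the bytes EF BB BF are present; it cannot say whether they were meant as a signature or as content.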

    Thanks, Joseph

    -----Original Message-----
    From: Mark Davis [mailto:mark.davis@jtcsv.com]
    Sent: Sunday, November 03, 2002 12:25 PM
    To: Doug Ewell; Unicode Mailing List
    Cc: Murray Sargent; Joseph Boyle
    Subject: Re: Names for UTF-8 with and without BOM

    Little probability that right double quote would appear at the start of a
    document either. Doesn't mean that you are free to delete it (*and* say that
    you are not modifying the contents).

    I agree that when the UTC decides that a BOM is *only* to be used as a
    signature, and that it would be ok to delete it anywhere in a document (like
    a non-character), then we are in much better shape. This was, as a matter of
    fact, proposed for 3.2, but not approved. If we did that for 4.0, then there
    would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
    'withoutBOM'.

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Doug Ewell" <dewell@adelphia.net>
    To: "Unicode Mailing List" <unicode@unicode.org>
    Cc: "Mark Davis" <mark.davis@jtcsv.com>; "Murray Sargent"
    <murrays@exchange.microsoft.com>; "Joseph Boyle" <Boyle@siebel.com>
    Sent: Saturday, November 02, 2002 13:27
    Subject: Re: Names for UTF-8 with and without BOM

    > Mark Davis <mark dot davis at jtcsv dot com> wrote:
    >
    > > That is not sufficient. The first three bytes could represent a real
    > > content character, ZWNBSP or they could be a BOM. The label doesn't
    > > tell you.
    >
    > I have never understood under what circumstances a ZWNBSP would ever
    > appear as the first character of a file. It wouldn't make any sense.
    > A ZWNBSP prevents a word break between the preceding and following
    > characters. If there *is* no preceding character, then what is the
    > point of the ZWNBSP?
    >
    > Every time this topic comes up, I have asked why a true ZWNBSP would
    > ever appear as the first character of a file. The only responses I've
    > heard are:
    >
    > 1. It might not be a discrete file, but the second (or successive)
    > piece of a file that was split up for some reason (transmission,
    > etc.).
    >
    > In that case, the interpreting process should take its encoding cue
    > from the first fragment, and should NEVER reinterpret fragments broken
    > up at arbitrary points. (Imagine a process modifying a GIF or JPEG
    > file, or converting CR/LF, based on fragments!) But this is not the
    > point being discussed anyway; the point is whole files.
    >
    > 2. It could happen; Unicode allows any character to appear anywhere.
    >
    > Well, almost anywhere. But even so, the likelihood of a U+FEFF as
    > ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly
    > small compared to the likelihood that the U+FEFF was intended to be a
    > signature. The rare case is just too rare to invalidate the heuristic
    > for the much more common case.
    >
    > In addition, as Michka points out, we now have U+2060 WORD JOINER,
    > whose entire purpose in life is to be used as U+FEFF was formerly
    > used, as a ZWNBSP. Any new Unicode text should use U+2060 and not
    > U+FEFF as a word joiner. It's hard to imagine that UTC and WG2 would
    > have standardized this if there was a lot of real-world text that used
    > U+FEFF as ZWNBSP.
    >
    > -Doug Ewell
    > Fullerton, California
    >
    >
    >
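
The heuristic argued for in the quoted message (treat a leading U+FEFF in a whole UTF-8 file as a signature, and use U+2060 for new word-joining text) could be sketched roughly as follows. This is an illustration only, assuming Python; the function names split_signature and modernize_joiners are hypothetical. It also shows the objection raised earlier in the thread: once the first three bytes are stripped, a genuine leading ZWNBSP is indistinguishable from a BOM.

```python
import codecs
from typing import Tuple

ZWNBSP = "\ufeff"       # U+FEFF, doubles as the byte order mark
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER


def split_signature(data: bytes) -> Tuple[bool, str]:
    """Treat a leading EF BB BF as a signature and strip it (heuristic).

    Only the first occurrence is removed; any later U+FEFF stays in the text.
    """
    has_signature = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_signature else data
    return has_signature, body.decode("utf-8")


def modernize_joiners(text: str) -> str:
    """Replace every remaining U+FEFF in already-decoded text with U+2060."""
    return text.replace(ZWNBSP, WORD_JOINER)
```

A checker built this way reports the signature separately from the content, so a caller that needs byte-for-byte fidelity can still put it back.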


