Re: Names for UTF-8 with and without BOM

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Nov 02 2002 - 16:27:00 EST


    Mark Davis <mark dot davis at jtcsv dot com> wrote:

    > That is not sufficient. The first three bytes could represent a real
    > content character, ZWNBSP or they could be a BOM. The label doesn't
    > tell you.

    I have never understood under what circumstances a ZWNBSP would ever
    appear as the first character of a file. It wouldn't make any sense. A
    ZWNBSP prevents a word break between the preceding and following
    characters. If there *is* no preceding character, then what is the
    point of the ZWNBSP?

    Every time this topic comes up, I have asked why a true ZWNBSP would
    ever appear as the first character of a file. The only responses I've
    heard are:

    1. It might not be a discrete file, but the second (or successive)
    piece of a file that was split up for some reason (transmission, etc.).

    In that case, the interpreting process should take its encoding cue from
    the first fragment, and should NEVER reinterpret fragments broken up at
    arbitrary points. (Imagine a process modifying a GIF or JPEG file, or
    converting CR/LF, based on fragments!) But this is not the point being
    discussed anyway; the point is whole files.

    2. It could happen; Unicode allows any character to appear anywhere.

    Well, almost anywhere. But even so, the likelihood of a U+FEFF as
    ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly
    small compared to the likelihood that the U+FEFF was intended to be a
    signature. The rare case is just too rare to invalidate the heuristic
    for the much more common case.
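
    As a rough sketch of that heuristic (not code from this thread, and the
    helper names are made up for illustration), a reader of whole files could
    treat a leading EF BB BF as a signature and strip it, while a reader of
    fragments sniffs only the first piece. The second function relies on
    Python's standard "utf-8-sig" incremental decoder, which checks for the
    signature only at the very start of the stream:

```python
import codecs
from typing import Iterable, Tuple

UTF8_SIGNATURE = codecs.BOM_UTF8  # b"\xef\xbb\xbf", i.e. U+FEFF encoded in UTF-8


def decode_utf8_file(data: bytes) -> Tuple[str, bool]:
    """Decode a whole UTF-8 file, treating a leading U+FEFF as a signature.

    Returns (text, had_signature). Hypothetical helper for illustration.
    """
    had_signature = data.startswith(UTF8_SIGNATURE)
    if had_signature:
        data = data[len(UTF8_SIGNATURE):]
    return data.decode("utf-8"), had_signature


def decode_fragments(fragments: Iterable[bytes]) -> str:
    """Decode successive pieces of one file; only the first piece is sniffed."""
    decoder = codecs.getincrementaldecoder("utf-8-sig")()
    pieces = [decoder.decode(fragment) for fragment in fragments]
    pieces.append(decoder.decode(b"", final=True))  # flush any buffered bytes
    return "".join(pieces)
```

    Under this sketch, a U+FEFF at the start of a later fragment is kept as
    an ordinary character, because the encoding cue was already taken from
    the first fragment.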

    In addition, as Michka points out, we now have U+2060 WORD JOINER, whose
    entire purpose in life is to be used as U+FEFF was formerly used, as a
    ZWNBSP. Any new Unicode text should use U+2060 and not U+FEFF as a word
    joiner. It's hard to imagine that UTC and WG2 would have standardized
    this if there were a lot of real-world text that used U+FEFF as ZWNBSP.
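
    For anyone migrating older text, a minimal sketch along these lines might
    do, assuming a leading U+FEFF is a signature and any later U+FEFF was
    meant as a word joiner. The function name is hypothetical and the policy
    is only one possible reading of the advice above:

```python
ZWNBSP = "\ufeff"       # ZERO WIDTH NO-BREAK SPACE, also used as the signature
WORD_JOINER = "\u2060"  # WORD JOINER


def modernize_word_joiners(text: str) -> str:
    """Drop a leading U+FEFF as a signature; turn later ones into U+2060."""
    if text.startswith(ZWNBSP):
        text = text[1:]  # assumption: a leading U+FEFF is a signature
    return text.replace(ZWNBSP, WORD_JOINER)
```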

    -Doug Ewell
     Fullerton, California


