Re: Names for UTF-8 with and without BOM

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Nov 02 2002 - 16:27:00 EST

Next message: John Cowan: "Re: Header Reply-To"

Previous message: Doug Ewell: "Re: Names for UTF-8 with and without BOM"
In reply to: Mark Davis: "Re: Names for UTF-8 with and without BOM"
Next in thread: Mark Davis: "Re: Names for UTF-8 with and without BOM"
Reply: Mark Davis: "Re: Names for UTF-8 with and without BOM"

Mark Davis <mark dot davis at jtcsv dot com> wrote:

> That is not sufficient. The first three bytes could represent a real
> content character, ZWNBSP, or they could be a BOM. The label doesn't
> tell you.

I have never understood under what circumstances a ZWNBSP would ever
appear as the first character of a file. It wouldn't make any sense. A
ZWNBSP prevents a word break between the preceding and following
characters. If there *is* no preceding character, then what is the
point of the ZWNBSP?

Every time this topic comes up, I have asked why a true ZWNBSP would
ever appear as the first character of a file. The only responses I've
heard are:

1. It might not be a discrete file, but a second (or subsequent)
piece of a file that was split up for some reason (transmission, etc.).

In that case, the interpreting process should take its encoding cue from
the first fragment, and should NEVER reinterpret fragments broken up at
arbitrary points. (Imagine a process modifying a GIF or JPEG file, or
converting CR/LF, based on fragments!) But this is not the point being
discussed anyway; the point is whole files.
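
To make the fragment case concrete, here is a minimal sketch in Python
(my choice of language; the helper name decode_fragments is hypothetical).
It assumes the fragments arrive in order and relies on the standard
library's "utf-8-sig" incremental decoder, which looks for a signature
only at the very start of the stream. A later fragment that happens to
begin with EF BB BF is decoded as an ordinary U+FEFF and is not
reinterpreted.

```python
import codecs
from typing import Iterable


def decode_fragments(fragments: Iterable[bytes]) -> str:
    """Decode ordered byte fragments of one UTF-8 stream.

    The signature check happens once, at the start of the first
    fragment; no later fragment is examined for a signature.
    """
    # One decoder for the whole stream, so state carries across fragments.
    decoder = codecs.getincrementaldecoder("utf-8-sig")()
    parts = [decoder.decode(chunk) for chunk in fragments]
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)
```

Because the decoder keeps state between calls, a signature or a
multi-byte character split across two fragments is also handled
correctly.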

2. It could happen; Unicode allows any character to appear anywhere.

Well, almost anywhere. But even so, the likelihood of a U+FEFF as
ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly
small compared to the likelihood that the U+FEFF was intended to be a
signature. The rare case is just too rare to invalidate the heuristic
for the much more common case.
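
The heuristic I am defending can be sketched in a few lines of Python.
This is only an illustration under the assumption stated above, that a
leading EF BB BF in a whole file is a signature and not content; the
function name decode_utf8_file is made up for the example.

```python
import codecs
from typing import Tuple


def decode_utf8_file(data: bytes) -> Tuple[str, bool]:
    """Decode a whole UTF-8 file, treating a leading EF BB BF as a signature.

    Returns the decoded text and a flag saying whether a signature
    was found and removed.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Assumption: an initial U+FEFF is a signature, never a real ZWNBSP.
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), True
    return data.decode("utf-8"), False
```

A U+FEFF anywhere other than the first position is left alone, so the
only text this can misread is the rare file that truly begins with a
ZWNBSP.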

In addition, as Michka points out, we now have U+2060 WORD JOINER, whose
entire purpose in life is to be used as U+FEFF was formerly used, as a
ZWNBSP. Any new Unicode text should use U+2060 and not U+FEFF as a word
joiner. It's hard to imagine that UTC and WG2 would have standardized
this if there were a lot of real-world text that used U+FEFF as ZWNBSP.
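
For text that does still use U+FEFF as a ZWNBSP, moving to U+2060 is
mechanical. The sketch below (Python; the function name is hypothetical)
assumes, consistent with the heuristic above, that a U+FEFF in the first
position is a signature and should be left as it is, and that every
other U+FEFF was meant as a word joiner.

```python
ZWNBSP = "\ufeff"       # U+FEFF, also used as the byte order mark / signature
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER


def modernize_word_joiners(text: str) -> str:
    """Replace every non-initial U+FEFF with U+2060."""
    if text.startswith(ZWNBSP):
        # Assumption: an initial U+FEFF is a signature, so keep it unchanged.
        return ZWNBSP + text[1:].replace(ZWNBSP, WORD_JOINER)
    return text.replace(ZWNBSP, WORD_JOINER)
```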

-Doug Ewell
Fullerton, California


This archive was generated by hypermail 2.1.5 : Sat Nov 02 2002 - 17:01:06 EST