Re: UTF-8N?

From: Juliusz Chroboczek (jec@dcs.ed.ac.uk)
Date: Wed Jun 21 2000 - 19:16:31 EDT


(I've allowed myself to quote from a number of distinct posts.)

DE> On the contrary, I thought Peter's point was that the OS (or the
DE> split/merge programs) should *not* make any special assumptions
DE> about text files.

Sorry if I wasn't clear. I was taking for granted that OSes will not
reliably keep track of file types (we all know the problems that this
creates for VMS and Apple Mac users). I was pointing out that without
a clear notion of file type, the BOM is a bad idea.

PC> Without rules, users will generate UTF-8 files that both do and
PC> don't start with a BOM. If there is software out there that's going
PC> to blow up in one or the other case, that's not a satisfactory
PC> state of affairs.

The problem is not one of broken software. The problem is that, as
John Cowan explained in detail, with the addition of the BOM, UTF-8
and UTF-16 become ambiguous. (In what follows, I use ``a Unicode
file'' for ``a file containing Unicode data in one of UTF-8 or UTF-16''.)
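
A minimal sketch of the ambiguity, in Python (the sample strings are
made up for illustration, and `utf-8-sig` is simply Python's name for
"UTF-8 with an optional leading BOM"). The same bytes can be read as
"a BOM followed by hello" or as "a ZWNBSP followed by hello", and
nothing in the file says which was meant:

```python
import codecs

# Text whose first character really is ZWNBSP (U+FEFF), saved with no BOM.
with_zwnbsp = "\ufeffhello".encode("utf-8")

# The text "hello", saved by a tool that prepends a UTF-8 BOM.
with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The two files are byte-for-byte identical.
same_bytes = with_zwnbsp == with_bom

# A reader has to guess: keep the leading U+FEFF as data, or drop it as a BOM.
read_as_data = with_bom.decode("utf-8")      # keeps U+FEFF
read_as_bom = with_bom.decode("utf-8-sig")   # strips a leading BOM

print(same_bytes, len(read_as_data), len(read_as_bom))
```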

It all stems from the fact that U+FEFF is not only what is used for
the BOM, but also a valid Unicode/ISO 10646 codepoint. The issue
would be solved by deprecating the use of U+FEFF as a Unicode
character (for example by defining a new codepoint for ZWNBSP), and
using U+FEFF for the BOM only. The standard could then say that
applications should discard all occurrences of U+FEFF when reading a
file, and allow applications to insert U+FEFF at arbitrary points when
writing a Unicode file.
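
A sketch of what such a reading rule could look like, in Python. The
function name `read_unicode_text` is hypothetical, and the rule itself
is only the proposal made above, not anything the standard actually
says:

```python
def read_unicode_text(data: bytes, encoding: str = "utf-8") -> str:
    """Decode data and drop every U+FEFF, wherever it occurs.

    This assumes the hypothetical rule that U+FEFF is only ever a BOM
    and never a character in its own right.
    """
    return data.decode(encoding).replace("\ufeff", "")


# Two parts, each saved with its own BOM, then joined byte-wise.
bom = "\ufeff".encode("utf-8")
joined = bom + "ab".encode("utf-8") + bom + "cd".encode("utf-8")

print(read_unicode_text(joined))
```

Under that rule a BOM in the middle of a concatenated file is harmless,
at the price of losing any U+FEFF that was meant as ZWNBSP.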

I suspect that deprecating U+FEFF is not politically acceptable for
Unicode and ISO 10646, though.

PC> Doesn't that simply indicate that, in a protocol that dissects a
PC> long file into parts to be transmitted separately, it is
PC> inappropriate to add a BOM to the beginnings of the parts, whether
PC> they use UTF-8 or UTF-16?

Appropriate or not, users (you know, those people who don't read the
documentation that the programmers don't write) will use text editors
to split files. They will then concatenate the files using a
non-Unicode aware tool. And they will complain that the checksums
mismatch.
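
The scenario can be sketched in Python as follows. The file contents
and the helper `save_with_bom` are invented for illustration; the
helper stands in for an editor that writes a BOM on every save, and the
byte-wise join stands in for a tool that knows nothing about Unicode:

```python
import codecs
import hashlib


def save_with_bom(text: str) -> bytes:
    """Stand-in for an editor that prepends a UTF-8 BOM on every save."""
    return codecs.BOM_UTF8 + text.encode("utf-8")


# The original file, itself saved with a BOM.
original = save_with_bom("first half\nsecond half\n")

# The user splits it in an editor; each part gets its own BOM.
part1 = save_with_bom("first half\n")
part2 = save_with_bom("second half\n")

# Byte-wise concatenation by a tool that is not Unicode-aware.
rejoined = part1 + part2

original_digest = hashlib.sha256(original).hexdigest()
rejoined_digest = hashlib.sha256(rejoined).hexdigest()
checksums_match = original_digest == rejoined_digest

print(checksums_match, len(rejoined) - len(original))
```

The rejoined file carries a second BOM in the middle, so it is three
bytes longer than the original and the digests differ.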

(What do *you* use to split files on a Windows machine that doesn't
have your favourite utilities installed?)

PC> I think that the variations in BOM are just as "uninteresting" as
PC> the variations in line ending:

Just as uninteresting and just as annoying. The difference being that
we've had over twenty years to learn to deal with CR/LF mismatches
(and fixed-length records, and Fortran carriage control). The BOM
issue opens a whole new area to make new mistakes in.

(Who should I contact to register ``UCS-4PDP11'', the mixed-endian
form of UCS-4?)

Regards,

                                        Juliusz Chroboczek


