Re: UTF-8N?

From: Juliusz Chroboczek (jec@dcs.ed.ac.uk)
Date: Tue Jun 20 2000 - 15:08:30 EDT


Peter,

I have the impression that we basically agree, except that you expect
the system to reliably keep track of file types, and I don't.

PC> type of object which, when assembled in accordance with that
PC> protocol, can produce a plain text file.

PC> We shouldn't mix up the use of the BOM and protocols that are not
PC> directly related to Unicode.

So you have two distinct operations, ``split a text file'' and ``split
an octet file''. Symmetrically, ``concatenate text files'' and
``concatenate octet files.'' If your splitting and concatenating
operations mismatch, you die.
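
To make the mismatch concrete, here is a minimal sketch, assuming the text files are UTF-8 and each begins with a BOM. The helper names concat_octets and concat_text are hypothetical, and keeping one leading BOM whenever any input had one is just one possible policy.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three octets EF BB BF

def concat_octets(parts):
    """Join byte strings blindly, the way an octet-stream tool would."""
    return b"".join(parts)

def concat_text(parts):
    """Join UTF-8 text files, keeping at most one BOM at the very start.

    Assumed policy: emit a leading BOM if any input carried one.
    """
    had_bom = any(p.startswith(BOM) for p in parts)
    bodies = [p[len(BOM):] if p.startswith(BOM) else p for p in parts]
    return (BOM if had_bom else b"") + b"".join(bodies)

a = BOM + "first\n".encode("utf-8")
b = BOM + "second\n".encode("utf-8")

naive = concat_octets([a, b])    # second BOM is stranded mid-file
careful = concat_text([a, b])    # only the leading BOM survives

# "utf-8-sig" strips a leading BOM only, so the stranded one shows up
# in the decoded text as U+FEFF.
print(repr(naive.decode("utf-8-sig")))
print(repr(careful.decode("utf-8-sig")))
```

The octet-style join is perfectly correct for octet files; it only goes wrong when its output is later treated as text.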

Of course, no mismatch happens if the OS keeps track of file types.
Splitting a text/plain file in the octet manner yields two
octet-stream files, and the OS should ensure that you cannot merge
them in the wrong way.

I think that the problems that Mac users have with FTP and other
operations show that relying on file attributes is a bad idea. Please
allow me to ramble a bit.

One of the aspects of Unicode that receives the most publicity is the
fact that now you can write a Mongolian-Telugu. But many of us are
very excited about the opportunities that Unicode gives for simplifying
file management.

Currently, my computer's disk is littered with hundreds of plain text
files. Most of these are plain ASCII, but quite a few are encoded in
one of ISO 8859-1, 8859-2, 8859-15, CP 1252, CP 1250, MacRoman, and
even NeXT's encoding. Most of these files are not tagged, and those
that are often carry an incorrect tag. There is no way of determining a
file's encoding short of reading it into an editor and trying to read
it.
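
A rough illustration of why the octets alone do not settle the question: the same byte string decodes without error under several of these encodings, only to different characters. The sample bytes and the helper name decode_all are made up for the example, and the codec names are the ones Python's standard library happens to use.

```python
# Hypothetical sample: five octets, one of them (0xB3) outside ASCII.
SAMPLE = bytes([0x5A, 0xB3, 0x6F, 0x74, 0x79])

# Python codec names for some of the encodings mentioned above.
CANDIDATES = ["iso8859-1", "iso8859-2", "iso8859-15",
              "cp1252", "cp1250", "mac_roman"]

def decode_all(data, encodings=CANDIDATES):
    """Return {encoding: decoded text} for every encoding that accepts data."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            pass  # this encoding rejects the octets outright
    return results

for enc, text in decode_all(SAMPLE).items():
    print(f"{enc:12} {text!r}")
```

Every candidate accepts the input, so nothing short of a human reading the result can say which decoding was intended.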

At one point, I thought that with Unicode there would be only one
cross-platform encoding, and that a plain text file from a Mac and a
plain text file from a Windows machine would be the same thing (up to
some uninteresting variations in line ending).

Later, I thought that there would be two Unicode encodings, the ones
that are now called UTF-16BE and UTF-8N. I was prepared to live with
that.

Right now, it looks like there will be at least 8 Unicode encodings,
at least 4 of which will be in common use (big-endian UTF-16, UTF-16BE,
UTF-8N, UTF-8). What is worse, some of these formats, including the
most common one, will have to be treated specially when applying
mundane operations such as splitting a file.
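
For what it is worth, the four forms can be laid side by side in a few lines. This is only a sketch: it assumes that "big-endian UTF-16" means big-endian octets preceded by a BOM, that UTF-16BE and UTF-8N carry no BOM, and that "UTF-8" here means the BOM-prefixed form. The labels are mine.

```python
import codecs

text = "abc"

forms = {
    "UTF-16, big-endian, with BOM": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-16BE (no BOM)": text.encode("utf-16-be"),
    "UTF-8N (no BOM)": text.encode("utf-8"),
    "UTF-8 with BOM": text.encode("utf-8-sig"),
}

for name, data in forms.items():
    print(f"{name:30} {data.hex()}")

# Splitting a BOM-carrying file at an arbitrary octet offset gives two
# halves in different formats: only the first half keeps the signature.
with_bom = forms["UTF-8 with BOM"]
head, tail = with_bom[:4], with_bom[4:]
print(head.startswith(codecs.BOM_UTF8), tail.startswith(codecs.BOM_UTF8))
```

The BOM-less forms split and rejoin as plain octets; the BOM-carrying ones are the ones that need the special treatment.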

                                        J.


