RE: UTF-8 in email

From: Paul Dempsey (paulde@microsoft.com)
Date: Sat Oct 17 1998 - 10:05:19 EDT


> -----Original Message-----
> From: Ienup Sung [mailto:ienup.sung@eng.sun.com]
> Sent: Friday, October 16, 1998 8:13 PM
> To: Unicode List
> Subject: RE: UTF-8 in email
>
...
>
> - EF BB BF (or in the other ordering form) doesn't mean that
> particular file is a UTF-8 file since in Asian codesets (also in many
> single byte codesets), EFBB and BFxx can be a pair of valid multibyte
> characters (or three single byte characters).

Of course they're valid characters -- in most SBCS codepages, EVERY octet is
a legal character. However, for codepage 1252 at least, this sequence is
highly unlikely in a real file (the characters form a meaningless jumble). How likely is
it that this sequence will start a file in those Asian codesets where this
is a legal sequence?

I contend that the occurrence rate of false positives is so low that the
UTF-8 file signature works well as a file format discriminator. In the case
where a file beginning with these characters is not actually UTF-8, the
application will attempt to convert it to UCS-2. It's nearly certain that a
file that is not UTF-8 will fail to convert, and then the application falls
back to the default system codepage instead. By sheer luck, this is
usually the correct one, and the data is correctly loaded.
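
As a rough illustration of the fallback behavior described above, here is a
minimal sketch in Python. It is not the editor's actual code: the function
name decode_text and the "cp1252" default are assumptions made for the
example, and Python's str stands in for the UCS-2 conversion.

```python
import codecs
from typing import Tuple


def decode_text(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used) for the raw bytes of a plain-text file.

    Hypothetical helper: trust the UTF-8 signature only if the rest of
    the file also converts cleanly, otherwise use the fallback codepage.
    """
    if data.startswith(codecs.BOM_UTF8):
        try:
            # Strict decoding: any malformed sequence raises UnicodeDecodeError.
            body = data[len(codecs.BOM_UTF8):]
            return body.decode("utf-8"), "utf-8"
        except UnicodeDecodeError:
            pass  # False positive: the signature bytes were ordinary text.
    # Assumed default codepage; undefined bytes become U+FFFD instead of failing.
    return data.decode(fallback, errors="replace"), fallback
```

With this sketch, a file whose first three bytes are EF BB BF but whose
remainder is not well-formed UTF-8 is decoded in the fallback codepage, and
the three leading bytes are kept as ordinary characters.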

> Therefore, if you hard code such a very limited scope of
> heuristics in your application without any override mechanism, your
> applications are not going to be able to support many other
> codesets.
> (You have to put codeset selection mechanism somehow and one way or
> another anyway.)

Actually, in my application we don't have a codeset user-selection mechanism
for opening files. We may well add it for the next version. The problem with
it is that users must be aware of the mechanism and the need to use it. The
average user of our product doesn't know what a codepage is -- they just
want files to open and display correctly. Using a file signature lessens
the need for such a mechanism.

> - Also, unless you know somehow this mysterious file is any one kind of
> Unicode files, it wouldn't make much sense to have such additional
> heuristics in your application. (Q: How many times you will not know
> whether this Unicode file is UTF-16 or UTF-8 or UCS-4??

Except for UTF-8, these types of plain-text files are nearly perfectly
discriminated by the general use of the appropriate signature. As a user I
don't know or care what the format is because my application reads them
correctly.

The BOM file signature for UTF-16 files is well established (the XML standard
gives it normative status for XML streams), and many applications
successfully require it. After all, the UTF-16 BOM is also a perfectly legal
sequence of characters in many codesets, and I've never seen anyone argue
about its use as a discriminator for UTF-16 files.
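
To make the signature-based discrimination concrete, here is a small sketch
in Python. The table and the function name sniff_signature are hypothetical,
and Python's "utf-32" codec names are used as a stand-in for UCS-4. It only
inspects the leading bytes and makes no claim about how any particular
product does it.

```python
import codecs

# Longest signatures first: the 4-byte little-endian signature FF FE 00 00
# begins with the 2-byte UTF-16 little-endian signature FF FE.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_signature(data: bytes):
    """Return (encoding_name, signature_length), or (None, 0) if none matches."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

One design caveat of checking the 4-byte forms first: a UTF-16 little-endian
file whose first real character is U+0000 would be reported as utf-32-le.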

> Isn't it more easier, if possible, just ask to the sender what is it
> really?

A user can do this in the context of email, if both sender and receiver know
what an encoding is. It shouldn't be necessary. In email you have
well-defined external mechanisms to specify the encoding, so a file
signature or communication between the users is unnecessary.

We're really talking about plain-text files here, which generally exist
outside of a context where there is a well-defined protocol to specify the
encoding. For plain-text files, a file signature convention is useful.

> Or just try to open it three times with each one of them?)

Most of my customers would find that unacceptable.

> - Since it can be a "zero width no break space" character, you also
> need to give some kind of choice whether end-users want to use it as
> zero width no break space character or to indicate whether that
> particular character is to indicate this is UTF-8 file and ignore it
> (or both??).

I disagree. Since the character has zero width, no visible appearance, and
has no effect on the properties of the surrounding text, there is no
difference to the user between the character as a signature and the
character as part of the text. This is precisely why the UTF-16 BOM is the
character it is.
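
For reference, the signatures discussed here are all the same character,
U+FEFF, in different encoding forms. The sketch below (Python, with
hypothetical helper names signature_bytes and strip_signature) shows the
byte sequences and one way an application could drop a leading signature
while leaving any later U+FEFF in the text alone.

```python
SIGNATURE = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, also used as the BOM


def signature_bytes(encoding: str) -> bytes:
    """Encode U+FEFF in the given encoding form to get its file signature."""
    return SIGNATURE.encode(encoding)


def strip_signature(text: str) -> str:
    """Drop a single leading U+FEFF; occurrences elsewhere are left in place."""
    return text[1:] if text.startswith(SIGNATURE) else text
```

Whether to strip the character or keep it in the buffer is an application
choice; the argument above is that the user sees no difference either way.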

Regards,
--- Paul Chase Dempsey
Microsoft Visual Studio Text Editor Development


