For Asian codesets, EFBB BFxx accounts for about 125 or 127 character pairs.
Even though the percentage will be small, such a heuristic will obviously
make your customers' attempts to read a certain number of files fail, which
to me means you don't want to use such an application in a mission-critical
environment. Also, even in casual use, I don't want such a hard-coded
mechanism in my application without an override mechanism.
In text editors in a Unix environment, e.g., the CDE DtPad text editor, you
can at least control the codeset by setting the locale when the editor
starts, so that you can choose the codeset you want. You can also
pick up a phone and ask the sender of the file if you didn't receive
that information with the flat file. (On top of the Application layer protocol,
there is the User layer protocol :-)
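As a rough illustration of choosing the codeset from the locale while still allowing an explicit choice, here is a minimal Python sketch. The function name `read_text` and its `codeset` parameter are hypothetical, invented for this example; they are not taken from DtPad or any other editor.

```python
import locale


def read_text(path, codeset=None):
    """Read a flat text file with a codeset chosen by the user.

    `codeset` is a hypothetical override parameter. When it is omitted,
    the codeset of the current locale is used instead.
    """
    if codeset is None:
        # Ask the locale which encoding it prefers, without changing it.
        codeset = locale.getpreferredencoding(False)
    # newline="" keeps the file's own line endings untouched.
    with open(path, "r", encoding=codeset, newline="") as f:
        return f.read()
```

A caller who was told the codeset over the phone would pass it directly, e.g. `read_text("notes.txt", "euc_kr")`; the file name here is only a placeholder.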
Unless BOM-like tagging can be applied to all flat text files in all (or
at least the majority of) the codesets being used in the world, such a
mechanism shouldn't be advocated. And you know that will never happen. Rather,
we should move up to a higher level, using markup languages such as
HTML/XML/SGML plus higher-level tagging in protocols like MIME/HTTP and so on.
Regarding the zero width no-break space character, even though it's not
visible, an application shouldn't ignore the "no break" part unless the
application doesn't support word resolution features (word wrapping, line
wrapping, ...).
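The two readings of a leading U+FEFF (part of the text versus signature to be dropped) can be seen with Python's standard codecs. This is only a sketch of the distinction; the sample bytes are made up for the example.

```python
ZWNBSP = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE, also used as the signature

# EF BB BF followed by four ASCII letters (sample data for illustration).
data = b"\xef\xbb\xbfword"

kept = data.decode("utf-8")          # U+FEFF stays in the decoded text
stripped = data.decode("utf-8-sig")  # a leading U+FEFF is dropped as a signature

print(kept.startswith(ZWNBSP))   # True
print(stripped == "word")        # True
print(len(kept), len(stripped))  # 5 4
```

An application that supports word or line wrapping would have to decide which of the two results it hands to its layout code.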
] Date: Sat, 17 Oct 1998 07:09:43 -0700 (PDT)
] From: Paul Dempsey <firstname.lastname@example.org>
] > -----Original Message-----
] > From: Ienup Sung [mailto:email@example.com]
] > Sent: Friday, October 16, 1998 8:13 PM
] > To: Unicode List
] > Subject: RE: UTF-8 in email
] > - EF BB BF (or in the other ordering form) doesn't mean that the
] > particular file is a UTF-8 file, since in Asian codesets (and also in many
] > single-byte codesets), EFBB and BFxx can be a pair of valid multibyte
] > characters (or three single-byte characters).
] Of course they're valid characters -- in most SBCS codepages, EVERY octet is
] a legal character. However, for codepage 1252 at least, this sequence is
] highly unlikely in a real file (they're a meaningless jumble). How likely is
] it that this sequence will start a file in those Asian codesets where this
] is a legal sequence?
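To make the "valid but meaningless" point concrete, the following Python sketch checks which codecs accept a byte string under strict decoding. The helper name `strict_decoders` is hypothetical, and the candidate list is just an example; whether a given Asian codec accepts a particular byte pair depends on its tables, so no result is claimed for those here.

```python
def strict_decoders(data, candidates):
    """Return the candidate codec names that decode `data` without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


signature = b"\xef\xbb\xbf"

# In codepage 1252 the three octets are three legal characters.
print(repr(signature.decode("cp1252")))  # 'ï»¿'

# The same bytes are accepted by more than one single-byte codec.
print(strict_decoders(signature, ["utf-8", "cp1252", "latin-1"]))

# Multibyte codecs can be probed the same way, e.g. with one more trail byte.
print(strict_decoders(signature + b"\xa1", ["euc_kr", "big5", "gbk"]))
```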
] I contend that the occurrence rate of false positives is so low that the
] UTF-8 file signature works well as a file format discriminator. In the case
] where a file beginning with these characters is not actually UTF-8, the
] application will attempt to convert it to UCS-2. It's nearly certain that a
] file that is not UTF-8 will fail to convert, and then the application falls
] back to the default system codepage instead. By general sheer luck, this is
] usually the correct one, and the data is correctly loaded.
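The signature check and fallback described above can be sketched in Python as follows. This is an approximation under stated assumptions, not the actual implementation of any product: the function name `sniff_decode` is hypothetical, `cp1252` stands in for "the default system codepage", and the result is a Python string rather than UCS-2.

```python
import codecs


def sniff_decode(data, fallback="cp1252"):
    """Decode bytes using the UTF-8 signature as a hint.

    Returns (text, codec_name). `fallback` is an assumed stand-in for the
    default system codepage.
    """
    if data.startswith(codecs.BOM_UTF8):
        try:
            # Signature present: the rest must be well-formed UTF-8.
            body = data[len(codecs.BOM_UTF8):]
            return body.decode("utf-8"), "utf-8"
        except UnicodeDecodeError:
            pass  # not really UTF-8, so fall through to the fallback
    # Undefined bytes in the fallback codepage are replaced, not fatal.
    return data.decode(fallback, errors="replace"), fallback


print(sniff_decode(b"\xef\xbb\xbfhi"))          # ('hi', 'utf-8')
print(sniff_decode(b"\xef\xbb\xbf\xe9t\xe9"))   # ('ï»¿été', 'cp1252')
```

The false positive under discussion is the remaining case: a non-UTF-8 file that starts with EF BB BF and whose remaining bytes also happen to be well-formed UTF-8 would be decoded as UTF-8 without any error being raised.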
] > Therefore, if you hard code such a very limited scope of
] > heuristics in your application without any override mechanism, your
] > applications are not going to be able to support many other
] > codesets.
] > (You have to put codeset selection mechanism somehow and one way or
] > another anyway.)
] Actually, in my application we don't have a codeset user-selection
] for opening files. We may well add it for the next version. The problem with
] it is that users must be aware of the mechanism and the need to use it. The
] average user of our product doesn't know what a codepage is -- they just
] want files to open and display correctly. Using a file signature lessens
] the need for such a mechanism.
] > - Also, unless you somehow know that this mysterious file is one kind of
] > Unicode file, it wouldn't make much sense to have such additional
] > heuristics in your application. (Q: How many times will you not know
] > whether this Unicode file is UTF-16 or UTF-8 or UCS-4??
] Except for UTF-8, these types of plain-text files are nearly perfectly
] discriminated by the general use of the appropriate signature. As a user I
] don't know or care what the format is because my application reads them.
] The BOM file signature for UTF-16 files is well established (the XML standard
] gives it normative status for XML streams), and many applications
] successfully require it. After all, the UTF-16 BOM is also a perfectly valid
] sequence of characters in many codesets, and I've never seen anyone argue
] about its use as a discriminator for UTF-16 files.
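Both halves of that claim can be checked with Python's standard codecs: the `utf-16` codec uses the BOM to pick the byte order, and the same two BOM octets are also legal characters in codepage 1252. The sample text is made up for the example.

```python
import codecs

# The same two characters, serialized in each byte order with a BOM in front.
little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

# The generic "utf-16" codec reads the BOM, picks the order, and drops it.
print(little.decode("utf-16"), big.decode("utf-16"))  # hi hi

# In codepage 1252 the little-endian BOM octets FF FE are two ordinary letters.
print(repr(codecs.BOM_UTF16_LE.decode("cp1252")))  # 'ÿþ'
```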
] > Isn't it easier, if possible, to just ask the sender what it is
] A user can do this in the context of email, if both sender and receiver know
] what an encoding is. It shouldn't be necessary. In email you have
] well-defined external mechanisms to specify the encoding, so a file
] signature or communication between the users is unnecessary.
] We're really talking about plain-text files here, which generally exist
] outside of a context where there is a well-defined protocol to specify the
] encoding. For plain-text files, a file signature convention is useful.
] > Or just try to open it three times with each one of them?)
] Most of my customers would find that unacceptable.
] > - Since it can be a "zero width no break space" character, you also
] > need to give end-users some kind of choice: whether to treat it as a
] > zero width no break space character, or as an indication that this is
] > a UTF-8 file and ignore it (or both??).
] I disagree. Since the character has zero width, no visible appearance, and
] has no effect on the properties of the surrounding text, there is no
] difference to the user between the character as a signature and the
] character as part of the text. This is precisely why the UTF-16 BOM is the
] character it is.
] --- Paul Chase Dempsey
] Microsoft Visual Studio Text Editor Development