From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 07 2004 - 17:18:33 CST
Subject: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

I know what you mean here:
most Linux/Unix filesystems (as well as many legacy filesystems for Windows
and MacOS...) do not track the encoding with which filenames were encoded
and, depending on local user preferences when that user created that file,
filenames on such systems seem to have unpredictable encodings.
The problem arises most often, however, when interchanging data between
systems through removable or shared volumes.
Needless to say, these systems were badly designed from the start, and
newer filesystems (and OS APIs) offer much better alternatives: either
storing explicitly on each volume which encoding it uses, or forcing all
user-selected encodings to a common kernel encoding such as one of the
Unicode encoding schemes (this is what FAT32 and NTFS do for filenames
created under Windows, since Windows 98 and NT).
I understand that there may exist situations, such as Linux/Unix UFS-like
filesystems, where it will be hard to determine which encoding was used for
filenames (or simply for the content of plain-text files). For plain-text
files, which contain enough data, automatic identification of the encoding
is possible, and is used successfully in many applications (notably in web
browsers).
But for filenames, which are generally short, automatic identification is
often difficult. However, UTF-16 remains easy to identify, most often, due
to the very unusual frequency of low-valued bytes at every even or odd
position. UTF-8 is also easy to identify due to its strict rules (without
these strict rules, which forbid some sequences, automatic identification
of the encoding would be very risky).
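The two heuristics above can be sketched in Python; this is not from the
original post, and the 40% null-byte threshold is an illustrative
assumption, not a standard value:

```python
def guess_encoding(data: bytes) -> str:
    """Heuristic sketch: classify bytes as UTF-16, UTF-8, or legacy 8-bit."""
    if data:
        # UTF-16 text in mostly-Latin scripts has many 0x00 bytes
        # concentrated on either the even or the odd byte positions.
        even_nulls = data[0::2].count(0) / max(1, len(data[0::2]))
        odd_nulls = data[1::2].count(0) / max(1, len(data[1::2]))
        if even_nulls > 0.4 or odd_nulls > 0.4:
            return "utf-16"
    # UTF-8's strict rules reject most random legacy byte sequences.
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy-8bit"
```

On short filenames this remains unreliable, exactly as the text says; a
pure-ASCII name, for example, is indistinguishable from UTF-8.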
If the encoding cannot be identified precisely and explicitly, I think that
UTF-16 is much better than UTF-8 (and it also offers a better compromise in
total size for names in any modern language). However, it's true that UTF-16
cannot be used on Linux/Unix due to the presence of null bytes. The
alternative is then UTF-8, but it is often larger than legacy encodings.
An alternative can then be a mixed encoding selection:
- choose a legacy encoding that will most often be able to represent valid
filenames without loss of information (for example ISO-8859-1, or Cp1252).
- encode the filename with it.
- try to decode it with a *strict* UTF-8 decoder, as if it was UTF-8
encoded.
- if there's no failure, then you must reencode the filename with UTF-8
instead, even if the result is longer.
- if the strict UTF-8 decoding fails, you can keep the filename in the first
8-bit encoding...
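The write-side steps above might be sketched as follows (a minimal sketch,
assuming ISO-8859-1 as the chosen legacy encoding, per the example in the
text):

```python
def encode_filename(name: str) -> bytes:
    """Mixed scheme: use the legacy encoding unless its bytes would also
    pass a strict UTF-8 decode, in which case store real UTF-8 instead."""
    legacy = name.encode("iso-8859-1")  # may raise if name isn't representable
    try:
        legacy.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # The legacy bytes are not valid UTF-8, so a reader can
        # unambiguously recognize them as legacy-encoded.
        return legacy
    # The legacy bytes would be misread as UTF-8; reencode as UTF-8,
    # even though the result may be longer.
    return name.encode("utf-8")
```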
When parsing files:
- try decoding filenames with *strict* UTF-8 rules. If this does not fail,
then the filename was effectively encoded with UTF-8.
- if the decoding failed, decode the filename with the legacy 8-bit
encoding.
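The reader side is the mirror image; a sketch under the same ISO-8859-1
assumption:

```python
def decode_filename(raw: bytes) -> str:
    """Reader side of the mixed scheme: strict UTF-8 first, legacy fallback."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # ISO-8859-1 maps all 256 byte values, so this never fails.
        return raw.decode("iso-8859-1")
```

Note that the write-side rule exists precisely so this fallback is
unambiguous: names whose legacy bytes would pass the strict UTF-8 check were
stored as UTF-8, so the round trip recovers the original name.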
But even with this scheme, you will run into interoperability problems,
because some applications will expect only the legacy encoding, or only the
UTF-8 encoding, without trying to distinguish the two...
This archive was generated by hypermail 2.1.5 : Tue Dec 07 2004 - 17:19:57 CST