Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 07 2004 - 17:18:33 CST


    I know what you mean here: most Linux/Unix filesystems (as well as many
    legacy filesystems for Windows and MacOS...) do not track the encoding
    with which filenames were encoded, so, depending on the local user's
    preferences at the time each file was created, filenames on such systems
    seem to have unpredictable encodings.

    However, the problem arises most often when interchanging data from one
    system to another, through removable or shared volumes.

    Needless to say, these systems were badly designed at their origin, and
    newer filesystems (and OS APIs) offer a much better alternative, either by
    storing explicitly on the volume which encoding it uses, or by forcing all
    user-selected encodings into a common kernel encoding such as one of the
    Unicode encoding schemes (this is what FAT32 and NTFS do for filenames
    created under Windows, since Windows 98 or NT).

    I understand that there may exist situations, such as Linux/Unix UFS-like
    filesystems, where it will be hard to decide which encoding was used for
    filenames (or simply for the content of plain-text files). For plain-text
    files that contain enough data, automatic identification of the encoding
    is possible, and it is used with success in many applications (notably in
    web browsers).

    But for filenames, which are generally short, automatic identification is
    often difficult. However, UTF-16 remains easy to identify, most often, due
    to the unusually high frequency of low byte values at every even or odd
    position. UTF-8 is also easy to identify due to its strict rules (without
    these strict rules, which forbid some sequences, automatic identification
    of the encoding would be very risky).

    If the encoding cannot be identified precisely and explicitly, I think
    that UTF-16 is much better than UTF-8 (and it also offers a better
    compromise on total size for names in any modern language). However, it
    is true that UTF-16 cannot be used on Linux/Unix due to the presence of
    null bytes. The alternative is then UTF-8, but it is often larger than
    legacy encodings.
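
    The actual byte counts are easy to compare for any given name (a small
    Python sketch; the sample names are arbitrary):

        # Compare the encoded size of a few sample filenames.
        samples = ['rapport_final.txt', 'Révision_été.txt', 'Отчёт.txt']

        for name in samples:
            sizes = {}
            for enc in ('iso-8859-1', 'utf-8', 'utf-16-le'):
                try:
                    sizes[enc] = len(name.encode(enc))
                except UnicodeEncodeError:
                    sizes[enc] = None  # not representable in this encoding
            print(name, sizes)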

    An alternative can then be a mixed encoding selection (a sketch follows
    after this list):
    - choose a legacy encoding that will most often be able to represent
    valid filenames without loss of information (for example ISO-8859-1, or
    Cp1252);
    - encode the filename with it;
    - try to decode the result with a *strict* UTF-8 decoder, as if it were
    UTF-8 encoded;
    - if there is no failure, then you must re-encode the filename with UTF-8
    instead, even if the result is longer;
    - if the strict UTF-8 decoding fails, you can keep the filename in the
    first 8-bit encoding...

    When parsing filenames:
    - try decoding them with *strict* UTF-8 rules. If this does not fail,
    then the filename was effectively encoded with UTF-8;
    - if the decoding fails, decode the filename with the legacy 8-bit
    encoding.
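
    A minimal sketch of that scheme in Python (the function names and the
    choice of ISO-8859-1 as the legacy encoding are only illustrative):

        LEGACY = 'iso-8859-1'  # or Cp1252, or whatever the local convention is

        def encode_filename(name: str) -> bytes:
            """Store a filename, preferring the legacy encoding unless it is
            ambiguous with UTF-8."""
            try:
                raw = name.encode(LEGACY)
            except UnicodeEncodeError:
                # Not representable in the legacy encoding at all: use UTF-8.
                return name.encode('utf-8')
            try:
                # If the legacy bytes also form valid (strict) UTF-8, a reader
                # could not tell the two apart, so store the name as UTF-8.
                raw.decode('utf-8', errors='strict')
                return name.encode('utf-8')
            except UnicodeDecodeError:
                # The bytes are unambiguously not UTF-8: keep the legacy form.
                return raw

        def decode_filename(raw: bytes) -> str:
            """Read back a filename stored by encode_filename."""
            try:
                return raw.decode('utf-8', errors='strict')
            except UnicodeDecodeError:
                return raw.decode(LEGACY)

    With this convention, a name like "café" stays in the 8-bit legacy form,
    while a name whose legacy bytes happen to look like valid UTF-8 is stored
    as UTF-8, so the reader's strict UTF-8 test always picks the right branch.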

    But even with this scheme, you will find interoperability problems,
    because some applications will expect only the legacy encoding, or only
    the UTF-8 encoding, without trying to decide between them...


