Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: John Cowan (jcowan@reutershealth.com)
Date: Mon Dec 06 2004 - 14:52:31 CST

    Doug Ewell scripsit:

    > > Now suppose you have a UNIX filesystem, containing filenames in a
    > > legacy encoding (possibly even more than one). If one wants to switch
    > > to UTF-8 filenames, what is one supposed to do? Convert all filenames
    > > to UTF-8?
    >
    > Well, yes. Doesn't the file system dictate what encoding it uses for
    > file names? How would it interpret file names with "unknown" characters
    > from a legacy encoding? How would they be handled in a directory
    > search?

    Windows filesystems do know what encoding they use. But a filename on
    a Unix(oid) file system is a mere sequence of octets, of which only 00
    (NUL, the terminator) and 2F ("/", the path separator) are interpreted.
    (Filenames containing 20 (space), and especially 0A (newline), are
    annoying to handle with standard tools, but not illegal.)
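
As a rough illustration of the rule above, here is a minimal Python sketch. The helper name is_legal_component is hypothetical, and it checks only the two octets named in this message; it ignores other constraints a real system imposes, such as length limits.

```python
def is_legal_component(name: bytes) -> bool:
    """Return True if `name` could be a single Unix filename component.

    Hypothetical helper: it tests only the two interpreted octets,
    0x00 (NUL) and 0x2F ("/"), plus non-emptiness.
    """
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


# Space (0x20) and newline (0x0A) are awkward for tools, but allowed.
print(is_legal_component(b"a b\nc"))
# Octets that are not valid UTF-8 are allowed too.
print(is_legal_component(b"\xff\xfe"))
# A slash separates components, so it cannot appear inside one.
print(is_legal_component(b"a/b"))
```

Note that nothing in the check refers to an encoding: the name is tested purely as octets.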

    How these octet sequences are translated to characters, if at all,
    is no concern of the file system's. Some higher-level tools, such as
    directory listers and shells, have hardwired assumptions; others have
    changeable assumptions; but all are assumptions.
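
To make the point about assumptions concrete, here is a small Python sketch that reads the same octets under two different encodings. The sample name and the helper interpret are illustrative assumptions, not anything a particular tool does. The surrogateescape error handler is Python's own mechanism for carrying undecodable octets through a string so that they can be written back unchanged.

```python
# The octets of "café" as ISO 8859-1 would store it; not valid UTF-8.
raw = b"caf\xe9"


def interpret(name: bytes, encoding: str) -> str:
    """Decode filename octets under an assumed encoding.

    Hypothetical helper. Octets that do not fit the assumed encoding
    are kept as lone surrogates instead of raising an error.
    """
    return name.decode(encoding, errors="surrogateescape")


as_latin1 = interpret(raw, "latin-1")  # the legacy assumption
as_utf8 = interpret(raw, "utf-8")      # the UTF-8 assumption

print(ascii(as_latin1))
print(ascii(as_utf8))

# Whatever the tool assumed, the octets on disk are recoverable.
print(as_utf8.encode("utf-8", errors="surrogateescape") == raw)
```

The file system stores the same four octets in both cases; only the reader's interpretation differs.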

    -- 
    John Cowan  jcowan@reutershealth.com  www.reutershealth.com  www.ccil.org/~cowan
    No man is an island, entire of itself; every man is a piece of the
    continent, a part of the main.  If a clod be washed away by the sea,
    Europe is the less, as well as if a promontory were, as well as if a
    manor of thy friends or of thine own were: any man's death diminishes me,
    because I am involved in mankind, and therefore never send to know for
    whom the bell tolls; it tolls for thee.  --John Donne
    

