Re: Representing Unix filenames in Unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Nov 27 2005 - 15:06:30 CST

    From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    > What Java does is that converting a byte string to Unicode and back
    > can yield a different byte string without signalling any error
    > (invalid UTF-8 fragments gets converted to U+FFFD which has a
    > different representation in UTF-8). I can pass an existing filename as
    > an argument to the program, and the program will access a different
    > file. This is bad.

    Java has its own implementation issues, but this is not a language defect,
    only an implementation bug.
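
The round trip described in the quoted message can be reproduced with a few lines. This is only a sketch: the class and method names (`LossyRoundTrip`, `roundTrip`) are made up for illustration, and it uses the lenient `new String(byte[], Charset)` constructor, which replaces malformed input with U+FFFD instead of reporting an error.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of any Java API.
final class LossyRoundTrip {
    /** Decodes bytes as UTF-8 the lenient way, then encodes the result again. */
    static byte[] roundTrip(byte[] raw) {
        // Malformed sequences are silently replaced by U+FFFD here.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', 0xE9, 'b': a legal Latin-1 filename, but not valid UTF-8.
        byte[] name = { 'a', (byte) 0xE9, 'b' };
        byte[] back = roundTrip(name);
        System.out.println("same bytes after round trip: " + Arrays.equals(name, back));
    }
}
```

The single byte 0xE9 comes back as the three bytes EF BF BD, so the name handed back to the filesystem no longer matches the one that was read.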

    My opinion is that Java should not present to users a Unicode filename that
    it can't reproduce exactly as it was read from the filesystem. Java already
    has a method to query the effective (canonical) filename that is used on the
    filesystem after creation. So applications should use it (if they don't,
    it's an application bug, not a Java API design bug).

    Applications should also check for file existence using the canonical names
    reported from the filesystem (using simple equality does not work with
    filesystems that are insensitive to case).
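
A minimal sketch of comparing files by their canonical names, assuming the method meant above is `java.io.File.getCanonicalPath()` / `getCanonicalFile()` (the post does not name it). The class and method names `CanonicalNames` and `sameFile` are hypothetical.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of any Java API.
final class CanonicalNames {
    /** Returns true if both paths resolve to the same canonical file name. */
    static boolean sameFile(File a, File b) throws IOException {
        // getCanonicalFile() asks the platform for the resolved name, so
        // "./x" and "x" (and, on some case-insensitive filesystems,
        // differently cased spellings) compare equal.
        return a.getCanonicalFile().equals(b.getCanonicalFile());
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("canon", ".tmp");
        f.deleteOnExit();
        File alias = new File(f.getParentFile(), "." + File.separator + f.getName());
        System.out.println("same file: " + sameFile(f, alias));
    }
}
```

Note that this only helps once the name has been read without loss; it does not undo a U+FFFD substitution that has already happened.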

    Filesystems that currently allow storing random byte strings are bogus and
    should be corrected (the historic UFS filesystem for Unix needs a fix, at
    least in its associated filesystem tools like "fsck"). All filesystems
    should be consistent with the character encoding they use, even if it's only
    pure ASCII (such as ISO9660). If this is not enforced for now in a specific
    filesystem, it should be enforced system-wide in the OS itself and all its
    APIs, with a global system setting considered immediately at boot time.

    There's absolutely no reason for applications running on the same system to
    use multiple encodings that the OS can't know. If there must exist several
    encodings depending on the user's locale, then the user's locale setting
    must be accessible to the OS itself (so the locale system must become part
    of it, part of its kernel services, instead of being outside in an
    application library).

    From my point of view, an application that depends on the OS capability to
    store distinct filenames for every random byte stream is bogus.

    Note that under Unix filesystems, files are identified by inode numbers, not
    by names directly. Names are physically stored in a file identified by an
    inode number. The format of that special file can physically embed the
    encoding which was used to create that filename. The OS service that manages
    the storage of these names to create links to inodes is the "dirent"
    subsystem. This is where it should be fixed in the Unix APIs (by adding a
    parameter that specifies the user's encoding, or by providing a new API where
    only valid UTF-8 is permitted).
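
As an application-level illustration of the "only valid UTF-8 is permitted" rule (not the kernel-level change proposed above), here is a sketch of a strict check in Java. The names `StrictUtf8Names`, `decodeStrict` and `isValidUtf8` are hypothetical; the `CharsetDecoder` and `CodingErrorAction.REPORT` calls are standard `java.nio.charset` API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any Java or Unix API.
final class StrictUtf8Names {
    /** Decodes a raw name as UTF-8, failing instead of substituting U+FFFD. */
    static String decodeStrict(byte[] raw) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    /** Returns true only if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] raw) {
        try {
            decodeStrict(raw);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With a check like this, a name such as the bytes 'a', 0xE9, 'b' is rejected with an error rather than silently turned into a different name.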


