Re: Representing Unix filenames in Unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Nov 27 2005 - 15:06:30 CST

Next message: Philippe Verdy: "Re: Character delta between Unicode 4.1 and 5.0"

Previous message: Philippe Verdy: "Fw: Representing Unix filenames in Unicode"
In reply to: Marcin 'Qrczak' Kowalczyk: "Re: Representing Unix filenames in Unicode"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Representing Unix filenames in Unicode"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Representing Unix filenames in Unicode"

From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
> What Java does is that converting a byte string to Unicode and back
> can yield a different byte string without signalling any error
> (invalid UTF-8 fragments gets converted to U+FFFD which has a
> different representation in UTF-8). I can pass an existing filename as
> an argument to the program, and the program will access a different
> file. This is bad.

Java has its own implementation issues, but this is not a language defect,
only an implementation bug.
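
To make the quoted behaviour concrete, here is a minimal sketch of the lossy
round trip, assuming a current JDK where java.nio.charset.StandardCharsets
is available. The class and method names (LossyRoundTrip, roundTrip) are
made up for illustration and are not part of any Java API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LossyRoundTrip {
    // Decode bytes as UTF-8 and encode the result again, as a program that
    // keeps filenames in a String would do. The String constructor replaces
    // malformed input with U+FFFD instead of reporting an error.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] name = { 'a', (byte) 0xFF, 'b' };
        byte[] back = roundTrip(name);
        System.out.println(Arrays.equals(name, back)); // false
        System.out.println(back.length);               // 5: 'a', EF BF BD, 'b'
    }
}
```

The three-byte name comes back as five bytes, so a program that reopens the
file from the re-encoded name would be looking for a different file.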

My opinion is that it should not present to users a Unicode filename that it
can't reproduce exactly as it was read from the filesystem. Java already has
a method to query the effective (canonical) filename that is used on the
filesystem after creation. So applications should use it (if they don't, it's
an application bug, not a Java API design bug).

Applications should also check for file existence using the canonical names
reported from the filesystem (using simple equality does not work with
filesystems that are insensitive to case).
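
As a rough sketch of that comparison, assuming the method meant here is
java.io.File.getCanonicalPath() (or its sibling getCanonicalFile()): the
helper name sameFile is hypothetical, and how far the canonical form
normalizes letter case depends on the platform and filesystem.

```java
import java.io.File;
import java.io.IOException;

final class CanonicalNames {
    // Compare two paths by the canonical names the runtime reports for them,
    // rather than by the strings the caller happened to pass in.
    static boolean sameFile(File a, File b) throws IOException {
        return a.getCanonicalFile().equals(b.getCanonicalFile());
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("canon", ".tmp");
        // The same file reached through a redundant "." path component.
        File alias = new File(f.getParentFile(), "." + File.separator + f.getName());
        System.out.println(f.getPath().equals(alias.getPath())); // false: the strings differ
        System.out.println(sameFile(f, alias));                  // true: same canonical name
        f.delete();
    }
}
```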

Filesystems that currently allow storing random byte strings are bogus and
should be corrected (the historic UFS filesystem for Unix needs a fix, at
least in its associated filesystem tools like "fsck"). All filesystems
should be consistent with the character encoding they use, even if it's only
pure ASCII (such as ISO9660). If this is not enforced for now in a specific
filesystem, it should be enforced system-wide in the OS itself and all its
APIs, with a global system setting considered immediately at boot time.

There's absolutely no reason for applications running on the same system to
use multiple encodings that the OS can't know. If there must exist several
encodings depending on the user's locale, then the user's locale setting
must be accessible to the OS itself (so the locale system must become part
of it, part of its kernel services, instead of being outside in an
application library).

From my point of view, an application that depends on the OS capability to
store distinct filenames for every random byte stream is bogus.

Note that under Unix filesystems, files are identified by inode numbers, not
by names directly. Names are physically stored in a file identified by an
inode number. The format of that special file can physically embed the
encoding which was used to create that filename. The OS service that manages
the storage of these names to create links to inodes is the "dirent"
subsystem. This is where it should be fixed in the Unix APIs (by adding a
parameter that specifies the user's encoding, or by providing a new API where
only valid UTF-8 is permitted).
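
The "only valid UTF-8 is permitted" rule can be sketched in user space, here
in Java for consistency with the discussion above. This is only an
illustration of the check such an API would perform, not a proposal for the
actual Unix interface; the names StrictNames, decodeStrict and
isValidUtf8Name are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictNames {
    // Decode a filename, failing loudly on malformed UTF-8 instead of
    // silently substituting U+FFFD.
    static String decodeStrict(byte[] raw) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(raw)).toString();
    }

    // The acceptance test a UTF-8-only naming API would apply.
    static boolean isValidUtf8Name(byte[] raw) {
        try {
            decodeStrict(raw);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "caf\u00e9.txt".getBytes(StandardCharsets.UTF_8);
        byte[] bad = { 'a', (byte) 0xFF, 'b' };
        System.out.println(isValidUtf8Name(good)); // true
        System.out.println(isValidUtf8Name(bad));  // false
    }
}
```

A name that passes this check survives a decode and re-encode unchanged,
which is exactly the property the lossy conversion lacks.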

This archive was generated by hypermail 2.1.5 : Tue Nov 29 2005 - 09:41:29 CST