Re: unicode on Linux

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Oct 24 2003 - 08:21:37 CST


From: "Stefan Persson" <alsjebegrijptwatikbedoel@yahoo.se>

> Stephane Bortzmeyer wrote:
>
> > I do not agree. It would mean *each* application has to normalize
> > because it cannot rely on the kernel. It has huge security
> > implications (two file names that are identical in NFC, so visually
> > impossible to distinguish, but two different strings of code points).
>
> Couldn't this cause problems if you copy two files to a floppy on a
> system NOT normalising the data (e.g. a customised kernel), with file
> names that would be identical when normalised, and then access the
> floppy on a system that DOES normalise the data? The second system
> might think that the two files have the same file name, and wouldn't
> know which one you're referring to.
>
> Example:
>
> You create two files on system A: "e-acute" and "e combining-acute".
> You move the files to system B, which supports normalising, and request
> file "e-acute". System B normalises that to "e combining-acute", and
> might point to the wrong file. System B thinks that both files are
> named "e combining-acute", so even if you type "e combining-acute" it
> might sometimes return "e-acute".

You've got exactly the same problem on Windows filesystems with
lettercase distinctions: whether or not lettercase is preserved, the
filesystem internally normalizes case to compare filenames, so that
when there is already a file named "a.txt" and you store a new file
"A.TXT", you overwrite the first file, even though both "a.txt" and
"A.TXT" can be retrieved. What's even worse is that when you overwrite
"a.txt" with "A.TXT", the initial lettercase is kept, and if you list
the directory contents you'll see "a.txt", as if your "A.TXT" were not
there.
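
Here is a minimal Python sketch of that case-preserving but
case-insensitive behaviour (purely illustrative; this is not how
Windows actually implements it): names are compared through a
case-folded key, but the lettercase stored first is what gets listed
back.

    class CaseInsensitiveDir:
        def __init__(self):
            self._entries = {}  # case-folded name -> (original name, data)

        def store(self, name, data):
            key = name.casefold()
            # Keep the lettercase of the first name stored under this
            # key: storing "A.TXT" silently overwrites "a.txt".
            original = self._entries.get(key, (name,))[0]
            self._entries[key] = (original, data)

        def retrieve(self, name):
            return self._entries[name.casefold()][1]

        def listing(self):
            return [original for original, _ in self._entries.values()]

    d = CaseInsensitiveDir()
    d.store("a.txt", "first")
    d.store("A.TXT", "second")   # overwrites the entry for "a.txt"
    print(d.listing())           # ['a.txt'] -- "A.TXT" seems absent
    print(d.retrieve("A.TXT"))   # 'second'  -- yet its content is there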

I would say that normalization and other transformations of filenames
can be (and often are) a regular feature of filesystems (normalization
is even normative in the Mac HFS+ filesystem, which uses a
normalization based on NFD). Filenames are normally intended to be read
by humans and to be easy to type in, but they are constrained for
performance or compatibility reasons, so they must remain displayable
and enterable.
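
To make the scenario quoted above concrete, here's a short Python
illustration using the standard unicodedata module: the two spellings
of "é" are distinct strings of code points, yet become equal once both
are normalized (to either form).

    import unicodedata

    precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE ("e-acute")
    decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

    # Distinct code point sequences...
    print(precomposed == decomposed)                    # False
    # ...but equal after normalization to NFC (or, likewise, to NFD):
    print(unicodedata.normalize("NFC", precomposed)
          == unicodedata.normalize("NFC", decomposed))  # True

A filesystem that normalizes names before comparing them therefore sees
one name where a byte-exact filesystem sees two.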

Linux/Unix filesystems don't have this property: filenames on these
systems are intended to be handled by software as exact index keys,
even if a user cannot type such a name from the user interface because
it contains control characters.

[Haven't you seen those pesky Linux/Unix filenames containing
backspaces or "clear screen" escape sequences that mess up your display
when you perform a simple "ls" command? In some cases you have to
resort to kludgy shell commands just to rename or move the bogus file.]
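
One escape hatch (a hypothetical example; the file names here are made
up for the demonstration) is to bypass the shell's quoting entirely and
address the file as raw bytes from Python, which POSIX path functions
accept:

    import os

    # A bogus name: "report" plus a backspace and an ESC character,
    # the kind of name an application might create by accident.
    bogus = b"report\x08\x1b.txt"

    open(bogus, "wb").close()        # create it for the demonstration
    os.rename(bogus, b"report.txt")  # give it a sane name again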

In some cases, accepting arbitrary strings of bytes in Unix/Linux
filenames becomes a security issue, which should be fixed so that no
application will create (most often by error) a bogus filename,
producing a file that can't be removed under its current name, that
breaks some file transfer protocols (like FTP), or that hangs a web
server.
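
As a sketch of the kind of policy I mean (my own illustration, not an
existing API), an application could refuse to create any name that is
not valid UTF-8 or that contains control characters:

    import unicodedata

    def is_safe_filename(raw: bytes) -> bool:
        try:
            name = raw.decode("utf-8")   # reject non-UTF-8 byte strings
        except UnicodeDecodeError:
            return False
        # Reject control characters (category Cc), which break "ls",
        # FTP listings, and other line-oriented protocols.
        if any(unicodedata.category(ch) == "Cc" for ch in name):
            return False
        return name not in ("", ".", "..")

    print(is_safe_filename("r\u00e9sum\u00e9.txt".encode("utf-8")))  # True
    print(is_safe_filename(b"report\x08\x1b.txt"))                   # False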

So for me, even if filenames can be made more user-friendly by
accepting Unicode characters, they are not plain text, and will
inherently carry many restrictions. A good filesystem should either be
assumed to be always Unicode, or specify its character set and naming
rules explicitly to applications (something that had long been lacking
in FAT filesystems, until FAT32 was created with Unicode LFN support).


