Re: Representing Unix filenames in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sun Nov 27 2005 - 13:25:58 CST

Next message: Hans Aberg: "Re: Representing Unix filenames in Unicode"

Previous message: Samuel Thibault: "Re: Representing Unix filenames in Unicode"
In reply to: Philippe Verdy: "Re: Representing Unix filenames in Unicode"
Next in thread: Philippe Verdy: "Re: Representing Unix filenames in Unicode"
Reply: Philippe Verdy: "Re: Representing Unix filenames in Unicode"
Reply: Philippe Verdy: "Re: Representing Unix filenames in Unicode"

"Philippe Verdy" <verdy_p@wanadoo.fr> writes:

> If you want to keep the compatibility with null-ended byte strings,
> may be the alternative using really non-character code points might
> help.

What do you mean by "compatibility with null-ended byte strings"?

The point in using U+0000 as the escape character is that it does not
appear when filenames are converted to Unicode using pure UTF-8. And
it's the only such code point (unless we count surrogates, but abusing
them would be worse).

This means that any filename which can be decoded using pure UTF-8,
decodes to the same string using UTF-8-with-escaped-bytes. And any
string which can be encoded into a filename using pure UTF-8 at all
(i.e. consisting only of code points U+0001..U+D7FF or U+E000..U+10FFFF)
encodes to the same string using UTF-8-with-escaped-bytes.
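
A minimal Python sketch of such a codec follows. The message only says
that U+0000 is the escape character; the concrete escape format used
here (U+0000 followed by the code point whose number equals the byte
value) and the names decode_filename and encode_filename are
assumptions made for illustration.

```python
ESC = "\u0000"  # escape character; never produced by pure UTF-8 decoding of a filename


def decode_filename(raw: bytes) -> str:
    """Decode as UTF-8, escaping every undecodable byte as ESC + chr(byte)."""
    if 0 in raw:
        raise ValueError("Unix filenames cannot contain a NUL byte")
    out = []
    pos = 0
    while pos < len(raw):
        try:
            out.append(raw[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end delimit the offending bytes within raw[pos:].
            out.append(raw[pos:pos + err.start].decode("utf-8"))
            for b in raw[pos + err.start:pos + err.end]:
                out.append(ESC + chr(b))  # assumed escape format
            pos += err.end
    return "".join(out)


def encode_filename(name: str) -> bytes:
    """Inverse of decode_filename; partial, like pure UTF-8 encoding."""
    out = bytearray()
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == ESC:
            # An escape must be followed by a code point in 0x80..0xFF.
            if i + 1 >= len(name) or not 0x80 <= ord(name[i + 1]) <= 0xFF:
                raise ValueError("dangling or invalid escape")
            out.append(ord(name[i + 1]))
            i += 2
        else:
            out += ch.encode("utf-8")  # raises for lone surrogates
            i += 1
    return bytes(out)


valid = "za\u017c\u00f3\u0142\u0107".encode("utf-8")
assert decode_filename(valid) == valid.decode("utf-8")  # same as pure UTF-8
broken = b"a\xffb"
assert encode_filename(decode_filename(broken)) == broken  # round trip
```

With this format, a name that is valid UTF-8 never acquires an escape,
so it decodes to the same string as with pure UTF-8, and a string
without U+0000 or surrogates encodes to the same bytes as with pure
UTF-8.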

> Really, you cannot reach a full bijection for those cases:

Actually it would be possible, but it's hard to design a bijection
with sensible properties like preserving concatenation and preserving
ASCII fragments.

But I don't need a bijection: it's acceptable when there are Unicode
strings which can't be used as filenames. It's already the case in
pure UTF-8 (due to U+0000 and "/").

The only undesirable property is that there exist different Unicode
strings which map to the same byte string. This can be fixed, at the
cost of complicating the algorithm (by disallowing escaping those
sequences which would yield valid UTF-8 representations of characters);
the fixed algorithm has properties quite analogous to UTF-8, except
that all byte strings are covered. In particular 0x01..0x7F correspond
to U+0001..U+007F and vice versa.
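
To make the ambiguity and the fix concrete, here is a hedged Python
sketch. The escape format (U+0000 followed by the code point equal to
the byte value) is an assumption, as are the function names. The strict
encoder rejects non-canonical escapes by re-decoding its own output,
which is one possible way to implement the rule above, not necessarily
the intended one.

```python
ESC = "\u0000"


def decode(raw: bytes) -> str:
    """UTF-8 decoding with each invalid byte escaped as ESC + chr(byte)."""
    if 0 in raw:
        raise ValueError("NUL byte in filename")
    out, pos = [], 0
    while pos < len(raw):
        try:
            out.append(raw[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            out.append(raw[pos:pos + err.start].decode("utf-8"))
            out.extend(ESC + chr(b) for b in raw[pos + err.start:pos + err.end])
            pos += err.end
    return "".join(out)


def encode_lenient(name: str) -> bytes:
    """Accepts any escape, so distinct strings can yield the same bytes."""
    out, i = bytearray(), 0
    while i < len(name):
        if name[i] == ESC:
            if i + 1 >= len(name) or not 0x80 <= ord(name[i + 1]) <= 0xFF:
                raise ValueError("dangling or invalid escape")
            out.append(ord(name[i + 1]))
            i += 2
        else:
            out += name[i].encode("utf-8")
            i += 1
    return bytes(out)


def encode_strict(name: str) -> bytes:
    """Rejects escapes whose bytes would have decoded as valid UTF-8."""
    raw = encode_lenient(name)
    if decode(raw) != name:
        raise ValueError("non-canonical escape")
    return raw


plain = "\u00e9"              # U+00E9, UTF-8 bytes C3 A9
escaped = "\x00\xc3\x00\xa9"  # the same two bytes, each escaped
assert encode_lenient(plain) == encode_lenient(escaped) == b"\xc3\xa9"
assert encode_strict(plain) == b"\xc3\xa9"
assert encode_strict("\x00\xff") == b"\xff"  # a genuinely invalid byte is fine
```

In this sketch encode_strict(escaped) raises ValueError, so each byte
string has exactly one Unicode string that encodes to it.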

> And yes this creates a security risk as soon as you perform a
> conversion from code point strings to byte streams, i.e. when trying
> to access the filesystem from a valid code point string.

I don't see a larger security risk than making the default conversion
depend on the locale at all.

> This effectively means that users of that interface won't be able to
> access to every file on the filesystem, and only administrators of
> that system will have the tools to interact with it at the byte stream
> level, to manage the case of existing filenames with invalid UTF-8
> sequences: this could be performed by tools like "fsck" run by
> sys-admins on Unix/Linux that will correct these filenames to enforce
> this security, by renaming them into non-conflicting names (possibly
> with a leading ".#" prefix to "hide" them in user interfaces, and with
> an extra numeric extension in case of conflict).

A programming language doesn't have the power to declare some
filenames as not kosher. They are valid from the Unix perspective,
so unless the OS prevents creating them in the first place, a language
which doesn't allow accessing them is handicapped.

> So I see absolutely no need to add more complexity to programs, and
> what Java does looks very valid in this perspective.

This is not adding complexity to programs. It's adding it to the
runtime of the programming language.

What Java does is that converting a byte string to Unicode and back
can yield a different byte string without signalling any error
(invalid UTF-8 fragments get converted to U+FFFD, which has a
different representation in UTF-8). I can pass an existing filename as
an argument to the program, and the program will access a different
file. This is bad.
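
The same lossy round trip can be reproduced in a few lines of Python,
whose "replace" error handler also substitutes U+FFFD. This only
illustrates the effect; it is not Java's code, and the filename is a
made-up example.

```python
# A hypothetical filename containing a byte that is invalid in UTF-8.
raw = b"report-\xff.txt"

# Decoding with replacement silently turns the bad byte into U+FFFD.
text = raw.decode("utf-8", errors="replace")

# Encoding back yields the UTF-8 form of U+FFFD (EF BF BD), not the
# original byte, and no error is signalled at either step.
back = text.encode("utf-8")

assert text == "report-\ufffd.txt"
assert back == b"report-\xef\xbf\xbd.txt"
assert back != raw  # the program would now refer to a different file
```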

(I'm talking about Sun's Java implementation. GCJ is even worse because
it uses different default encodings in different places, and assumes
that filenames are encoded in Java-modified UTF-8 only. At least this
was the case when I last looked at it.)

> This means that APIs that read directory entries should silently
> discard and ignore the discovered names that are incorrectly encoded

What about getting the current directory? Getting program arguments?
Getting environment variables? Reading the target of a symbolic link?
Getting the mount point of a volume? You can't pretend that they don't
exist. Especially program arguments in a language like Java.

> (not trying to disguise them as these files won't be openable or
> deletable under these modified names!),

Of course they are openable and deletable. The encoding is the inverse
function of the decoding. Encoding is partial, as in pure UTF-8;
it's a partial decoding function that would be unfortunate.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

This archive was generated by hypermail 2.1.5 : Sun Nov 27 2005 - 13:28:13 CST