Re: Representing Unix filenames in Unicode

From: Marcin 'Qrczak' Kowalczyk
Date: Sun Nov 27 2005 - 13:25:58 CST

    "Philippe Verdy" <> writes:

    > If you want to keep the compatibility with null-ended byte strings,
    > may be the alternative using really non-character code points might
    > help.

    What do you mean by "compatibility with null-ended byte strings"?

    The point in using U+0000 as the escape character is that it does not
    appear when filenames are converted to Unicode using pure UTF-8. And
    it's the only such code point (unless we count surrogates, but abusing
    them would be worse).

    This means that any filename which can be decoded using pure UTF-8,
    decodes to the same string using UTF-8-with-escaped-bytes. And any
    string which can be encoded into a filename using pure UTF-8 at all
    (i.e. consisting only of code points U+0001..U+D7FF or U+E000..U+10FFFF)
    encodes to the same string using UTF-8-with-escaped-bytes.
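
    A minimal Python sketch of such a scheme, for illustration only. The
    post does not spell out how an escaped byte is represented; this
    sketch assumes an undecodable byte b becomes U+0000 followed by the
    code point with the same value as b. The names decode_filename and
    encode_filename are hypothetical.

```python
ESC = "\x00"  # U+0000 as the escape character (representation is an assumption)


def decode_filename(raw: bytes) -> str:
    """Decode bytes as UTF-8; each undecodable byte b becomes ESC + chr(b)."""
    if b"\x00" in raw:
        raise ValueError("a Unix filename cannot contain a NUL byte")
    out = []
    i = 0
    while i < len(raw):
        # Try to read one well-formed UTF-8 sequence of 1 to 4 bytes.
        for length in (1, 2, 3, 4):
            chunk = raw[i:i + length]
            try:
                text = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(text)
            i += len(chunk)
            break
        else:
            # No well-formed sequence starts here: escape this single byte.
            out.append(ESC + chr(raw[i]))
            i += 1
    return "".join(out)


def encode_filename(name: str) -> bytes:
    """Inverse direction: plain UTF-8, except ESC + chr(b) yields the raw byte b."""
    out = bytearray()
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == ESC:
            # ESC must be followed by a code point naming a non-ASCII byte.
            if i + 1 >= len(name) or not 0x80 <= ord(name[i + 1]) <= 0xFF:
                raise ValueError("malformed escape in filename string")
            out.append(ord(name[i + 1]))
            i += 2
        else:
            # Raises UnicodeEncodeError for surrogates, as pure UTF-8 would.
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)


raw = b"caf\xe9.txt"                      # Latin-1 bytes, not valid UTF-8
assert decode_filename(raw) == "caf\x00\xe9.txt"
assert encode_filename(decode_filename(raw)) == raw
# A valid UTF-8 filename decodes exactly as with pure UTF-8.
assert decode_filename("café".encode("utf-8")) == "café"
```

    In this sketch a lone U+0000 is rejected by the encoder, so the set
    of encodable strings is a superset of the pure UTF-8 one.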

    > Really, you cannot reach a full bijection for those cases:

    Actually it would be possible, but it's hard to design a bijection
    with sensible properties like preserving concatenation and preserving
    ASCII fragments.

    But I don't need a bijection: it's acceptable that some Unicode
    strings can't be used as filenames. That is already the case in
    pure UTF-8 (due to U+0000 and "/").

    The only undesirable property is that there exist different Unicode
    strings which map to the same byte string. This can be fixed, at the
    cost of complicating the algorithm (by disallowing escaping those
    sequences which would yield valid UTF-8 representations of characters);
    the fixed algorithm has properties quite analogous to UTF-8, except
    that all byte strings are covered. In particular 0x01..0x7F correspond
    to U+0001..U+007F and vice versa.
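
    The collision and one possible fix can be sketched in Python as
    follows. This is an illustration under assumptions, not the
    algorithm from the post: an escaped byte b is assumed to be written
    as U+0000 followed by chr(b), the handler name "nul_escape" and the
    function names are hypothetical, and the strict encoder simply
    rejects any string that does not survive a round trip.

```python
import codecs


def _escape_invalid(exc):
    """Error handler: replace every undecodable byte b with U+0000 + chr(b)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("\x00" + chr(b) for b in bad), exc.end


codecs.register_error("nul_escape", _escape_invalid)  # hypothetical name


def decode(raw: bytes) -> str:
    if b"\x00" in raw:
        raise ValueError("a Unix filename cannot contain a NUL byte")
    return raw.decode("utf-8", errors="nul_escape")


def encode_lenient(name: str) -> bytes:
    """Accepts any escape, so distinct strings can map to the same bytes."""
    out = bytearray()
    chars = iter(name)
    for ch in chars:
        if ch == "\x00":
            escaped = next(chars, None)
            if escaped is None or not 0x80 <= ord(escaped) <= 0xFF:
                raise ValueError("malformed escape")
            out.append(ord(escaped))
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def encode_strict(name: str) -> bytes:
    """Rejects escapes whose bytes would form valid UTF-8 characters."""
    raw = encode_lenient(name)
    if decode(raw) != name:
        raise ValueError("escaped bytes would form valid UTF-8")
    return raw


# Two different strings, one byte string: the undesirable property.
assert encode_lenient("\x00\xc3\x00\xa9") == encode_lenient("é") == b"\xc3\xa9"

# The strict encoder accepts only the form that decoding would produce.
assert encode_strict("é") == b"\xc3\xa9"
assert encode_strict("caf\x00\xe9") == b"caf\xe9"
```

    With the strict encoder, every NUL-free byte string has exactly one
    string that encodes to it, namely the one decode() returns.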

    > And yes this creates a security risk as soon as you perform a
    > conversion from code point strings to byte streams, i.e. when trying
    > to access the filesystem from a valid code point string.

    I don't see a larger security risk than making the default conversion
    depend on the locale at all.

    > This effectively means that users of that interface won't be able to
    > access to every file on the filesystem, and only administrators of
    > that system will have the tools to interact with it at the byte stream
    > level, to manage the case of existing filenames with invalid UTF-8
    > sequences: this could be performed by tools like "fsck" run by
    > sys-admins on Unix/Linux that will correct these filenames to enforce
    > this security, by renaming them into non-conflicting names (possibly
    > with a leading ".#" prefix to "hide" them in user interfaces, and with
    > an extra numeric extension in case of conflict).

    A programming language doesn't have the power to declare some
    filenames as not kosher. They are valid from the Unix perspective,
    so unless the OS prevents creating them in the first place, a language
    which doesn't allow accessing them is handicapped.

    > So I see absolutely no need to add more complexity to programs, and
    > what Java does looks very valid in this perspective.

    This is not adding complexity to programs. It's adding it to the
    runtime of the programming language.

    What Java does is that converting a byte string to Unicode and back
    can yield a different byte string without signalling any error
    (invalid UTF-8 fragments get converted to U+FFFD, which has a
    different representation in UTF-8). I can pass an existing filename as
    an argument to the program, and the program will access a different
    file. This is bad.
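
    The effect can be reproduced with a Python analogue. This is not
    Java code; it only mirrors the replace-with-U+FFFD behaviour
    described above, and the function name is hypothetical.

```python
def lossy_round_trip(raw: bytes) -> bytes:
    """Decode with U+FFFD substitution, then encode back to UTF-8."""
    text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")


original = b"caf\xe9.txt"              # a legal Unix filename, not valid UTF-8
result = lossy_round_trip(original)

# The invalid byte 0xE9 became U+FFFD, which encodes as EF BF BD.
assert result == b"caf\xef\xbf\xbd.txt"
# No error was raised, yet the bytes now name a different file.
assert result != original
```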

    (I'm talking about Sun's Java implementation. GCJ is even worse because
    it uses different default encodings in different places, and assumes
    that filenames are encoded in Java-modified UTF-8 only. At least this
    was the case when I last looked at it.)

    > This means that APIs that read directory entries should silently
    > discard and ignore the discovered names that are incorrectly encoded

    What about getting the current directory? Getting program arguments?
    Getting environment variables? Reading the target of a symbolic link?
    Getting the mount point of a volume? You can't pretend that they don't
    exist. Especially program arguments in a language like Java.

    > (not trying to disguise them as these files won't be openable or
    > deletable under these modified names!),

    Of course they are openable and deletable. The encoding is the inverse
    function of the decoding. Encoding is partial, as in pure UTF-8;
    it's a partial decoding function that would be unfortunate.

       __("<         Marcin Kowalczyk
