Re: Representing Unix filenames in Unicode

From: Philippe Verdy (
Date: Sun Nov 27 2005 - 11:16:45 CST

    From: "Marcin 'Qrczak' Kowalczyk" <>
    > - When converting from Unicode to byte strings, only U+0000 followed
    > by another U+0000 or by a character between U+0080 and U+00FF gets
    > converted. U+0000 followed by any other character is an error.
    > - This encoding has the properties that any byte string converted
    > to Unicode will yield back to the same byte string, that any valid
    > UTF-8 byte string not containing 0x00 will be converted to the
    > same Unicode string as with UTF-8, and that any Unicode string not
    > containing U+0000 will be converted to the same byte string as with
    > UTF-8.
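A minimal sketch of the quoted scheme, for readers who want to try it. The decoding direction (bytes to Unicode) is not spelled out in the quoted excerpt, so it is inferred here from the stated round-trip properties: a 0x00 byte is assumed to become U+0000 U+0000, and a byte that is not part of a valid UTF-8 sequence is assumed to become U+0000 followed by U+0080..U+00FF. The function names are hypothetical.

```python
def bytes_to_text(data: bytes) -> str:
    """Decode bytes; escape 0x00 and invalid bytes with a leading U+0000."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x00:
            out.append("\x00\x00")  # assumed escape for a literal NUL byte
            i += 1
            continue
        # Try the shortest valid UTF-8 sequence starting at position i.
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(ch)
            i += n
            break
        else:
            # No valid sequence starts here: escape the single byte.
            out.append("\x00" + chr(b))
            i += 1
    return "".join(out)


def text_to_bytes(text: str) -> bytes:
    """Encode text; only U+0000 pairs described in the quote are accepted."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\x00":
            out += ch.encode("utf-8")
            i += 1
            continue
        nxt = text[i + 1] if i + 1 < len(text) else None
        if nxt == "\x00":
            out.append(0x00)
        elif nxt is not None and 0x80 <= ord(nxt) <= 0xFF:
            out.append(ord(nxt))
        else:
            raise ValueError("U+0000 not followed by U+0000 or U+0080..U+00FF")
        i += 2
    return bytes(out)
```

Under these assumptions every byte string survives a round trip through text, but the reverse does not hold: the code point strings "\x00\xc3\x00\xa9" and "\u00e9" both encode to the bytes C3 A9.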

    If you want to keep compatibility with null-terminated byte strings, maybe
    the alternative of using true noncharacter code points would help. So I
    would use something like U+FFFE followed by a code point in U+0080..U+00FF.
    But this means that the *valid* UTF-8 encoded byte string that represents
    U+FFFF would have to be escaped when converted to a string of code points,
    and it may still happen that a valid string of code points containing U+FFFE
    could be stored in a byte string to create filenames.

    Really, you cannot reach a full bijection for those cases: as soon as you
    know that a string of valid code points can contain any occurrence of U+0000
    or U+FFFE (internally stored as 16-bit or 32-bit code units or even as
    UTF-8, it does not matter here), you're working in a gray area where your
    algorithm is no longer working only with characters. We are speaking here
    about code point strings, which are effectively a superset of Unicode
    character strings.

    So the right question when designing any API is whether it has to handle
    only characters, or code points. Regarding the Unix filesystem APIs, it is
    clear that, unlike Windows, they do not work at the character level but at
    the byte stream level, which defines its own superset of the code point
    string level. Once you realize that this byte stream level is a superset,
    there's simply no way to create a bijection even with the code point string
    level. You're definitely in a gray area where uniqueness cannot be guaranteed.
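To make the "superset" point concrete, here is a small sketch that counts how many two-byte strings are valid UTF-8 at all. It relies on Python's strict UTF-8 decoder as the definition of validity, which is an assumption of this illustration; the function names are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def count_valid_two_byte_strings() -> int:
    """Count the two-byte strings (out of 256 * 256) that are valid UTF-8."""
    return sum(
        is_valid_utf8(bytes((first, second)))
        for first in range(256)
        for second in range(256)
    )
```

Fewer than a third of the 65536 possible two-byte strings pass this check, so most byte strings have no counterpart among UTF-8 encoded character strings.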

    And yes this creates a security risk as soon as you perform a conversion
    from code point strings to byte streams, i.e. when trying to access the
    filesystem from a valid code point string. The only way to avoid such a risk
    is to restrict the access to the filesystem by only allowing code point
    strings that are valid character strings.
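A minimal sketch of such a restriction. What counts as a "valid character string" is an assumption here: the check rejects U+0000, surrogate code points, and the Unicode noncharacters (U+FDD0..U+FDEF and every code point ending in FFFE or FFFF). The function names are hypothetical.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_character_string(text: str) -> bool:
    """Reject U+0000, surrogates, and noncharacters (assumed policy)."""
    for ch in text:
        cp = ord(ch)
        if cp == 0 or 0xD800 <= cp <= 0xDFFF or is_noncharacter(cp):
            return False
    return True


def to_filename_bytes(text: str) -> bytes:
    """Convert to UTF-8 bytes only if the string passes the check above."""
    if not is_character_string(text):
        raise ValueError("not a valid character string")
    return text.encode("utf-8")
```

With this gate in front of the filesystem call, the conversion from text to bytes is plain UTF-8 and no escape convention is needed.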

    This effectively means that users of that interface won't be able to access
    every file on the filesystem, and only administrators of that system will
    have the tools to interact with it at the byte stream level, to manage the
    case of existing filenames with invalid UTF-8 sequences: this could be
    performed by tools like "fsck" run by sys-admins on Unix/Linux that will
    correct these filenames to enforce this security, by renaming them into
    non-conflicting names (possibly with a leading ".#" prefix to "hide" them in
    user interfaces, and with an extra numeric extension in case of conflict).
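The renaming pass described above could be planned roughly as follows. This is only a sketch: it works on a list of raw byte names rather than touching a real filesystem, and the choice to replace each invalid byte with U+FFFD is an assumption of this example, as are the function names. The ".#" prefix and the numeric extension follow the description in the message.

```python
from typing import Dict, Iterable


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the name decodes under Python's strict UTF-8 decoder."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def plan_renames(names: Iterable[bytes]) -> Dict[bytes, bytes]:
    """Map each invalidly encoded name to a non-conflicting valid name."""
    names = list(names)
    taken = set(names)
    plan = {}
    for old in names:
        if is_valid_utf8(old):
            continue  # correctly encoded names are left alone
        # Assumed repair: each invalid byte becomes U+FFFD.
        repaired = old.decode("utf-8", errors="replace").encode("utf-8")
        candidate = b".#" + repaired
        suffix = 0
        while candidate in taken:
            # Add a numeric extension until the name is unused.
            suffix += 1
            candidate = b".#" + repaired + b"." + str(suffix).encode("ascii")
        taken.add(candidate)
        plan[old] = candidate
    return plan
```

An administrator's tool would then apply the plan with the system's rename call, working on byte names throughout.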

    So I see absolutely no need to add more complexity to programs, and what
    Java does looks perfectly valid from this perspective. Personally, I see no
    point in making programs more complex. They should be written to treat filenames
    as character strings, not byte strings. This means that APIs that read
    directory entries should silently discard and ignore the discovered names
    that are incorrectly encoded (without trying to disguise them, as these files
    won't be openable or deletable under these modified names!), and the API
    that attempts to delete what seems to be an empty directory should simply
    return an error if a file remains (all programs should already be
    ready to handle such an error, because filesystems can be used concurrently by
    other users or programs that could link perfectly valid filenames into the
    same directory).
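A rough sketch of the two API behaviours described above, assuming a UTF-8 locale. The function names are hypothetical; `os.listdir` is called with a bytes path so that it returns raw byte names, and `os.rmdir` already fails with an `OSError` when the directory is not empty.

```python
import os
from typing import Iterable, List


def visible_names(raw_names: Iterable[bytes]) -> List[str]:
    """Keep only names that are valid UTF-8; silently drop the rest."""
    visible = []
    for raw in raw_names:
        try:
            visible.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            continue  # incorrectly encoded: not shown, not disguised
    return visible


def list_directory(path: str) -> List[str]:
    """List a directory as character strings, hiding undecodable entries."""
    return visible_names(os.listdir(os.fsencode(path)))


def try_remove_directory(path: str) -> bool:
    """Return False instead of raising if the directory cannot be removed."""
    try:
        os.rmdir(path)
    except OSError:
        return False
    return True
```

A directory that looks empty through `list_directory` may still hold hidden entries, in which case `try_remove_directory` simply reports failure, the same way it would if another process had added a file in the meantime.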

    Really, don't try to disguise things as you do: you add new security risks
    instead of mitigating them. It's up to the filesystem or system tools to
    ensure that filenames are correctly encoded, as they should be. A program
    running in a UTF-8 locale would then encounter no errors, and if it runs in
    another locale, it should detect that and already be prepared for the fact
    that it won't be able to see and handle all files present on a filesystem.
    Don't forget that even with no encoding errors in a filesystem, all programs
    should be ready to cope with the fact that they won't see all files present in
    a filesystem, due to user access restrictions.