Re: Representing Unix filenames in Unicode

From: Hans Aberg
Date: Sun Nov 27 2005 - 11:45:23 CST

    On 27 Nov 2005, at 16:03, Marcin 'Qrczak' Kowalczyk wrote:

    > A common problem of programming languages which use Unicode for
    > all their strings (either in the form of code points or UTF-16) is
    > interfacing with Unix APIs based on byte strings, and representing
    > filenames, environment variables, program invocation arguments etc.
    > in the program.
    > From the point of view of the OS they are arbitrary byte strings,
    > usually excluding only NUL. From the point of view of the user they
    > are generally meant to be interpreted as text. Their encoding is
    > implicit; the locale setting provides a reasonable default. But even
    > if the encoding is intended to be UTF-8, the OS doesn't enforce that
    > it is valid UTF-8. Filenames are rarely invalid in the selected
    > encoding, and most filenames are ASCII, so only very rare cases are
    > truly problematic.
    > How to convert these byte strings to Unicode?

    This problem has recently been discussed on the POSIX/UNIX
    standardization list (Austin Group List,
    austin/). It would be best resolved there, because one needs
    to find an efficient solution for a UTF-8-enabled UNIX OS, and in
    doing that, one has to take into account things such as how to
    implement efficient file systems. One possible approach might be to
    ensure that any byte string can be represented at the filesystem
    level, with suitable UTF-8 encodings for use in text strings (with
    the property that they can be mapped back to the original byte
    strings), which may vary from context to context. This approach would
    be motivated by the fact that almost all filesystems already work this
    way, and that it would be inefficient to burden them with character
    interpretation schemes. But some filesystems, though rare it seems,
    use a different approach. And when fiddling around with this, one
    needs to assess its effect on the whole UNIX OS, probably by making
    some implementations first. In the meantime, I figure you can invent
    the encoding scheme that best fits your needs.
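
    To make the "lifted back to the original byte strings" property
    concrete, here is a minimal sketch of one possible scheme. It is not
    the scheme discussed on the Austin Group list; it assumes Python 3
    and its "surrogateescape" error handler, which maps each undecodable
    byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF. The helper names
    bytes_to_text and text_to_bytes are hypothetical.

```python
# Sketch only: a reversible mapping between arbitrary filename bytes and
# Unicode text, using Python's "surrogateescape" error handler.
# bytes_to_text / text_to_bytes are illustrative names, not a standard API.

def bytes_to_text(raw: bytes) -> str:
    """Decode as UTF-8; each invalid byte becomes a lone surrogate U+DCxx."""
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text: escaped surrogates turn back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


valid = "na\u00efve.txt".encode("utf-8")  # well-formed UTF-8 filename
broken = b"caf\xe9.txt"                   # Latin-1 byte 0xE9, invalid as UTF-8

# Strict decoding rejects the second name, even though the OS accepts it.
try:
    broken.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

for raw in (valid, broken):
    text = bytes_to_text(raw)
    # The original byte string is always recoverable from the text form.
    assert text_to_bytes(text) == raw
    print(ascii(text))  # ascii() avoids printing lone surrogates directly
```

    The text form of an invalid name contains lone surrogates, so it
    cannot be written out as strict UTF-8; it is meant only as an
    in-program representation that converts back losslessly when the
    name is handed to the OS again.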

       Hans Aberg
