Re: Representing Unix filenames in Unicode

From: Hans Aberg (haberg@math.su.se)
Date: Sun Nov 27 2005 - 11:45:23 CST

Next message: Samuel Thibault: "Re: Representing Unix filenames in Unicode"

Previous message: Philippe Verdy: "Re: Representing Unix filenames in Unicode"
In reply to: Marcin 'Qrczak' Kowalczyk: "Representing Unix filenames in Unicode"
Next in thread: Christopher JS Vance: "Re: Representing Unix filenames in Unicode"
Reply: Christopher JS Vance: "Re: Representing Unix filenames in Unicode"
Reply: Philippe Verdy: "Re: Representing Unix filenames in Unicode"

On 27 Nov 2005, at 16:03, Marcin 'Qrczak' Kowalczyk wrote:

> A common problem of programming languages which use Unicode for
> all their strings (either in the form of code points or UTF-16) is
> interfacing with Unix APIs based on byte strings, and representing
> filenames, environment variables, program invocation arguments etc.
> in the program.
>
> From the point of view of the OS they are arbitrary byte strings,
> usually excluding only NUL. From the point of view of the user they
> are generally meant to be interpreted as text. Their encoding is
> implicit; the locale setting provides a reasonable default. But even
> if the encoding is intended to be UTF-8, the OS doesn't enforce that it
> is valid UTF-8. It's rare when filenames are not valid in the selected
> encoding, and most filenames are ASCII, so only very rare cases are
> truly problematic.
>
> How to convert these byte strings to Unicode?

This problem has recently been discussed on the POSIX/UNIX
standardization list (Austin Group List,
http://www.opengroup.org/austin/). It is probably best resolved there,
because one needs to find an efficient solution for a UTF-8 enabled
UNIX OS, and in doing that, one has to take into account such things
as how to implement efficient filesystems. One possible approach
might be to ensure that any byte string can be represented at the
filesystem level, with suitable UTF-8 encodings for use in text
strings (with the property that they can be mapped back to the
original byte strings), which may vary from context to context. This
approach would be motivated by the fact that almost all filesystems
already work this way, and that it would be inefficient to burden
them with character interpretation schemes. But some filesystems,
though rare it seems, use a different approach. And when fiddling
around with this, one needs to assess its effect on the whole UNIX
OS, probably by making some implementations first. In the meantime,
I figure you can invent the encoding scheme that best fits your
needs.
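
As an illustration only, here is a minimal sketch of one scheme with
the round-trip property described above: every byte string maps to a
text string and can be mapped back unchanged. It assumes Python 3 and
its "surrogateescape" error handler, which postdates this message and
is not a scheme proposed in this thread. The helper names
bytes_to_text and text_to_bytes are hypothetical.

```python
def bytes_to_text(raw: bytes) -> str:
    """Decode a filename as UTF-8, keeping undecodable bytes.

    Each byte that is not part of valid UTF-8 is mapped to a lone
    surrogate in the range U+DC80..U+DCFF instead of raising an error.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text: restore the original byte string."""
    return text.encode("utf-8", errors="surrogateescape")


# One valid UTF-8 name and one that is not (0xE9 is Latin-1 'e acute').
valid = "r\u00e9sum\u00e9.txt".encode("utf-8")
invalid = b"caf\xe9.txt"

for raw in (valid, invalid):
    text = bytes_to_text(raw)
    # The mapping is lossless in both cases.
    assert text_to_bytes(text) == raw
```

The resulting text string is not well-formed Unicode when the input
was invalid, since it contains lone surrogates, so it only round-trips
through the same handler; a strict UTF-8 encoder would reject it.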

Hans Aberg
