Linux and UTF8 filenames

From: Martin Kochanski (unicode@cardbox.net)
Date: Mon Sep 16 2002 - 06:05:21 EDT


First of all, I should say that this question is only tangentially related to Unicode per se; but I am asking it here because I'm sure that many people on this list will know the answer or will know where to point me...

I'm writing a system that uses a server to serve up named objects (macros, in this case) to a client - it's a proprietary protocol, so I have no difficulty with encodings in the client-server communication.

On the server, these objects are stored as separate files, and in an ideal world the filename would be the same as the object name; although in practice some changes have to be made (to allow for illegal filename characters in the object names). The primary aim is to allow the server to translate from object names to filenames when storing and retrieving objects, and from filenames to object names when enumerating them. This is easy; but a useful secondary goal is for the correspondence between filenames and object names to be reasonably transparent so that someone who simply does a directory listing can easily identify object names by looking at the filenames [and can use operating system tools to copy and rename the objects]... and this is where the trouble starts, in particular when dealing with non-ASCII object names.
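The two-way translation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the set of "illegal" characters, the percent-style escape, and the function names `object_name_to_filename` and `filename_to_object_name` are all hypothetical choices, not something the post specifies. Non-ASCII characters are deliberately left untouched so that the filename stays readable in a directory listing.

```python
import re

# Assumption: characters treated as illegal in filenames. '%' is included
# so that the escape character itself is escaped and decoding is unambiguous.
ILLEGAL = set('/\\:*?"<>|%')


def object_name_to_filename(name: str) -> str:
    """Escape illegal and control characters as %XX; keep everything else."""
    out = []
    for ch in name:
        if ch in ILLEGAL or ord(ch) < 0x20:
            # Every escaped character is ASCII, so two hex digits suffice.
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


def filename_to_object_name(filename: str) -> str:
    """Reverse the escaping applied by object_name_to_filename."""
    return re.sub(
        r"%([0-9A-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        filename,
    )
```

Special names such as an empty string, "." and ".." are not handled here and would need their own rule.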

Suppose that I have an object called "Rêve". My aim would be to store this in a file whose name, when it appears in a directory listing, looks like "Rêve".

Windows can use Unicode filenames natively, so there is no problem - just use 0052 00EA 0076 0065 and you've got it.

Linux, to me, is more of a puzzle. The kernel simply treats filenames as a sequence of bytes, so it will happily accept almost anything you throw at it. In particular, 52 EA 76 65 and 52 C3 AA 76 65 are both valid filenames. What I can't immediately work out is what the tools (such as 'ls') will do. Is it universally the case that the tools will assume that those byte-sequence filenames are in UTF8 (in which case the two examples come out as R?ve and Rêve)? Or do they assume a standard locale (perhaps yielding Rêve and RÃªve)? Or is this a switchable option that the user can set? In any case, how can a poor innocent server discover enough about the context in which it is running to know what filename it has to use so that a user who lists a file directory will see "Rêve" on his screen?
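To make the two byte sequences concrete, here is a small sketch of how each one reads under a UTF-8 and an ISO 8859-1 (Latin-1) interpretation, together with one possible way for a process to ask which character encoding its own locale uses. The helper names (`as_seen_in`, `detect_locale_encoding`, `filename_bytes_for_locale`) are hypothetical. It also rests on an assumption that may not hold: that the server process runs under the same locale as the user who later lists the directory.

```python
import locale

NAME = "R\u00eave"  # the example object name, with U+00EA in second position

latin1_bytes = NAME.encode("latin-1")  # 52 EA 76 65
utf8_bytes = NAME.encode("utf-8")      # 52 C3 AA 76 65


def as_seen_in(raw: bytes, encoding: str) -> str:
    """Decode filename bytes the way a tool assuming `encoding` would,
    substituting U+FFFD for anything that is not valid in that encoding."""
    return raw.decode(encoding, errors="replace")


def detect_locale_encoding() -> str:
    """Return the character encoding of this process's locale settings
    (LC_ALL / LC_CTYPE / LANG on Linux)."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")  # adopt the environment's locale
    except locale.Error:
        pass  # unsupported locale in the environment: keep the default
    return locale.getpreferredencoding(False)


def filename_bytes_for_locale(name: str, encoding: str) -> bytes:
    """Encode an object name for use as a filename under `encoding`.
    Raises UnicodeEncodeError if the name cannot be represented."""
    return name.encode(encoding)


if __name__ == "__main__":
    for raw in (latin1_bytes, utf8_bytes):
        print(raw.hex(" "), "| as UTF-8:", as_seen_in(raw, "utf-8"),
              "| as Latin-1:", as_seen_in(raw, "latin-1"))
    print("locale encoding:", detect_locale_encoding())
```

A server could pass the result of `detect_locale_encoding()` to `filename_bytes_for_locale()`, but that reflects only the server's environment; a user with a different locale would still see the name differently.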

This may be a simple question with a one-line answer; but the searches I did didn't seem to give me one, so I hope that someone here can help.

 


