Re: unicode on Linux

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Thu Oct 23 2003 - 16:42:19 CST


Stefan Persson wrote:
> Stephane Bortzmeyer wrote:
>
>> I do not agree. It would mean *each* application has to normalize
>> because it cannot rely on the kernel. It has huge security
>> implications (two file names with the same name in NFC, so visually
>> impossible to distinguish, but two different strings of code points).
>
> Couldn't this cause problems if copying two files to a floppy on a
> system NOT normalising the data ...

An even bigger problem, as far as I know, is that Unix/Linux file systems store filenames as plain
sequences of bytes (any byte value is allowed except 0 and the ASCII code for '/') and do not
enforce any particular encoding. You cannot rely on a filename being in UTF-8, let alone normalized,
unless you know which application generated it and how that application works.
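
To illustrate, here is a minimal Python sketch of what an application has to check for each raw
filename it reads back. The function name describe_filename and the three sample names are made up
for this example; real input would come from something like os.listdir(b"."), which returns bytes.

```python
import unicodedata


def describe_filename(raw: bytes) -> str:
    """Classify a raw filename (hypothetical helper, for illustration only)."""
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # The file system accepted these bytes, but they are not UTF-8.
        return "not UTF-8"
    if unicodedata.normalize("NFC", text) == text:
        return "UTF-8, NFC"
    return "UTF-8, not NFC"


# Three byte strings a file system would store as three distinct names.
precomposed = "caf\u00e9".encode("utf-8")    # e-acute as the single code point U+00E9
decomposed = "cafe\u0301".encode("utf-8")    # 'e' followed by combining acute U+0301
latin1 = "caf\u00e9".encode("latin-1")       # same text, written by a Latin-1 application

for raw in (precomposed, decomposed, latin1):
    print(raw, "->", describe_filename(raw))
```

The first two names look the same on screen and are equal after NFC, yet they are different byte
sequences, so both can exist in the same directory.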

If you want to be safe with filenames on Unix/Linux, you may need to use your own, custom
normalization+encoding to map Unicode strings to ASCII. Within your system, you can then control the
normalization etc. (As an example of an encoding to ASCII, you could use the one that IMAP defines
for folder names, a modified variant of UTF-7, because it is designed with the Unix filesystem in mind.)
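
A rough Python sketch of such a mapping, assuming NFC as the normalization form and the modified
UTF-7 rules as I understand them from the IMAP specification (RFC 3501, section 5.1.3): printable
ASCII stands for itself, '&' becomes "&-", and everything else is UTF-16BE in a BASE64 variant that
uses ',' instead of '/' and no padding. The names to_ascii_filename and _b64_chunk are made up for
this example. Since '/' passes through the IMAP encoding unchanged, the sketch simply rejects it;
that is my own choice here, not part of the IMAP definition.

```python
import base64
import unicodedata


def _b64_chunk(chars: str) -> str:
    # Modified BASE64: UTF-16BE input, ',' instead of '/', no '=' padding,
    # shifted in with '&' and out with '-'.
    b64 = base64.b64encode(chars.encode("utf-16-be")).decode("ascii")
    return "&" + b64.rstrip("=").replace("/", ",") + "-"


def to_ascii_filename(name: str) -> str:
    """Map a Unicode string to an ASCII-only name (illustrative sketch)."""
    name = unicodedata.normalize("NFC", name)  # assumption: NFC is the chosen form
    if "/" in name:
        # Assumption: a single path component is expected, so '/' is an error.
        raise ValueError("'/' is not allowed in a single filename")
    out = []
    pending = []  # run of characters waiting to be BASE64-encoded
    for ch in name:
        if 0x20 <= ord(ch) <= 0x7E:
            if pending:
                out.append(_b64_chunk("".join(pending)))
                pending.clear()
            out.append("&-" if ch == "&" else ch)
        else:
            pending.append(ch)
    if pending:
        out.append(_b64_chunk("".join(pending)))
    return "".join(out)


print(to_ascii_filename("caf\u00e9"))    # precomposed e-acute
print(to_ascii_filename("cafe\u0301"))   # decomposed form, same result after NFC
```

Because the normalization happens before the encoding, both spellings of the name end up as the
same ASCII string, which is the point of controlling the mapping yourself.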

markus


