>> what is the status of various UNIXes and lookalikes as far
>> as "Unicode objects", that is, anything named using Unicode
>> encodings like UTF-8 or UTF-16XX, are concerned.
DS> Linux (and probably other Unixes - I don't know) accept arbitrary
DS> byte sequences for filenames, so long as it doesn't include '/',
DS> '\0' and probably the C0 characters.
More precisely, a filename under Unix is a finite sequence of bytes
other than 0x00 and 0x2F (bytes 0x01 through 0x1F are legal); neither
the kernel nor the filesystem code imposes any interpretation on
filenames. That is left to the application: older applications
interpret filenames as ASCII strings, newer ones according to
the external character set of the current locale.
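To make this concrete, here is a small sketch in Python (my example, not
from the original post) showing that on a native Unix filesystem the
kernel stores filename bytes verbatim, including a C0 control byte and a
byte that is not valid UTF-8; the filename and directory here are
invented for illustration, and a non-native filesystem (FAT, or one that
enforces UTF-8) may reject such names:

```python
import os
import tempfile

# Any byte except 0x00 (NUL) and 0x2F ('/') may appear in a filename
# component; 0x01 is a C0 control byte, 0xFF is not valid UTF-8.
with tempfile.TemporaryDirectory() as d:
    raw = b"\x01\xff-not-valid-utf8"          # illustrative name
    path = os.path.join(os.fsencode(d), raw)  # bytes path, no decoding
    with open(path, "wb") as f:
        f.write(b"hello")
    # bytes in, bytes out: listdir() returns the stored bytes unchanged
    print(raw in os.listdir(os.fsencode(d)))
```

On Linux with ext2/ext3-style filesystems this prints True: the kernel
never looked at the bytes as characters at all.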
The above is of course only true if you're using a native filesystem;
most modern Unices support transparent access to FAT and VFAT
filesystems, which may have different requirements.
DS> The userland programs interpret it in the locale character
DS> set. Solaris and other Unixes have UTF8 locales, but Linux won't
DS> really have UTF8 locales until glibc 2.2 comes out.
As you point out, this approach is not quite satisfactory. In
particular, it requires that two locales be implemented for every
single country/language: one that uses the ``legacy'' character set,
and one that uses UTF-8 externally (and probably UCS-4 internally,
although that is implementation-dependent).
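The practical consequence is that the very same filename bytes render
differently under the two locales. A quick sketch (my own illustration;
``café'' is an invented example name): 0xE9 is é in ISO 8859-1, while
UTF-8 spells é as the two bytes 0xC3 0xA9, which a legacy locale would
display as mojibake:

```python
# One conceptual filename, two on-disk byte spellings.
name_latin1 = b"caf\xe9"                 # é as a single ISO 8859-1 byte
name_utf8 = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

print(name_latin1.decode("latin-1"))  # café, under the legacy locale
print(name_utf8.decode("utf-8"))      # café, under the UTF-8 locale
print(name_utf8.decode("latin-1"))    # mojibake: the legacy locale
                                      # shows the UTF-8 bytes as two chars
```

A tool that knows the user's locale can convert between the two, but
nothing in the filesystem records which encoding a given name uses.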
Of course, we may expect the internationalisation tools to generate
data for both locales from a single source, but that hardly strikes me
as an elegant solution.
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:07 EDT