>> what is the status of various UNIXes and lookalikes as far
>> as "Unicode objects", that is, anything named using Unicode
>> encodings like UTF-8 or UTF-16XX, are concerned.
DS> Linux (and probably other Unixes - I don't know) accept arbitrary
DS> byte sequences for filenames, so long as it doesn't include '/',
DS> '\0' and probably the C0 characters.
More precisely, a filename under Unix is a finite sequence of bytes
other than 0x00 and 0x2F (bytes 0x01 through 0x1F are legal); neither
the kernel nor the filesystem code imposes any interpretation on
filenames. That is left to the application: older applications
interpret filenames as ASCII strings, newer ones according to
the external character set of the current locale.
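To make this concrete, here is a small sketch in Python (my example, not
from the original post) showing that on a native Unix filesystem the
kernel stores filename bytes verbatim, including a C0 control byte and a
byte that is not valid UTF-8; the filename and directory here are
invented for illustration, and a non-native filesystem (FAT, or one that
enforces UTF-8) may reject such names:

```python
import os
import tempfile

# Any byte except 0x00 (NUL) and 0x2F ('/') may appear in a filename
# component; 0x01 is a C0 control byte, 0xFF is not valid UTF-8.
with tempfile.TemporaryDirectory() as d:
    raw = b"\x01\xff-not-valid-utf8"          # illustrative name
    path = os.path.join(os.fsencode(d), raw)  # bytes path, no decoding
    with open(path, "wb") as f:
        f.write(b"hello")
    # bytes in, bytes out: listdir() returns the stored bytes unchanged
    print(raw in os.listdir(os.fsencode(d)))
```

On Linux with ext2/ext3-style filesystems this prints True: the kernel
never looked at the bytes as characters at all.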
The above is of course only true if you're using a native filesystem;
most modern Unices support transparent access to FAT and VFAT
filesystems, which may have different requirements.
DS> The userland programs interpret it in the locale character
DS> set. Solaris and other Unixes have UTF8 locales, but Linux won't
DS> really have UTF8 locales until glibc 2.2 comes out.
As you point out, this approach is not quite satisfactory. In
particular, it requires that two locales be implemented for every
single country/language: one that uses the ``legacy'' character set,
and one that uses UTF-8 externally (and probably UCS-4 internally,
although that is implementation-dependent).
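The practical consequence is that the very same filename bytes render
differently under the two locales. A quick sketch (my own illustration;
``café'' is an invented example name): 0xE9 is é in ISO 8859-1, while
UTF-8 spells é as the two bytes 0xC3 0xA9, which a legacy locale would
display as mojibake:

```python
# One conceptual filename, two on-disk byte spellings.
name_latin1 = b"caf\xe9"                 # é as a single ISO 8859-1 byte
name_utf8 = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

print(name_latin1.decode("latin-1"))  # café, under the legacy locale
print(name_utf8.decode("utf-8"))      # café, under the UTF-8 locale
print(name_utf8.decode("latin-1"))    # mojibake: the legacy locale
                                      # shows the UTF-8 bytes as two chars
```

A tool that knows the user's locale can convert between the two, but
nothing in the filesystem records which encoding a given name uses.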
Of course, we may expect the internationalisation tools to generate
data for both locales from a single source, but that hardly strikes me
as an elegant solution.
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:07 EDT