Re: Q: Filesystem Encoding

From: Jungshik Shin (jshin@mailaps.org)
Date: Wed Jul 10 2002 - 13:07:48 EDT


On Wed, 10 Jul 2002, Barry Caplan wrote:

> At 08:43 AM 7/10/2002 -0400, Jungshik Shin wrote:
> >> In short: should I still stick to ASCII alone in filenames, or are there
> >> filesystems where I really don't have to anymore? Thanks in advance.
> >
> > Definitely/unconditionally no for NTFS. As for Linux ext2(and most other
> >Unix fs'), unless you mix up UTF-8 and legacy encodings (which you
> >wouldn't because you have never used non-ASCII), it's all right to switch
> >to UTF-8 and use non-ASCII chars.
>
> But be aware that such filenames may or may not be able to be
> transferred *across* file systems.

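 You're absolutely right. Another related problem is normalization.
For instance, Mac OS X (HFS+) stores filenames in a decomposed form
close to NFD, while filenames created on NTFS are typically precomposed
(NFC), since NTFS itself does not normalize names. I haven't dug up
what's planned about this on the Unix fs and NFS front. Some Unix
fs-related APIs may have to be extended to deal with normalization
forms (NFs).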
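
  To make the normalization problem concrete, here is a minimal Python
sketch. It is only an illustration under stated assumptions: the sample
name and the helper same_filename() are hypothetical, and comparing
after normalizing to NFC is one possible policy, not something any
particular filesystem is known to do.

```python
import unicodedata

# Sample name containing a precomposed e-acute (U+00E9); purely illustrative.
name = "caf\u00e9.txt"

nfc = unicodedata.normalize("NFC", name)  # precomposed: U+00E9
nfd = unicodedata.normalize("NFD", name)  # decomposed: 'e' + U+0301

def same_filename(a, b):
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The two forms look identical on screen but are different strings,
# and they encode to different UTF-8 byte sequences.
print(nfc == nfd)                                           # False
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))   # 9 10
print(same_filename(nfc, nfd))                              # True
```

  A byte-wise comparison, which is what a filesystem that treats names
as opaque byte strings performs, sees two different files here.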

> Not only that, but, although I haven't tested in detail for a while,
> I would not be fully comfortable with middleware that is responsible for
> managing file names across systems either, such as FTP, email attachments,
> and Samba. Particularly in the case of FTP and email, just because one
> client works does not mean another one will.

  Samba 3.0 appears to support Unicode (see
http://sambaaxp.org/xamba_XP_2002/vergeichick.pdf). BTW, from my own
experience, I know that the codepage-based (non-Unicode) support
in Samba 2.x works well between Win2k and Unix.

  As for email attachments, one should stick to IETF RFC 2231. Of course,
not all email clients comply with RFC 2231 (Mozilla and Pine are among
those that do), but I think that's the best way to get your filenames
across. Even fewer web clients and servers abide by RFC 2231 when it
comes to the HTTP Content-Disposition header (the same header used for
email attachments). Actually, I haven't seen any: Mozilla 1.x, Lynx 2.8,
and MS IE 6 all lack support for it. Hopefully, this will change
(e.g. http://bugzilla.mozilla.org/show_bug.cgi?id=155949).
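
  For reference, an RFC 2231 parameter carries the charset, an optional
language tag, and the percent-encoded value, in the form
name*=charset'language'value. The Python sketch below builds and parses
such a parameter with the standard library. The helper names
rfc2231_param() and decode_ext_value() and the sample Korean filename
are assumptions made for illustration, and the sketch ignores RFC 2231
continuations (name*0*, name*1*, ...).

```python
from email.utils import encode_rfc2231
from urllib.parse import unquote

# Sample filename: two Hangul syllables (U+D55C, U+AE00) plus ".txt".
filename = "\ud55c\uae00.txt"

def rfc2231_param(name, value, charset="utf-8"):
    """Hypothetical helper: format one parameter as name*=charset''value."""
    return "%s*=%s" % (name, encode_rfc2231(value, charset=charset))

def decode_ext_value(ext):
    """Hypothetical helper: decode charset'language'percent-encoded-value."""
    charset, _language, encoded = ext.split("'", 2)
    return unquote(encoded, encoding=charset)

header = "Content-Disposition: attachment; " + rfc2231_param("filename", filename)
print(header)
# Content-Disposition: attachment; filename*=utf-8''%ED%95%9C%EA%B8%80.txt

# Round trip: take the part after "*=" and decode it again.
ext_value = header.split("*=", 1)[1]
print(decode_ext_value(ext_value) == filename)  # True
```

  Because the encoded header is pure ASCII, it passes through mail
transports unchanged; whether the receiving client decodes it is the
compliance question discussed above.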

  Some IETF drafts and RFCs have been written about I18N of FTP
and are available at
http://www.ietf.org/html.charters/ftpext-charter.html.
This is by no means to say that one can use Unicode (UTF-8) for FTP
right now, except when using Kermit.

> Also keep in mind that even if the file name transfers exactly correct,
> there is no guarantee, except, for ASCII characters, that the system
> will have fonts to display the file name.

  Well, not being able to display a filename is a different kind of
problem from not being able to get it across intact. Moreover,
two parties exchanging filenames in, say, Chinese/Finnish/Thai/...
are likely to have the necessary fonts.

  Jungshik Shin


