Re: Fw: Unicode filename problems

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jun 03 2003 - 07:25:23 EDT

    Note the following ambiguity in the ZIP file format specification:
    [QUOTE]
    file name: (Variable)
    The name of the file, with optional relative path. The path stored should not contain a drive or device letter, or a leading slash. All slashes should be forward slashes '/' as opposed to backwards slashes '\' for compatibility with Amiga and UNIX file systems, etc. If input came from standard input, there is no file name field.
    [/QUOTE]
    There's no clear indication of the encoding used for the filenames in ZIP files. So if the "version made by" field indicates DOS, DOS semantics are assumed (and no support for UTF-8), but the exact codepage is still ambiguous, and the name will probably be displayed with the local codepage of the system on which the ZIP file is read.
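    To see that the ZIP container itself records no charset for the name, here is a minimal sketch (Python's zipfile module writes UTF-8 for non-ASCII names; the header layout follows PKWARE's APPNOTE):

```python
import io
import struct
import zipfile

# Write a ZIP containing one member with a non-ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("café.txt", b"hello")

data = buf.getvalue()
# Local file header fields (30 fixed bytes): signature, version needed,
# flags, method, mod time, mod date, crc, compressed size, uncompressed
# size, name length, extra length.
sig, ver, flags, method, t, d, crc, csz, usz, nlen, elen = struct.unpack_from(
    "<IHHHHHIIIHH", data, 0)
raw_name = data[30:30 + nlen]
print(hex(sig), raw_name)  # the header carries the name as raw bytes,
                           # with no field declaring their encoding
```

    Whether those bytes are CP437, a local DOS codepage, or UTF-8 has to be guessed from out-of-band hints such as the "version made by" byte.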

    The other common problem is that many buggy ZIP tools also ignore the recommendations in the spec, and encode an absolute filename (with a leading /, or sometimes even a drive letter), or even keep backslashes as directory separators.

    For this reason, the Java "JAR" tool and library reduce the supported "features" for the ZIP files they create to a common portable format, assuming a format similar to relative URLs on the web (but without the URL encoding with %NN hex-encoded bytes).
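    The reduction described above can be sketched as a small normalization function (a hypothetical helper, not the actual JAR code):

```python
def to_portable_zip_name(name: str) -> str:
    """Reduce a native path to the relative, forward-slash form the
    ZIP spec recommends: no backslashes, no drive letter, no leading /."""
    name = name.replace("\\", "/")           # backslashes -> forward slashes
    if len(name) >= 2 and name[1] == ":":    # strip a DOS drive letter
        name = name[2:]
    return name.lstrip("/")                  # relative path only

print(to_portable_zip_name(r"C:\docs\readme.txt"))  # docs/readme.txt
print(to_portable_zip_name("/etc/motd"))            # etc/motd
```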

    This is consistent with the usage of ZIP files on Unix and Linux, where the encoding is also not strictly defined: the UFS-like filesystems just store byte sequences for filenames, which will be rendered according to the locale preferences of the local user. On those Unix-like systems, the current locale plays an important role in interpreting filenames from the filesystem!

    This is very unlike NTFS or FAT32 long names on Windows, and Apple HFS on Mac OS, which both explicitly use Unicode: normalized to NFC and serialized with the UTF-16LE encoding scheme on Windows, or normalized to "Apple-HFS-NFD" and serialized with the UTF-8 encoding scheme on Mac OS. The Apple form comes with a restriction: "Apple-HFS-NFD" is a partial decomposition based on Unicode 2.1 which has not been, and will not be, extended to cover a larger Unicode set, so for interoperability reasons the newer Unicode characters are left in their current normalized form when sent to the filesystem to create unique filenames.

    I do think that the NFC form is the best way to handle internationalized filenames that will fit with most OSes (including HFS, because the specific Apple-HFS normalization format is internal to its storage, and applications can safely use only precomposed filenames or reconvert them back to NFC when reading an HFS catalog).
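    The NFC/NFD round trip is easy to demonstrate (note that Python's NFD is the full current Unicode decomposition, not Apple's frozen HFS variant, so this is only an approximation of what HFS stores):

```python
import unicodedata

name = "re\u0301sume\u0301"            # "résumé" typed with combining accents
nfc = unicodedata.normalize("NFC", name)   # precomposed: what I recommend storing
nfd = unicodedata.normalize("NFD", name)   # decomposed: roughly what HFS keeps

print(len(nfc), len(nfd))              # 6 8 — same text, different code point counts
# Reading an HFS-style decomposed name back and reconverting to NFC is lossless:
assert unicodedata.normalize("NFC", nfd) == nfc
```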

    The least ambiguous format is then the one whose "version made by" field carries the NTFS host signature on Windows, and most Zip tools for Windows will now (in their current versions) read such zip catalogs correctly, even on Windows 95/98/98SE/ME.

    The Joliet extension to ISO9660/HSFS is an additional catalog that provides NTFS/FAT32-like long filenames on top of the basic ISO9660 filename format. It does not exempt the application from creating "short names" using the portable ASCII-based filenames, in a way similar to what FAT32 does on Windows to complement the basic FAT format used on DOS/Windows 3.x with its ambiguous "OEM codepage" encoding.
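    Deriving the mandatory short names could look roughly like this (a simplified sketch of an ISO9660 Level 1 "8.3" name generator; real tools like mkisofs also handle collisions, which is omitted here):

```python
import re

def iso9660_short_name(name: str, seq: int = 1) -> str:
    """Sketch: reduce a long name to an ISO9660 Level 1 name, i.e. up to
    8 chars, a dot, up to 3 chars, from [A-Z0-9_], plus a ';1' version."""
    stem, _, ext = name.rpartition(".")
    if not stem:                      # name had no dot at all
        stem, ext = ext, ""
    clean = lambda s: re.sub(r"[^A-Z0-9_]", "_", s.upper())
    stem, ext = clean(stem)[:8], clean(ext)[:3]
    return f"{stem}.{ext};{seq}"

print(iso9660_short_name("résumé-2003.html"))  # R_SUM__2.HTM;1
```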

    When Windows reads a CDROM catalog, it will display the Joliet catalog if present, otherwise the basic ISO9660 catalog. The RockRidge extension is ignored.

    When Unix/Linux reads a CDROM catalog, it will most often display a RockRidge catalog if present (which allows mapping UFS semantics and attributes), ignoring the Joliet catalog, and then fall back to the basic ISO9660 catalog. On Linux, there are other methods to get less basic filenames, including a convention to store an additional catalog file (a simple Unix plain-text file mapping ISO9660 names to long Unix names, but once again with an ambiguous encoding), or additional filesystem drivers that also consider the now-common, Unicode-based Joliet extension.

    So if you use Linux to create CDROM images containing both RockRidge and Joliet catalog extensions, it is expected that you see the correct names on Linux with the same locale. But as you can see, the current locale is important for correct display of the CDROM catalog. Windows does not use this RockRidge catalog by default (unless you install a driver that supports it).

    If the Windows output is "garbled", it just means that the Joliet extension created by your Linux tool is not correctly encoded: either your Linux tool is buggy when it computes the Joliet extension, or your system lacks some support libraries, or the Joliet extension creation works only from a specific locale, is limited to the ISO-8859-1 set, and does not really support Unicode, merely appending a trailing 00 byte to each character to map ISO-8859-1 to UTF-16LE.
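    That "append a 00 byte" shortcut only coincides with real UTF-16LE for characters in the Latin-1 range, which is why it breaks on names like the U+026B example that started this thread:

```python
# For Latin-1 text, zero-extending each byte happens to produce valid
# UTF-16LE, because U+0000..U+00FF map straight through.
s = "café"
fake = bytes(b for ch in s.encode("latin-1") for b in (ch, 0))
assert fake == s.encode("utf-16-le")

# Outside Latin-1 the trick is impossible: 'ɫ' (U+026B) has no Latin-1 byte
# to zero-extend, so a Latin-1-only Joliet writer cannot represent it.
try:
    "\u026b".encode("latin-1")
except UnicodeEncodeError:
    print("U+026B cannot survive a Latin-1-only Joliet writer")
```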

    Normally, Joliet extensions can be read on any localization of Windows, independently of the current Windows locale, even in command-line mode, where the Unicode-based filename is mapped/converted to the current OEM set (which can be changed by the CHCP command-line tool).
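    The down-conversion to the OEM set is lossy for anything outside that codepage; here is an illustration using CP437 (the classic US OEM codepage) as an assumed example:

```python
# A console tool sees the Unicode name converted to the current OEM
# codepage; characters the codepage cannot represent degrade.
name = "café-\u026b.txt"                     # contains é and ɫ (U+026B)
oem = name.encode("cp437", errors="replace")  # CP437 has é, but not ɫ
print(oem.decode("cp437"))                    # café-?.txt
```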

    To Edward: there's no user-settable encoding in a user environment on Windows. Filesystems on Windows are specified to use a global host setting, or an encoding fixed by the filesystem type. The OS will make the appropriate conversions when needed, presenting to the application the Unicode filename label, which must be consistent at least with the ASCII-based encoding of the fallback short name, an equivalent name for accessing the same file. So on Windows each file can have multiple filenames, and this must not create collisions.

    Filesystem encodings (and in some cases also the URLs of some websites that do not respect the correct labeling for their page and form encodings) are really a nightmare. There is no easy solution other than to use filenames only as keys without user-readable semantics. So it is much better to create CDROMs that use only the portable ISO9660 format, plus an additional mapping file to display user-readable labels, stored in that mapping file as UTF-8 or one of the 3 UTF-16 encoding schemes. Your application can then implement a URL resolver to use these names if it uses a web-like navigational system.
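    A minimal sketch of that mapping-file idea (the file name, tab-separated layout, and entries here are all made up for illustration):

```python
import os
import tempfile

# Hypothetical side-car "catalog.map": one "shortname<TAB>label" line per
# file; the file itself is plain UTF-8, so labels can use any script.
entries = {
    "R_SUM__2.HTM": "résumé-2003.html",
    "ENTRY_01.HTM": "\u026b-entry.html",   # a label containing U+026B
}

path = os.path.join(tempfile.mkdtemp(), "catalog.map")
with open(path, "w", encoding="utf-8") as f:
    for short, label in entries.items():
        f.write(f"{short}\t{label}\n")

# Reading the mapping back: portable ISO9660 key -> display label.
labels = {}
with open(path, encoding="utf-8") as f:
    for line in f:
        short, label = line.rstrip("\n").split("\t", 1)
        labels[short] = label

print(labels["R_SUM__2.HTM"])   # résumé-2003.html
```

    The disc itself then only ever carries the portable ASCII keys, and the lossy-encoding problem is confined to one well-labeled UTF-8 file.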

    -- Philippe.
    ----- Original Message -----
    From: "Edward H Trager" <ehtrager@umich.edu>
    > On Fri, 30 May 2003 Peter_Constable@sil.org wrote:
    > > I wonder if anyone here has ideas on these matters.
    > > Peter
    > >
    > > ----- Forwarded by Peter Constable/IntlAdmin/WCT on 05/30/2003 10:56 PM
    > > I have 3 LinguaLinks lexicons that I have converted into HTML pages - one
    > > for each entry. The languages use non-ANSI characters, so I also did a
    > > Unicode conversion at the same time.
    > > [snip]
    > >
    > > Everything works very well except that I cannot burn the files onto a CD
    > > because of the unicode values in the filenames. Roxio and Nero CD-burners
    > > don't accept some of the higher values found in the file names (using
    > > Joliet, ISO9660 and UDF). Anyone have any ideas how to deal with this?
    > > For example, a filename with unicode value 026B, a tilde lower case L,
    > > causes problems.
    >
    > I did a test burning of over 40 UTF-8 file names in seven different
    > scripts (Arabic, Simplified & Traditional Chinese, Greek, Japanese, Latin,
    > and Thai) to a CD in ISO9660 format with both Rockridge (Unix) and Joliet
    > (MS) extensions using Joerg Schilling's Open Source "mkisofs" and
    > "cdrecord" version 2.0 tools
    > (http://www.fokus.gmd.de/research/cc/glone/employees/joerg.schilling/private/mkisofs.html)
    > on Linux (SuSE 7.3).
    >
    > The resulting CD preserved the UTF-8 filenames perfectly: I could view the
    > file names using both "ls" from mlterm (http://mlterm.sourceforge.net/)
    > and from the Mozilla browser when run under a UTF-8 locale (en_US.UTF-8)
    > on Linux.
    >
    > The file names did not appear correct on Windows though, but I think this
    > is only because I don't know how to set the locale properly on Windows
    > 2000.
    >
    > Note that I didn't do anything special when burning the CD: I just burned
    > it using the same options (Rockridge and Joliet extensions) that I always
    > use, and there was no need to zip or tar the files. Email me if you need
    > the details of how to do it.



    This archive was generated by hypermail 2.1.5 : Tue Jun 03 2003 - 08:18:34 EDT