Re: [OT] Unicode filename problems

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 31 2003 - 10:01:02 EDT


    Zip files should have no problem containing files with UTF-8 names.

    In fact the format allows it; the only reason you can't do it is a limitation of the ZIP tool you use, which blindly uses only the encoding of the filesystem from which the archive is created.

    Use the "jar" zip tool from the Java SDK, or its Java interface classes, to create these files. You will have no problem storing UTF-8 filenames in the ZIP's internal table of contents.
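
    As a rough illustration of the same idea, here is a minimal sketch that uses Python's standard zipfile module rather than the jar tool or the Java classes named above. The entry names are made up for the example. The check on flag bit 0x800 relies on how Python's zipfile marks UTF-8 names; that detail is not something stated in this thread.

```python
import io
import zipfile

# Hypothetical entry names with non-ASCII characters
# (U+026B, a Greek word, a Cyrillic word).
names = [
    "\u026b-entry.html",
    "\u03bb\u03ad\u03be\u03b7.html",
    "\u0441\u043b\u043e\u0432\u043e.html",
]

# Build the archive in memory so the local filesystem encoding is never involved.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    for name in names:
        archive.writestr(name, "<html><body>" + name + "</body></html>")

# Reopen the archive and inspect its table of contents.
with zipfile.ZipFile(buffer) as archive:
    stored = archive.infolist()
    stored_names = [info.filename for info in stored]
    # Python's zipfile sets general-purpose flag bit 11 (0x800)
    # on entries whose name was written as UTF-8.
    utf8_flags = [bool(info.flag_bits & 0x800) for info in stored]
```

    The names in stored_names come back as the same Unicode strings that went in, whatever the encoding of the filesystem the script runs on.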

    When you open such a ZIP file with most popular desktop ZIP tools, the tool will still be able to process the content, and will convert the internal filenames "as best it can" to actual filenames for your filesystem when extracting files.

    You have no obligation to extract the compressed files from the ZIP to the filesystem (with its encoding limitations). You can just read them directly from the zip in your application.
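
    Reading straight from the archive might look like the sketch below, again in Python's zipfile module as a stand-in for whatever library your application uses. The helper names build_sample_archive and read_member and the sample entry are hypothetical.

```python
import io
import zipfile


def build_sample_archive() -> bytes:
    """Create a tiny in-memory archive with one non-ASCII entry name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("\u026b.html", "<p>entry for \u026b</p>")
    return buf.getvalue()


def read_member(archive_bytes: bytes, member_name: str) -> str:
    """Return one member's text without extracting anything to disk."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(member_name).decode("utf-8")


page = read_member(build_sample_archive(), "\u026b.html")
```

    Nothing is written to the filesystem here, so its filename limitations never come into play.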

    The name of the ZIP archive is not relevant: you don't really need to internationalize it, and can restrict it to ASCII with a classic .zip or .jar extension.

    Your CDROM can then store this single .zip or .jar without any problem (even without using the Joliet or UDF extensions, which would not work on some *nix or Mac OS systems, or on old Windows 95, as only the legacy basic ISO9660 format is supported universally), and you can use a protocol handler in your browsing app to browse its content using more intuitive names, without extracting files from the compressed archive stored on the CDROM.

    Note also that it is generally not recommended to use Unicode in filenames if you want cross-platform compatibility. The filename is just a reference key, and not the best place to store an actual title metadata string. You can use a mapping file that translates these nicknames to actual titles, for search purposes.
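
    Such a mapping file could be sketched as follows. The entryNNNN.html naming scheme, the JSON format, the sample titles and the helper names build_mapping and find_files are all assumptions made for the example; any stable ASCII key and any simple file format would do.

```python
import json

# Hypothetical entry titles containing non-ASCII characters.
titles = ["\u026bama", "\u03bb\u03cc\u03b3\u03bf\u03c2"]


def build_mapping(entry_titles):
    """Assign each title a plain-ASCII filename that serves as its key."""
    return {
        f"entry{number:04d}.html": title
        for number, title in enumerate(entry_titles, start=1)
    }


def find_files(mapping, query):
    """Return the ASCII filenames whose real title contains the query."""
    return sorted(name for name, title in mapping.items() if query in title)


mapping = build_mapping(titles)

# The mapping file itself: the keys are ASCII, the titles are kept as Unicode text.
mapping_text = json.dumps(mapping, ensure_ascii=False, indent=2)
```

    The files on disk or on the CD then carry only names like entry0001.html, while searches run against the titles in the mapping.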

    Note however that large ZIP files can be slow to access, because a tool that does not use the central directory at the end of the archive has to read the zip sequentially to get to each file. I think that the same is true for the .cab format. Storing an index file as the first file of your archive may speed up searches and random access in a large collection of large ZIP files.
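
    One possible shape for such an index, sketched in Python's zipfile module: the index is written as the first entry and maps each title to the ASCII name of its file. The index name index.json, the JSON layout and the helper names write_archive and lookup are assumptions for illustration only.

```python
import io
import json
import zipfile

INDEX_NAME = "index.json"  # hypothetical name for the index entry


def write_archive(pages):
    """Write the index first, then one ASCII-named entry per page.

    pages maps a (possibly non-ASCII) title to its HTML text.
    """
    index = {
        title: f"entry{number:04d}.html"
        for number, title in enumerate(pages, start=1)
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(INDEX_NAME, json.dumps(index, ensure_ascii=False))
        for title, html in pages.items():
            archive.writestr(index[title], html)
    return buf.getvalue()


def lookup(archive_bytes, title):
    """Read the first entry as the index, then fetch only the wanted page."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        first_entry = archive.infolist()[0]
        index = json.loads(archive.read(first_entry).decode("utf-8"))
        return archive.read(index[title]).decode("utf-8")


data = write_archive({"\u026bama": "<p>one</p>", "\u03bb\u03cc\u03b3\u03bf\u03c2": "<p>two</p>"})
```

    A reader that scans the archive from the start meets the index before anything else, and can then go to the one entry it needs.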

    This discussion is rather off topic. Unicode does not consider filesystem limitations, but gives normative guidance for its correct implementation. Filesystems that claim to support Unicode should support it as per the Unicode specification; otherwise they are flawed and non-conforming.

    -- Philippe.
    ----- Original Message -----
    From: "Raymond Mercier" <RaymondM@compuserve.com>
    To: <unicode@unicode.org>
    Sent: Saturday, May 31, 2003 1:18 PM
    Subject: Re: Fw: Unicode filename problems

    > This question of non-Ascii filenames is a real problem : hardly any
    > software out there can cope with this.
    > I did not know of RAR, but have given it a try. Even here there is a
    > serious problem, because if the filename is non-Ascii the name of the
    > compressed file comes out as _____.rar, with as many underlines as there
    > were characters in the original name. In fact it is a bit less predictable
    > : if the name is Greek, for example, you get Latin letters, if it is
    > Cyrillic, just the underline.
    > This is useless then if you have a number of filenames all with the same
    > number of characters.
    > Certainly more work is needed on RAR (at least on the Win 2000 version).
    >
    > I know about that, since I made my Fontlist 5 work properly with arbitrary
    > non-ascii names :
    > http://ourworld.compuserve.com/homepages/RaymondM/fontlist5.htm .
    >
    > Raymond Mercier
    >
    >
    > At 22:58 30/05/2003 -0500, you wrote:
    > >
    > >I wonder if anyone here has ideas on these matters.
    > >
    > >Peter
    > >
    > >----- Forwarded by Peter Constable/IntlAdmin/WCT on 05/30/2003 10:56 PM
    > >-----
    > >
    > >
    > >I have 3 LinguaLinks lexicons that I have converted into HTML pages - one
    > >for each entry. The languages use non-ANSI characters, so I also did a
    > >Unicode conversion at the same time.
    > >
    > >[snip]
    > >
    > >Everything works very well except that I cannot burn the files onto a CD
    > >because of the unicode values in the filenames. Roxio and Nero CD-burners
    > >don't accept some of the higher values found in the file names (using
    > >Joliet, ISO9660 and UDF). Anyone have any ideas how to deal with this?
    > >For example, a filename with unicode value 026B, a tilde lower case L,
    > >causes problems.
    > >
    > >In the meantime, to get it onto CD, I decided to try and zip all the
    > >files. Turns out almost all the zippers out there DO NOT support Unicode
    > >filenames. Doug Rintoul found WinRAR
    > >(http://www.rarlab.com/rar_archiver.htm) which does the trick in the RAR
    > >format only. There is a RAR expander for Macintosh and Linux systems as
    > >well (all of these are $29 USD). So far, have not found a freeware
    > >solution that meets unicode filename needs. Have any of you run into this
    > >yet?
    > >
    > >I could try to determine what Unicode values are causing problems on the
    > >CD burner and do an unacceptable-to-acceptable character translation in
    > >the filenames and the links to those filenames ... but that seems like a
    > >huge compromise. Also, it will be difficult to come up with a generic
    > >solution ... that is to say, I don't know what RANGE of values are
    > >unacceptable for characters in a CD filename. Joliet is supposed to allow
    > >Unicode filenames according to the documentation I have seen.
    > >
    > >Larry
    >
    >


