UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)

From: Edward H. Trager (ehtrager@umich.edu)
Date: Tue Dec 14 2004 - 10:32:08 CST


    On Tuesday 2004.12.14 12:50:43 -0000, Arcane Jill wrote:
    > If I have understood this correctly, filenames are not "in" a locale, they
    > are absolute. Users, on the other hand, are "in" a locale, and users view
    > filenames. The same filename can "look" different to two different users.
    > To user A (whose locale is Latin-1), a filename might look valid; to user B
    > (whose locale is UTF-8), the same filename might look invalid.

    Correct. The problem is, however, limited to the characters of
    ISO-8859-1 beyond the ASCII range, chiefly the accented Latin letters.
    The basic Latin alphabet in the ASCII range, which is encoded identically
    in ISO-8859-1 and UTF-8, appears unchanged to both users (the UTF-8 user
    looking at the Latin-1 user's home directory, or the Latin-1 user looking
    at the UTF-8 user's home directory). So both users could probably guess
    the file name they were looking at. For example, here is a file on my
    local machine, a Linux box with the locale set to LANG=en_US.UTF-8:


    The accented "e" in "déclaration" appears correctly under the UTF-8 locale.

    I then copied this file (using scp) to an older Sun Solaris box which I
    do not administer, so I have to live with the "C" POSIX locale that the
    machine is set to. Now, when I view the file names in a terminal (where
    the terminal emulator is set to the same locale), I see:


    The terminal, being set to the legacy locale, does not know how to
    interpret the two bytes that encode the UTF-8 "é".
    Still, I can guess that the first word should be "déclaration".
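
    To make the "two bytes" point concrete, here is a small Python sketch
    (my own illustration, not output from either machine). It shows the
    bytes behind the name and how they read when decoded as Latin-1 or as
    plain ASCII. The helper is_valid_utf8 is a hypothetical name, and what
    a given terminal or ls actually prints for undecodable bytes varies by
    tool.

```python
# "é" is U+00E9, written as an escape so the example does not depend on
# the encoding of this source file.
name = "d\u00e9claration"

utf8_bytes = name.encode("utf-8")      # "é" becomes the two bytes C3 A9
latin1_bytes = name.encode("latin-1")  # "é" becomes the single byte E9

# A Latin-1 viewer handed the UTF-8 bytes shows two characters for "é".
seen_in_latin1 = utf8_bytes.decode("latin-1")

# An ASCII-only viewer cannot decode either byte; here each undecodable
# byte is replaced by U+FFFD. Real tools may print "?" or octal escapes.
seen_in_ascii = utf8_bytes.decode("ascii", errors="replace")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(utf8_bytes, latin1_bytes)
    print(seen_in_latin1, seen_in_ascii)
    # The Latin-1 name is the one that "looks invalid" to a UTF-8 user.
    print(is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))
```

    In both misreadings only the accented letter is damaged; every ASCII
    letter around it survives, which is why the name stays guessable.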

    The solution, as has been pointed out, is for everyone to move to
    UTF-8 locales. In the Linux and Unix world, this is already happening
    for the most part. Solaris 10 now defaults to a UTF-8 locale, at least
    when set to English. Both SuSE and Red Hat default to UTF-8 locales
    for most language and script environments. Open source tools also exist
    for converting file names from one encoding to another on Linux and Unix
    systems. A group of Japanese developers is working on an NLS
    implementation for the BSDs, such as OpenBSD, which are currently "stuck"
    with nothing but the "C" POSIX locale. I think the name of that project
    is "Citrus".
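
    As a rough idea of what a file-name conversion tool has to do, here is
    a minimal Python sketch. It is not the code of any existing tool; the
    function names (to_utf8_name, plan_renames, apply_renames) are
    hypothetical, and it assumes the old names are Latin-1 and that any
    name which already decodes as UTF-8 should be left alone.

```python
import os


def to_utf8_name(raw: bytes, src_encoding: str = "latin-1") -> bytes:
    """Re-encode a raw file name as UTF-8.

    Names that already decode as UTF-8 (including pure ASCII) are
    returned unchanged. This is a heuristic, not a guarantee.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(src_encoding).encode("utf-8")


def plan_renames(directory, src_encoding="latin-1"):
    """List (old, new) byte-string pairs for names that would change."""
    dir_bytes = os.fsencode(directory)
    plan = []
    # Passing bytes to os.listdir yields raw byte names, untouched by
    # the current locale.
    for old in sorted(os.listdir(dir_bytes)):
        new = to_utf8_name(old, src_encoding)
        if new != old:
            plan.append((old, new))
    return plan


def apply_renames(directory, plan):
    """Carry out a plan, skipping any rename that would overwrite a file."""
    dir_bytes = os.fsencode(directory)
    for old, new in plan:
        target = os.path.join(dir_bytes, new)
        if os.path.exists(target):
            continue
        os.rename(os.path.join(dir_bytes, old), target)
```

    Splitting the work into a plan step and an apply step allows a dry run:
    inspect what plan_renames returns before letting apply_renames touch
    anything.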

    -- Ed Trager


    > Is that right, Lars?
    > If so, Marcin, what exactly is the error, and whose fault is it?
    > Jill
    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
    > Behalf Of Marcin 'Qrczak' Kowalczyk
    > Sent: 13 December 2004 14:59
    > To: unicode@unicode.org
    > Subject: Re: Roundtripping in Unicode
    > Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
