RE: UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)

From: Lars Kristan (
Date: Wed Dec 15 2004 - 05:38:30 CST

    Edward H. Trager wrote:
    > UTF-8's home directory). So both users could probably guess
    > the filename
    > they were looking at.
    Which, BTW, is true for most of Europe but is not true for some other
    combinations of locales.

    > d??claration_des_droits.utf8
    > The terminal, being set to interpret the legacy locale, does not know
    > how to interpret the two bytes that are used for the UTF-8 "é".

    This is well known but is only the start of what the thread was discussing.

    Your example only shows a difference in interpretation. You are still able
    to copy and paste the filename, use it in scripts and open it in any

    Now switch your locale to Latin 1 and create a file with that name in Latin
    1. Switch back to UTF-8 and try doing various things with this file. I
    assume the following happens:

    1 - Instead of letters being misinterpreted, they are lost, leading to empty
    filenames in extreme cases.
    2 - You cannot open the file by copying its name from the terminal.
    3 - You can probably still specify it in scripts (which need to be edited in
    Latin 1), but if someone started validating the script in a UTF-8
    locale, you would lose that ability.
    4 - Most C programs should be able to process the file. But I would not bet
    on some more 'advanced' languages. The more they comply with Unicode, the
    less likely it is they will open the file.
    5 - Windows is likely to have problems accessing that file.
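
    The difference between the two directions can be sketched in Python on
    raw bytes, without touching a real file system. The file name is the one
    from the quoted example; the helper seen_as() and the variable names are
    assumptions made up for this illustration, and a real terminal or shell
    may behave differently from a plain decode.

```python
# Illustrative sketch only: works on byte strings, not on real files.
NAME = "déclaration_des_droits"

def seen_as(raw: bytes, encoding: str, errors: str = "strict") -> str:
    """Decode raw file-name bytes the way a program in that locale might."""
    return raw.decode(encoding, errors)

utf8_bytes = NAME.encode("utf-8")      # name created in a UTF-8 locale
latin1_bytes = NAME.encode("latin-1")  # name created in a Latin 1 locale

# Case A: UTF-8 name viewed in Latin 1. Every byte maps to some character,
# so the name looks wrong but converts back to exactly the same bytes.
misread = seen_as(utf8_bytes, "latin-1")
roundtrip_ok = misread.encode("latin-1") == utf8_bytes

# Case B: Latin 1 name viewed in UTF-8. The lone byte 0xE9 is not a valid
# UTF-8 sequence, so strict decoding fails ...
try:
    seen_as(latin1_bytes, "utf-8")
    strict_fails = False
except UnicodeDecodeError:
    strict_fails = True

# ... and lenient decoding drops or replaces the letter, so the displayed
# name no longer converts back to the bytes stored on disk.
dropped = seen_as(latin1_bytes, "utf-8", "ignore")
replaced = seen_as(latin1_bytes, "utf-8", "replace")
lossy = replaced.encode("utf-8") != latin1_bytes

# Extreme case: a name made only of non-ASCII letters vanishes entirely.
empty = seen_as("éà".encode("latin-1"), "utf-8", "ignore")
```

    Case A is the situation in the quoted example (misinterpreted but
    recoverable); case B corresponds to points 1 and 2 above, where the
    text shown to the user can no longer be turned back into the original
    file name.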

    And, yes, the solution is still to convert all filenames to UTF-8. That is,
    if all users on a particular system agree that this is what should be done
    with their files. But that does not prevent such files from being generated,
    whatever the reason or cause is.
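
    A minimal sketch of such a conversion, again on raw bytes only. It
    assumes that every name which is not valid UTF-8 is in one known legacy
    encoding (Latin 1 here); to_utf8_name() is a hypothetical helper, not an
    existing tool, and actually renaming files is left out.

```python
def to_utf8_name(raw: bytes, legacy: str = "latin-1") -> bytes:
    """Return raw unchanged if it is already valid UTF-8, else transcode it.

    Assumption: any name that is not valid UTF-8 is in the legacy encoding.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave it alone
    except UnicodeDecodeError:
        # Reinterpret the bytes as the legacy encoding and re-encode.
        return raw.decode(legacy).encode("utf-8")

converted = to_utf8_name(b"d\xe9claration_des_droits")
untouched = to_utf8_name("déclaration_des_droits".encode("utf-8"))
```

    A Latin 1 name whose bytes happen to form valid UTF-8 would be left
    as it is by this check, so the guess is not reliable in every case, and
    a one-time pass like this does nothing about names created later.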


    This archive was generated by hypermail 2.1.5 : Wed Dec 15 2004 - 05:47:44 CST