RE: Roundtripping Solved

From: Lars Kristan (
Date: Tue Dec 21 2004 - 01:43:34 CST


    Mike Ayers wrote:
    > Things that are impossible that I've noticed so far:
    > - A metainformation system without holes in it.

    UNIX filesystems (OK, old ones) are an example of an information system that
    does not have metainformation about the encoding.

    As for the holes, there are some gray areas in my solution, but they can be
    worked out.

    > - Addressing files with intermixed locales reliably.
    > In a UTF-8 and ISO 8859-1 mixed environment, for instance,
    > there is no way to know whether <c3> <a9> indicates "Ã©" or
    > "é". The Unix locale architecture does not permit mixed
    > locales. What you propose is a locale of "ISO 8859-1 or
    > UTF-8, your guess is as good as mine".

    On UNIX, addressing files has nothing to do with locales. Each file can be
    addressed reliably, in any locale (*). It is only the interpretation that is
    not reliable. And UNIX locale architecture definitely DOES permit mixed
    locales. Hence the issue. And the "ISO 8859-1 or UTF-8, your guess is as
    good as mine" is not something I am trying to introduce. It is already
    there. What I am trying to do is allow that confusion to endure a while
    longer, which is not bad in itself. I think it can actually make the
    transition to UTF-8 quicker, not slower.
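
To make the distinction between addressing and interpretation concrete, here is a minimal Python sketch. It assumes a POSIX-style filesystem that stores names as opaque bytes; the helper names `interpretations` and `roundtrip_on_disk` are made up for this illustration and are not part of any proposal in this thread.

```python
import os
import tempfile

# The two bytes from the quoted example, plus an extension.
RAW_NAME = b"\xc3\xa9.txt"


def interpretations(name: bytes) -> dict:
    """Decode the same bytes under both candidate encodings."""
    return {
        "utf-8": name.decode("utf-8"),
        "iso-8859-1": name.decode("iso-8859-1"),
    }


def roundtrip_on_disk(name: bytes) -> bool:
    """Create a file by its byte name and find it again, without decoding."""
    with tempfile.TemporaryDirectory() as tmp:
        directory = os.fsencode(tmp)
        path = os.path.join(directory, name)
        with open(path, "wb") as handle:
            handle.write(b"data")
        # Passing bytes to os.listdir returns the names as bytes.
        return name in os.listdir(directory) and os.path.exists(path)


if __name__ == "__main__":
    print(interpretations(RAW_NAME))
    print(roundtrip_on_disk(RAW_NAME))
```

The byte-level lookup never consults an encoding, so it behaves the same in any locale. Only the two decoded strings differ.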

    (*) MBCS can have some issues, similar to those of UTF-8. But: A - a lot of
    it does work, B - what doesn't is a pain, C - those users typically mix only
    an MBCS and ASCII (so, no mix at all). Europe, on the other hand, already
    mixes several Latin encodings. When that gets mixed with UTF-8, problems
    will be more frequent than they are with MBCS.

    > - A scheme that translates all possible Unix
    > filenames to unique and consistent Windows filenames. Case
    > issues alone kill this.

    Well, Windows actually can handle filenames case-sensitively. But yes, that
    ability is not widely used.
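
As a rough illustration of the case issue, the following Python sketch groups names that would collide under case-insensitive matching. It uses `str.casefold()` as an approximation; Windows applies its own case-mapping rules, so treat the result as indicative only. The function name `case_collisions` is hypothetical.

```python
from collections import defaultdict


def case_collisions(names):
    """Return groups of names that differ only by case (approximate)."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Keep only the groups where more than one name maps to the same key.
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(case_collisions(["Makefile", "makefile", "README"]))
```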

    A reliable translation of UNIX filenames to Windows filenames is just one of
    the possible goals (or uses) of my approach. If a 100% reliable solution cannot
    be found, it does not mean that we shouldn't be looking for the next best

    My specific requirements were to store UNIX filenames in a Windows database
    and allow proper display of them on Windows. Case issues, '*' in filenames
    and such are no problem for that part of the requirements.
    I've seen filenames consisting solely of a newline, and I can deal with them.
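
One existing mechanism with a similar goal is Python's "surrogateescape" error handler (PEP 383). The sketch below is only an illustration of lossless byte-to-text roundtripping, not the scheme proposed in this thread, and the helper names are invented for the example. Note also that the lone surrogates it produces are not well-formed Unicode, so whether a given database accepts them is an open assumption.

```python
def to_text(raw: bytes) -> str:
    """Decode as UTF-8; undecodable bytes become lone surrogates U+DC80..U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Reverse of to_text: escaped surrogates turn back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def for_display(text: str) -> str:
    """Show escaped bytes as U+FFFD so the name can be rendered safely."""
    return "".join(
        "\ufffd" if 0xDC80 <= ord(ch) <= 0xDCFF else ch for ch in text
    )


if __name__ == "__main__":
    # Valid UTF-8, a Latin-1 byte, and a name that is just a newline.
    for raw in (b"caf\xc3\xa9", b"caf\xe9", b"\n"):
        text = to_text(raw)
        print(repr(for_display(text)), to_bytes(text) == raw)
```

The stored string keeps enough information to recover the exact bytes, while the display form stays separate and lossy.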

    But let's do talk about translating UNIX filenames to Windows filenames.
    Users who need the interoperability have learned not to use tricky
    filenames, and not to use filenames that differ only in case (which is
    a bad idea in itself, since our brains do not process it well). So they
    adapted and have no problems now. But they have been using legacy encodings,
    often more than one. That was tolerable, especially for users with lots of
    files in a language where only a few letters are non-ASCII: they were always
    able to figure out which file is which. The mix affected only the display,
    never access. A switch to UTF-8, however, will bring up lots of issues for them.
    You think they will welcome the day and say "finally, I can solve this
    mess". I think they will say "oh darn, it all worked before, is this really

    Getting rid of legacy encodings is a goal. But not for many users. For most
    of them filenames are just a tool. Their business comes first. Some can't
    afford to dedicate a day to convert all the filenames.


    This archive was generated by hypermail 2.1.5 : Tue Dec 21 2004 - 01:45:16 CST