Re: Representing Unix filenames in Unicode

From: Neil Harris
Date: Mon Nov 28 2005 - 13:49:02 CST

    Hans Aberg wrote:
    > On 28 Nov 2005, at 03:44, Doug Ewell wrote:
    >> Whatever you guys decide, please let's not have any proposals to
    >> "improve" UTF-8, or invent a mutant form of UTF-8, by giving it a way
    >> to map these arbitrary byte sequences bijectively while
    >> simultaneously retaining the existing properties of UTF-8. We had
    >> that discussion a while back. The first one to suggest "fixing"
    >> UTF-8 automatically loses.
    > My guess is that it is simplest to store UTF-8 names as is as
    > byte-strings on the low level, possibly with some information whether
    > it is ASCII or UTF-8 (or possibly some other encoding), which is important
    > in UNIX. Then the problem arises what to do when low-level filenames appear
    > which cannot be given a UTF-8 interpretation. Making the low-level file
    > handling bother with that seems to be a bad idea: it does
    > not need that, and interpretations will just complicate and slow
    > things down. So then the idea I presented is to simply encode this to
    > consistent UTF-8 in a way that the original byte string can be converted
    > back. A UNIX context may though need more than one invertible
    > byte-string UTF-8 encoding, say if one is considering filenames,
    > filepaths or filepath sequences. The question is truly tricky though.
    > One must think through what will happen with all standard UNIX
    > programs that interpret byte strings and character strings. So I
    > would prefer to leave it to those UNIX experts to work it out.
    > Hans Aberg
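
    For illustration only, here is a minimal sketch of the kind of
    invertible byte-string-to-UTF-8 mapping described above. It is not
    anything proposed in this thread: the names escape_name and
    unescape_name and the choice of "%" as an escape character are
    assumptions made for the example. It leans on Python's
    "surrogateescape" error handler to find the bytes that are not part of
    valid UTF-8, then writes them as "%XX" so the result is ordinary, valid
    UTF-8 rather than a modified form of it.

```python
def escape_name(raw: bytes) -> str:
    """Map an arbitrary byte string to a string that encodes as valid UTF-8."""
    # Bytes that are not part of valid UTF-8 come back as lone surrogates
    # in the range U+DC80..U+DCFF (one surrogate per offending byte).
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "%":
            out.append("%25")                  # escape the escape character
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append(f"%{cp - 0xDC00:02X}")  # undecodable byte as %XX
        else:
            out.append(ch)                     # valid character, kept as is
    return "".join(out)


def unescape_name(text: str) -> bytes:
    """Inverse of escape_name; only defined on strings escape_name produced."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "%":
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


raw = b"caf\xe9_100%.txt"        # Latin-1 style name, not valid UTF-8
escaped = escape_name(raw)
assert escaped.encode("utf-8")   # the escaped form is plain UTF-8
assert unescape_name(escaped) == raw
```

    The mapping is one-to-one from byte strings into UTF-8 strings, but
    not every UTF-8 string is the image of some byte string ("%zz" is
    not), and a literal "%" in a name grows to "%25". Whether that cost is
    acceptable to standard UNIX programs is exactly the open question.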
    The set of ASCII strings is a proper subset of the set of UTF-8 strings,
    so no information would need to be stored about which of those two
    encodings was being used.
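
    A small sketch of that subset relationship, with hypothetical helper
    names (is_ascii, is_utf8) chosen just for this example: every
    pure-ASCII byte string decodes to the same text under either
    encoding, while some UTF-8 strings are not ASCII.

```python
def is_ascii(raw: bytes) -> bool:
    # ASCII uses only the byte values 0x00..0x7F.
    return all(b < 0x80 for b in raw)


def is_utf8(raw: bytes) -> bool:
    # Python's decoder rejects anything that is not well-formed UTF-8.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


ascii_name = b"README.txt"
utf8_name = "caf\u00e9.txt".encode("utf-8")

# Every ASCII string is a UTF-8 string with the same meaning...
assert is_ascii(ascii_name) and is_utf8(ascii_name)
assert ascii_name.decode("ascii") == ascii_name.decode("utf-8")

# ...but not the other way round, so the subset is proper.
assert is_utf8(utf8_name) and not is_ascii(utf8_name)
```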

    Now, ISO 8859-1, that's a different matter -- I suppose you could still
    use the property that _almost all_ non-pure-ASCII ISO 8859-1 natural
    language strings are not also valid UTF-8 strings for backwards
    compatibility, and ditto for most other fixed 8-bit encodings, but I
    certainly wouldn't be willing to trust my filesystem to this sort of hack.
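
    To make the hack concrete, here is a minimal sketch of the
    "try UTF-8, fall back to ISO 8859-1" guess, including a case where it
    guesses wrong. The function name guess_decode is an assumption made
    for this example.

```python
def guess_decode(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_guess) using the try-UTF-8-first heuristic."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte string is decodable as ISO 8859-1, so this never fails.
        return raw.decode("iso-8859-1"), "iso-8859-1"


# Typical case: a Latin-1 name with an accented letter is not valid UTF-8,
# so the fallback picks the right encoding.
latin1_name = "caf\u00e9".encode("iso-8859-1")          # b"caf\xe9"
assert guess_decode(latin1_name) == ("caf\u00e9", "iso-8859-1")

# Failure case: the Latin-1 string "A-tilde, copyright sign" is the byte
# pair C3 A9, which is also the valid UTF-8 encoding of "e-acute".
ambiguous = "\u00c3\u00a9".encode("iso-8859-1")         # b"\xc3\xa9"
text, guess = guess_decode(ambiguous)
assert guess == "utf-8" and text == "\u00e9"            # silently misread
```

    Such collisions are rare in natural-language names, but nothing in
    the bytes themselves says which reading was intended.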

    -- Neil
