Re: Representing Unix filenames in Unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 30 2005 - 09:43:56 CST

    From: "Antoine Leca" <Antoine10646@Leca-Marti.org>
    > What I meant was more like "Humans should not do things that (perhaps
    > other) humans cannot understand later."
    >
    > And as Doug Ewell said, trying to correct these kinds of actions is often
    > fruitless, and very often misleading if not broken.

    Completely agree. Trying to "fix" UTF-8 for such things is bogus in its
    basic design, because it breaks UTF-8's inherent stability and completeness
    for its intended purpose.

    What Chris Jacobs and Hans Aberg are trying to defend is a bad design
    decision: trying to mix in the same representation two things that belong to
    distinct implementation levels. UTF-8 is meant to represent Unicode-encoded
    texts. Nothing more.

    If you need to represent other kinds of data in some text representation,
    you need an upper-layer protocol on top of UTF-8, but you MUST NOT break
    UTF-8 itself by relaxing some of its encoding rules. (When you do that, you
    think you are creating a bijection, but you are wrong as soon as you admit
    that there are exceptions: those unhandled names are even more dangerous
    from a security perspective!)

    Upper-layer protocols already do exist today, and they do provide a TRUE
    (and PROVEN) bijection with ALL possible filenames supported by ALL
    filesystems:
    * shell escaping syntaxes
    * various MIME encodings (including "Quoted-Printable", though I don't like
    the way it uses the = sign, as it interacts very badly with shell commands)
    * URL encoding syntaxes (notably with the "file:" URI namespace prefix)
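
    To make the round-trip property of the third item concrete, here is a
    minimal sketch in Python. The helper names encode_name and decode_name are
    only illustrative; the urllib.parse calls are standard library. It assumes a
    filename is an arbitrary byte string and percent-encodes every byte outside
    the unreserved set, so decoding the encoded form always gives back the
    original bytes. (Strictly, this is a bijection between byte strings and
    their canonical encoded forms; other spellings such as "%61" for "a" also
    decode, but the encoder never produces them.)

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def encode_name(raw: bytes) -> str:
    """Percent-encode a raw filename (hypothetical helper).

    safe="" means that "/" is escaped too, so the result is a single
    path segment made only of ASCII characters.
    """
    return quote_from_bytes(raw, safe="")


def decode_name(text: str) -> bytes:
    """Inverse of encode_name: recover the exact original bytes."""
    return unquote_to_bytes(text)


if __name__ == "__main__":
    # A name that is not valid UTF-8 and contains a space and a newline.
    raw = b"\xff\xfe bad\n"
    encoded = encode_name(raw)
    print(encoded)
    print(decode_name(encoded) == raw)
```

    Nothing here requires the name to be valid UTF-8, which is the point: the
    escaping layer sits on top of the text encoding instead of changing it.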

    I would recommend the third option for interaction with filesystems,
    because it degrades cleanly to simpler (and user-friendly) syntaxes for
    filenames that do not cause problems, notably filenames that use strict
    UTF-8 encoding, in the stable NFC form, do not start with "file:", and do
    not contain confusable or invisible format control characters. For
    filenames that do not meet those conditions, the URL encoding will always
    be non-confusable.
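
    The degradation rule above could be sketched roughly as follows, again in
    Python. The names display_name and parse_name are hypothetical, and using a
    bare "file:" prefix followed by the percent-encoded bytes is a simplifying
    assumption (a real file: URI would carry a full path). The "confusable"
    check is deliberately left out, since the standard library has no
    confusables data; only control (Cc) and format (Cf) characters are tested.

```python
import unicodedata
from urllib.parse import quote_from_bytes, unquote_to_bytes

PREFIX = "file:"  # assumed marker for the escaped form


def display_name(raw: bytes) -> str:
    """Return the plain name when it is harmless, else the escaped form."""
    escaped = PREFIX + quote_from_bytes(raw, safe="")
    try:
        text = raw.decode("utf-8")  # strict: rejects ill-formed sequences
    except UnicodeDecodeError:
        return escaped
    if unicodedata.normalize("NFC", text) != text:
        return escaped  # not in NFC
    if text.startswith(PREFIX):
        return escaped  # would be mistaken for the escaped form
    if any(unicodedata.category(ch) in ("Cc", "Cf") for ch in text):
        return escaped  # control or invisible format characters
    return text


def parse_name(shown: str) -> bytes:
    """Inverse of display_name: recover the original bytes."""
    if shown.startswith(PREFIX):
        return unquote_to_bytes(shown[len(PREFIX):])
    return shown.encode("utf-8")


if __name__ == "__main__":
    for raw in (b"report.txt", b"\xff.txt", "e\u0301".encode("utf-8")):
        shown = display_name(raw)
        print(shown, parse_name(shown) == raw)
```

    Because a plain name is never allowed to start with "file:", the two forms
    cannot be confused, and parse_name recovers the bytes in either case.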

    The third option also interacts cleanly with shell commands under Unix
    (inherently allowing the escaping of more characters that may have special
    syntactic meanings in a shell, such as quotation marks, dollar signs,
    braces, pipes...).
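
    As a rough check of that claim, the following Python sketch (shell_inert is
    a hypothetical helper name) verifies that a fully percent-encoded name only
    ever contains ASCII letters, digits, and the characters "-", ".", "_", "~"
    and "%". Quotation marks, dollar signs, braces and pipes therefore never
    survive the encoding. Note that a leading "~" or "-" may still need care in
    a shell (tilde expansion, option parsing), so this is not a complete
    quoting solution.

```python
import string
from urllib.parse import quote_from_bytes

# The only characters quote_from_bytes(..., safe="") can emit:
# unreserved characters plus the "%" that introduces each escape.
ALLOWED = set(string.ascii_letters + string.digits + "-._~%")


def shell_inert(raw: bytes) -> bool:
    """True if the encoded form of raw uses only the ALLOWED characters."""
    encoded = quote_from_bytes(raw, safe="")
    return set(encoded) <= ALLOWED


if __name__ == "__main__":
    tricky = b'a"b$c{d}|e'
    print(quote_from_bytes(tricky, safe=""))
    print(shell_inert(tricky))
```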

    For Windows, where the "%" sign has a special meaning in command lines, one
    could replace it with "$", and make sure that literal % and $ signs in
    filenames are both URL-encoded ("$" is also used in Unix/Linux shells for
    variable substitution, in a way quite comparable to "%" on Windows). Yes, it
    breaks the normal URL encoding, but this would only create an alternate
    URL-encoding form; alternatively, one could use a shell other than COMMAND
    or CMD. In any case, this would not affect filesystem APIs that would accept
    URLs instead of native filenames.
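
    A minimal sketch of that alternate form, assuming Python and two
    hypothetical helpers, encode_dollar and decode_dollar: the name is first
    percent-encoded with every reserved character escaped (so literal "%" and
    "$" become %25 and %24), and then each remaining "%" is swapped for "$".
    Since no literal "%" or "$" is left after the first step, the swap is
    reversible.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def encode_dollar(raw: bytes) -> str:
    """Percent-encode, then use "$" instead of "%" as the escape sign."""
    # After quoting with safe="", every "%" in the string is an escape
    # introducer, and no "$" is present, so the replacement is unambiguous.
    return quote_from_bytes(raw, safe="").replace("%", "$")


def decode_dollar(text: str) -> bytes:
    """Inverse of encode_dollar."""
    return unquote_to_bytes(text.replace("$", "%"))


if __name__ == "__main__":
    raw = b"100% $HOME"
    encoded = encode_dollar(raw)
    print(encoded)
    print(decode_dollar(encoded) == raw)
```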


