RE: Roundtripping in Unicode

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Dec 16 2004 - 05:05:47 CST


    Marcin 'Qrczak' Kowalczyk wrote:
    > Yes, IMHO all general-purpose languages should support processing
    > arrays of bytes, in addition to Unicode strings.

    C is likely to retain the behavior of the str functions. That, however,
    puts a lot of burden on developers to identify all opaque strings and
    handle them with those functions throughout the application (or, even
    worse, throughout a suite of applications not necessarily written by the
    same company).

    Newer languages are probably often designed with an assumption that all you
    need is a good class for Unicode strings. Instead of making them change that
    assumption, we could consider finding a way to make that true.

    If a solution that doesn't break anything in Unicode cannot be found, then
    consider a solution that does break something, but check what the broken
    part really affects. For example, we assume it MUST be possible to
    represent a valid Unicode string in any UTF stream and get it back. Suppose
    you find a solution that retains that capability for all Unicode codepoints
    except 128 of them. If you know that those will ONLY be used for a
    particular purpose, you might be willing to accept that those who use those
    codepoints will deal with the problem, and that for those who don't, the
    rules didn't really change. What I am saying is that we need to preserve
    the intention of the existing rules, not the rules themselves.

    But again, this would apply only if I were proposing that everybody start
    using my conversion everywhere, which at this point I am not.
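
    To make the trade-off above concrete, here is a rough sketch. It is NOT
    the conversion discussed in this thread; it uses Python's built-in
    "surrogateescape" error handler, which happens to reserve 128 code points
    (U+DC80..U+DCFF) for bytes that are not valid UTF-8. The helper names
    decode_opaque and encode_opaque are made up for illustration.

```python
# Illustration only: Python's "surrogateescape" error handler, not the
# conversion discussed in this thread. It maps each undecodable byte
# 0x80..0xFF to one of 128 reserved code points U+DC80..U+DCFF.

def decode_opaque(raw: bytes) -> str:
    """Decode as UTF-8, carrying invalid bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")

def encode_opaque(text: str) -> bytes:
    """Inverse of decode_opaque: escaped code points become the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")

valid = "na\u00efve.txt".encode("utf-8")   # well-formed UTF-8
broken = b"caf\xe9.txt"                     # 0xE9 is invalid as UTF-8 here

roundtrip_valid = encode_opaque(decode_opaque(valid))
roundtrip_broken = encode_opaque(decode_opaque(broken))
escaped = decode_opaque(broken)[3]          # code point standing in for 0xE9

# The cost: a string holding one of the 128 reserved code points can no
# longer be written out as strict UTF-8.
try:
    decode_opaque(broken).encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

    Both byte strings survive the round trip unchanged. Only strings that
    contain one of the reserved code points lose the guarantee that they can
    be written to a strict UTF stream, which is the kind of limited breakage
    described above.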

    >
    > It's not clear however how the API of filenames should look like,
    > especially if they wish to be portable to Windows.

    I intend to bring up the issue in the near future, and will try to let
    everyone catch their breath before that.

    > or delimit the filename with "\0", or prefix it with
    > the length, or something like this.

    I don't see why that would be necessary or useful.

    > A backup software should do this
    > and not pay attention to the locale. But for end-user software like
    > an image viewer, processing arbitrary filenames is less important.

    You have to pay attention to the locale eventually. You need to report which
    file failed to be backed up (or is infected with a virus). And you should be
    able to let the user restore a single file. If you don't interpret the
    filename according to the locale (possibly UTF-8), the user won't know how
    to select what she wants. It is even worse if one wants to enter the
    filename manually. All this CAN be done within the application, but it is
    very cumbersome. It gets worse if you want to pass some information to
    another program, since the other application may not have an interface that
    accepts opaque strings. If it does, the convention may differ. This is why I
    am saying that something should be standardized. Of course, standardizing a
    poor solution is not a good idea. We should do our best to find a good one.
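
    As a small sketch of why an application-private convention is cumbersome,
    assume a backup tool that reports undecodable bytes as \xNN escapes. The
    helper display_name and the sample filenames are hypothetical; only the
    "backslashreplace" error handler is an existing Python feature.

```python
# Hypothetical helper, for illustration only: a private convention for
# showing filenames that are not valid in the locale's encoding.

def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode for display; undecodable bytes are shown as \\xNN escapes."""
    return raw.decode(encoding, errors="backslashreplace")

broken = b"caf\xe9.txt"        # byte 0xE9 is not valid UTF-8 here
literal = b"caf\\xe9.txt"      # a valid name that contains a real backslash

shown_broken = display_name(broken)
shown_literal = display_name(literal)

# Both files are reported under the same text, so a name typed back by the
# user cannot be mapped to a single file without extra bookkeeping.
ambiguous = shown_broken == shown_literal
```

    Two different files end up with the same displayed name, so the
    application has to keep its own mapping back to the raw bytes, and any
    other program receiving the displayed text would need to know the same
    convention.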

    > Technically they are binary (command line arguments must not contain
    > zero bytes). Users are expecting stdin and stdout to be treated as
    > text or binary depending on the program, while command line arguments
    > are generally interpreted as text or filenames.

    So, an application outputting filenames has a binary stdout and no text
    application is guaranteed to process this output.
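
    A minimal sketch of that consequence, assuming POSIX-style filenames that
    may contain any byte except NUL and "/". The functions write_names and
    read_names are hypothetical; they show a zero-delimited stream of the kind
    mentioned earlier in the quoted text, next to a line-oriented reading of
    the same names.

```python
from typing import List

def write_names(names: List[bytes]) -> bytes:
    """Serialize filenames as a binary stream, each terminated by NUL."""
    return b"".join(name + b"\0" for name in names)

def read_names(stream: bytes) -> List[bytes]:
    """Parse a NUL-terminated stream back into the original filenames."""
    return stream.split(b"\0")[:-1]   # drop the empty piece after the last NUL

names = [b"a\nb.txt", b"caf\xe9.txt"]  # a newline and a non-UTF-8 byte

# A text tool that assumes one filename per line sees three entries.
line_split = b"\n".join(names).split(b"\n")

# The binary, NUL-delimited form gets the two names back exactly.
restored = read_names(write_names(names))
```

    The zero-delimited form is lossless, but it is a binary stream: a
    consumer that treats its input as text lines, or as text in the locale's
    encoding, has no guarantee of handling it.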

    Lars


