Re: Roundtripping in Unicode

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Dec 14 2004 - 07:07:16 CST

    On 14/12/2004 11:32, Arcane Jill wrote:

    > I've been following this thread for a while, and I've pretty much got
    > the hang of the issues here. To summarize:

    I haven't followed everything, but here is my 2 cents worth.

    I note that there is a real problem. I have had significant problems in
    Windows with files copied from other language systems. For example, such
    files are sometimes listed fine in Explorer, but when I try to copy or
    delete them they are not found, presumably because the filename is being
    corrupted somewhere in the system and no longer matches.

    >
    > Unix filenames consist of an arbitrary sequence of octets, excluding
    > 0x00 and 0x2F. How they are /displayed/ to any given user depends on
    > that user's locale setting. In this scenario, two users with different
    > locale settings will see different filenames for the same file, but
    > they will still be able to access the file via the filename that they
    > see. These two filenames will be spelt identically in terms of octets,
    > but (apparently) differently when viewed in terms of characters.
    >
    > At least, that's how it was until the UTF-8 locale came along. If we
    > consider only one-byte-per-character encodings, then any octet
    > sequence is "valid" in any locale. But UTF-8 introduces the
    > possibility that an octet sequence might be "invalid" - a new concept
    > for Unix. So if you change your locale to UTF-8, then suddenly, some
    > files created by other users might appear to you to have invalid
    > filenames (though they would still appear valid when viewed by the
    > file's creator).
    >
    This is not in fact a new concept. Some octet sequences which are valid
    filenames are invalid in a Latin-1 locale - for example, those which
    include octets in the range 0x80-0x9F, if "Latin-1" means ISO 8859-1.
    Some of these octets are of course defined in Windows CP1252 etc., so a
    Unix Latin-1 system may have some interpretation for some of them; but
    others, e.g. 0x81, have no interpretation in any flavour of Latin-1 as
    far as I know. So there is by no means a guarantee that every non-Unicode
    Unix locale has an interpretation for every octet; octets it cannot
    interpret are, in effect, invalid in that locale.
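
    To illustrate with a small sketch of my own (Python codec names used
    purely as stand-ins for locales; this is not from the thread): the
    octet 0x81 maps only to an unassigned C1 control position under ISO
    8859-1, is rejected outright as CP1252, and is not valid UTF-8 either.

        name = b"file\x81name"

        # ISO 8859-1 maps every octet to a code point, but 0x81 lands on
        # an undefined C1 control position.
        print(name.decode("latin-1"))

        # CP1252 leaves 0x81 unassigned, so the codec refuses it.
        try:
            name.decode("cp1252")
        except UnicodeDecodeError as e:
            print("cp1252:", e)

        # It is not a valid UTF-8 sequence either.
        try:
            name.decode("utf-8")
        except UnicodeDecodeError as e:
            print("utf-8:", e)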

    Now no doubt many Unix filename handling utilities ignore the fact that
    some octets are invalid or uninterpretable in the locale, because they
    handle filenames as octet strings (with 0x00 and 0x2F having special
    interpretations) rather than as locale-dependent character strings. But
    these routines should continue to work in a UTF-8 locale, as they make
    no attempt to interpret any octets other than 0x00 and 0x2F.
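
    As a sketch of what such octet-level handling looks like (my own
    hypothetical example): only 0x00 and 0x2F are given any meaning, so
    the same code works whatever locale, if any, the remaining octets
    belong to.

        def components(path):
            # Split only on the 0x2F ('/') octet; every other octet,
            # valid UTF-8 or not, is passed through uninterpreted.
            return [part for part in path.split(b"/") if part]

        print(components(b"/home/user/file\x81name.txt"))
        # [b'home', b'user', b'file\x81name.txt']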

    > A specific example: if a file F is accessed by two different users, A
    > and B, of whom A has set their locale to Latin-1, and B has set their
    > locale to UTF-8, then the filename may appear to be valid to user A,
    > but invalid to user B.
    >
    > Lars is saying (and he's probably right, because he knows more about
    > Unix than I) that user B does not necessarily have the right to change
    > the actual octet sequence which is the filename of F, just to make it
    > appear valid to user B, because doing so would stop a lot of things
    > working for user A (for instance, A might have created the file, the
    > filename might be hardcoded in a script, etc.). So Lars takes a
    > Unix-like approach, saying "retain the actual octet sequence, but feel
    > free to try to display and manipulate it as if it were some UTF-8-like
    > encoding in which all octet sequences are valid". And all this seems
    > to work fine for him, until he tries to roundtrip to UTF-16 and back.

    I think the problem here is that a Unix filename is a string of octets,
    not of characters. And so it should not be converted into another
    encoding form as if it were a string of characters; it should be
    processed at a quite different level of interpretation.
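
    As an illustration of what goes wrong when the octets are pushed
    through a character-level conversion anyway (a sketch of mine; the
    octets below stand for a Latin-1 filename that is not valid UTF-8):

        name = b"caf\xe9.txt"   # 'café.txt' as Latin-1 octets

        try:
            # Treat the octets as characters: decode as UTF-8, then
            # re-encode (e.g. after a trip through UTF-16).
            roundtripped = name.decode("utf-8").encode("utf-8")
        except UnicodeDecodeError:
            roundtripped = None   # the conversion never gets started

        # Substituting U+FFFD lets the conversion "succeed", but the
        # original octet 0xE9 is gone for good -- the roundtrip is lossy.
        lossy = name.decode("utf-8", errors="replace").encode("utf-8")
        print(roundtripped, lossy)   # None b'caf\xef\xbf\xbd.txt'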

    Of course a system is free to do what it wants internally.

    >
    > I'm not sure why anyone's arguing about this though - Phillipe's
    > suggestion seems to be the perfect solution which keeps everyone
    > happy. So...
    >
    > ...allow me to construct a specific example of what Phillipe suggested
    > only generally:
    >
    > ...
    >
    > This would appear to solve Lars' problem, and because the three
    > encodings, NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be
    > UTFs, no-one need get upset.
    >
    All of this is ingenious, and may be useful for internal processing
    within a Unix system, and perhaps even for interaction between
    cooperating systems. But NOT-Unicode is not Unicode (!) and so Unicode
    should not be expected to standardise it.

    I can see that there may be a need for a protocol for open exchange of
    Unix-like filenames. But these filenames should be treated as binary
    data (which may or may not be interpretable in any one locale) and
    encoded as such, rather than forced into a mould of Unicode characters
    which they do not fit.
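
    For instance (a hedged sketch, with base64 chosen purely for
    illustration), the octets could be carried opaquely and recovered
    exactly, leaving any locale interpretation to the receiving end:

        import base64

        name = b"file\x81name.txt"             # arbitrary octets
        wire = base64.b64encode(name)          # transport-safe form
        assert base64.b64decode(wire) == name  # exact octets recovered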

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

