Re: Roundtripping in Unicode

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Dec 14 2004 - 07:07:16 CST

    On 14/12/2004 11:32, Arcane Jill wrote:

    > I've been following this thread for a while, and I've pretty much got
    > the hang of the issues here. To summarize:

    I haven't followed everything, but here is my 2 cents worth.

    I note that there is a real problem. I have had significant problems in
    Windows with files copied from other language systems. For example, such
    files are sometimes listed fine in Explorer, but when I try to copy or
    delete them they are not found, presumably because the filename is being
    corrupted somewhere in the system and no longer matches.

    >
    > Unix filenames consist of an arbitrary sequence of octets, excluding
    > 0x00 and 0x2F. How they are /displayed/ to any given user depends on
    > that user's locale setting. In this scenario, two users with different
    > locale settings will see different filenames for the same file, but
    > they will still be able to access the file via the filename that they
    > see. These two filenames will be spelt identically in terms of octets,
    > but (apparently) differently when viewed in terms of characters.
    >
    > At least, that's how it was until the UTF-8 locale came along. If we
    > consider only one-byte-per-character encodings, then any octet
    > sequence is "valid" in any locale. But UTF-8 introduces the
    > possibility that an octet sequence might be "invalid" - a new concept
    > for Unix. So if you change your locale to UTF-8, then suddenly, some
    > files created by other users might appear to you to have invalid
    > filenames (though they would still appear valid when viewed by the
    > file's creator).
    >
    This is not in fact a new concept. Some octet sequences which are valid
    filenames are invalid in a Latin-1 locale - for example, those which
    include octets in the range 0x80-0x9F, if "Latin-1" means ISO 8859-1.
    Some of these octets are of course defined in Windows CP1252 etc., so a
    Unix Latin-1 system may have some interpretation for some of them; but
    others, e.g. 0x81, have no interpretation in any flavour of Latin-1 as
    far as I know. So there is by no means a guarantee that every non-Unicode
    Unix locale has an interpretation for every octet; octets it cannot
    interpret are, in effect, invalid in that locale.
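
    To illustrate with a small sketch of my own (Python codec names used
    purely as stand-ins for locales; this is not from the thread): the
    octet 0x81 maps only to an unassigned C1 control position under ISO
    8859-1, is rejected outright as CP1252, and is not valid UTF-8 either.

        name = b"file\x81name"

        # ISO 8859-1 maps every octet to a code point, but 0x81 lands on
        # an undefined C1 control position.
        print(name.decode("latin-1"))

        # CP1252 leaves 0x81 unassigned, so the codec refuses it.
        try:
            name.decode("cp1252")
        except UnicodeDecodeError as e:
            print("cp1252:", e)

        # It is not a valid UTF-8 sequence either.
        try:
            name.decode("utf-8")
        except UnicodeDecodeError as e:
            print("utf-8:", e)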

    Now no doubt many Unix filename handling utilities ignore the fact that
    some octets are invalid or uninterpretable in the locale, because they
    handle filenames as octet strings (with 0x00 and 0x2F having special
    interpretations) rather than as locale-dependent character strings. But
    these routines should continue to work in a UTF-8 locale, as they make
    no attempt to interpret any octets other than 0x00 and 0x2F.
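
    As a sketch of what such octet-level handling looks like (my own
    hypothetical example): only 0x00 and 0x2F are given any meaning, so
    the same code works whatever locale, if any, the remaining octets
    belong to.

        def components(path):
            # Split only on the 0x2F ('/') octet; every other octet,
            # valid UTF-8 or not, is passed through uninterpreted.
            return [part for part in path.split(b"/") if part]

        print(components(b"/home/user/file\x81name.txt"))
        # [b'home', b'user', b'file\x81name.txt']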

    > A specific example: if a file F is accessed by two different users, A
    > and B, of whom A has set their locale to Latin-1, and B has set their
    > locale to UTF-8, then the filename may appear to be valid to user A,
    > but invalid to user B.
    >
    > Lars is saying (and he's probably right, because he knows more about
    > Unix than I) that user B does not necessarily have the right to change
    > the actual octet sequence which is the filename of F, just to make it
    > appear valid to user B, because doing so would stop a lot of things
    > working for user A (for instance, A might have created the file, the
    > filename might be hardcoded in a script, etc.). So Lars takes a
    > Unix-like approach, saying "retain the actual octet sequence, but feel
    > free to try to display and manipulate it as if it were some UTF-8-like
    > encoding in which all octet sequences are valid". And all this seems
    > to work fine for him, until he tries to roundtrip to UTF-16 and back.

    I think the problem here is that a Unix filename is a string of octets,
    not of characters. And so it should not be converted into another
    encoding form as if it were a string of characters; it should be
    processed at a quite different level of interpretation.
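
    As an illustration of what goes wrong when the octets are pushed
    through a character-level conversion anyway (a sketch of mine; the
    octets below stand for a Latin-1 filename that is not valid UTF-8):

        name = b"caf\xe9.txt"   # 'café.txt' as Latin-1 octets

        try:
            # Treat the octets as characters: decode as UTF-8, then
            # re-encode (e.g. after a trip through UTF-16).
            roundtripped = name.decode("utf-8").encode("utf-8")
        except UnicodeDecodeError:
            roundtripped = None   # the conversion never gets started

        # Substituting U+FFFD lets the conversion "succeed", but the
        # original octet 0xE9 is gone for good -- the roundtrip is lossy.
        lossy = name.decode("utf-8", errors="replace").encode("utf-8")
        print(roundtripped, lossy)   # None b'caf\xef\xbf\xbd.txt'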

    Of course a system is free to do what it wants internally.

    >
    > I'm not sure why anyone's arguing about this though - Phillipe's
    > suggestion seems to be the perfect solution which keeps everyone
    > happy. So...
    >
    > ...allow me to construct a specific example of what Phillipe suggested
    > only generally:
    >
    > ...
    >
    > This would appear to solve Lars' problem, and because the three
    > encodings, NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be
    > UTFs, no-one need get upset.
    >
    All of this is ingenious, and may be useful for internal processing
    within a Unix system, and perhaps even for interaction between
    cooperating systems. But NOT-Unicode is not Unicode (!) and so Unicode
    should not be expected to standardise it.

    I can see that there may be a need for a protocol for open exchange of
    Unix-like filenames. But these filenames should be treated as binary
    data (which may or may not be interpretable in any one locale) and
    encoded as such, rather than forced into a mould of Unicode characters
    which they do not fit.
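
    For instance (a hedged sketch, with base64 chosen purely for
    illustration), the octets could be carried opaquely and recovered
    exactly, leaving any locale interpretation to the receiving end:

        import base64

        name = b"file\x81name.txt"             # arbitrary octets
        wire = base64.b64encode(name)          # transport-safe form
        assert base64.b64decode(wire) == name  # exact octets recovered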

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

