From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Dec 15 2004 - 10:24:13 CST
Marcin 'Qrczak' Kowalczyk wrote:
> If one application switches from standard UTF-8 to your modification,
> and another application continues to use standard UTF-8, then the
> ability to pass arbitrary Unicode strings between them by serializing
> them to UTF-8 is lost. So you can't claim that does not affect
> programs which don't adopt it. It would have to be adopted by all
> programs which currently use UTF-8, or data exchange would break.
I don't think so. If I produce UTF-8 data from filenames and give it to a
UTF-8 application, nothing can be lost in the portion of this architecture
that deals with Unicode data. Now, if you expect that you can give me
Unicode data and I should store it in a filesystem (as a filename), then
you're in error. It is definitely true that you can create a sequence of
valid Unicode characters from my range and I will not be able to give it
back. But I will also have to reject any '/' characters you feed me. You are
misusing my application.
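To make this concrete, here is a rough sketch of the conversion in Python. The escape range U+EE80..U+EEFF and the function names are purely illustrative assumptions; any 128 otherwise-unused codepoints would do:

```python
# Illustrative sketch only: the 128 escape codepoints are assumed to be
# U+EE80..U+EEFF here; the exact range is not fixed by this post.
ESC_BASE = 0xEE80  # escape for byte 0x80 is U+EE80, ..., 0xFF is U+EEFF

def bytes_to_unicode(name: bytes) -> str:
    """Decode a filename, escaping bytes that are not valid UTF-8."""
    out = []
    i = 0
    while i < len(name):
        decoded = None
        # Try to decode one UTF-8 sequence (1 to 4 bytes) at position i.
        for j in range(i + 1, min(len(name), i + 4) + 1):
            try:
                decoded = (name[i:j].decode('utf-8'), j)
                break
            except UnicodeDecodeError:
                continue
        if decoded is not None:
            out.append(decoded[0])
            i = decoded[1]
        else:
            # Invalid byte: map 0x80..0xFF onto the 128 escape codepoints.
            out.append(chr(ESC_BASE + (name[i] - 0x80)))
            i += 1
    return ''.join(out)

def unicode_to_bytes(text: str) -> bytes:
    """Reverse conversion: turn escape codepoints back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_BASE <= cp <= ESC_BASE + 0x7F:
            out.append(0x80 + (cp - ESC_BASE))
        else:
            out.extend(ch.encode('utf-8'))
    return bytes(out)

raw = b'caf\xe9.txt'                   # Latin-1 bytes, not valid UTF-8
name = bytes_to_unicode(raw)           # 'caf' + one escape char + '.txt'
assert unicode_to_bytes(name) == raw   # UX => Win => UX roundtrip holds
```

This shows exactly the asymmetry above: any byte string survives the UX => Win => UX roundtrip, but a Unicode string that already contains one of the 128 escape codepoints collides with an escaped byte, which is Marcin's non-injectivity point.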
If some application chooses to use my conversion and loses or
misinterprets your data, then it is broken and shouldn't use that
conversion, or should not declare that particular interface as a Unicode
interface.
> But it's not a viable replacement of UTF-8. Even if both applications
> use your modification, the ability to serialize arbitrary sequences
> of valid code points (i.e. not surrogates) through UTF-8 is lost: the
> mapping to modified UTF-8 is not injective.
Yes, that is true. But there are people who would be willing to accept that,
since it only happens if those 128 codepoints are used. Those who can accept
it can use the conversion; others needn't.
OK, there is one problem that I *do* see with the use of my conversion. I
map a file from UX to Win. You then use not my application but another one,
which copies the file back from Win to UX (and that is easier, so you *can*
use this application). Now the invalid sequence is already escaped. If I map
this new file to Win again, I need to escape the escape. The escapes can
start piling up.
Of course, once you realize the problem, you can simply rename the file: you
can undo the over-escaping (no data is ever lost!) and probably rename the
file to valid UTF-8, which is what you want anyway. And you can do it even
from the Windows system. If you reject my solution, you will not have my
program in the first place, meaning you will need to go to the UNIX system
to rename the file, and even just to access it in the first place.
Actually, there are two subflavors of my conversion possible (I can hear you
say "oh, noooo"). One escapes the escapes, the other doesn't. The second
flavor can be used by applications that need to make UTF-8 from arbitrary
input but do not need to re-create the original byte sequence. Basically,
they preserve all the data except for the information about how many times
the original invalid sequences were escaped. There may be a need for such
applications, and they would in fact reduce the re-escaping problem.
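To illustrate the difference between the two flavors, here is a sketch (again with the purely illustrative U+EE80..U+EEFF escape range and made-up names). The strict flavor escapes, byte by byte, any valid UTF-8 spelling of an escape codepoint it finds in the input, so the reverse conversion stays injective but escapes pile up; the lax flavor decodes such a spelling as-is, so nothing piles up but the escape depth is forgotten:

```python
# Illustrative sketch: the escape range U+EE80..U+EEFF is an assumption.
ESC_BASE = 0xEE80

def decode_one(name: bytes, i: int):
    """Decode one UTF-8 sequence starting at i: (char, next_i) or None."""
    for j in range(1, 5):
        try:
            s = name[i:i + j].decode('utf-8')
            if s:
                return s, i + j
        except UnicodeDecodeError:
            continue
    return None

def escape(name: bytes, strict: bool) -> str:
    out = []
    i = 0
    while i < len(name):
        dec = decode_one(name, i)
        if dec is None:
            # Invalid byte: use one of the 128 escape codepoints.
            out.append(chr(ESC_BASE + (name[i] - 0x80)))
            i += 1
            continue
        ch, nxt = dec
        if strict and ESC_BASE <= ord(ch) <= ESC_BASE + 0x7F:
            # The input spells an escape codepoint in valid UTF-8: escape
            # each of its bytes, keeping the reverse conversion injective.
            out.extend(chr(ESC_BASE + (b - 0x80)) for b in name[i:nxt])
        else:
            out.append(ch)
        i = nxt
    return ''.join(out)

raw = b'\xe9'                        # one invalid byte
once = escape(raw, strict=True)      # one escape char
# Another, plain-UTF-8 tool copies the name back to UNIX verbatim:
on_disk = once.encode('utf-8')       # three bytes
twice = escape(on_disk, strict=True)
assert len(once) == 1 and len(twice) == 3   # escapes pile up
lax = escape(on_disk, strict=False)
assert lax == once                   # no piling, but the depth is lost
```

Note that the strict flavor loses nothing: the over-escaping can always be undone later. The lax flavor only forgets how many times the escaping was applied.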
> Which means that UTF-8 can't be replaced with your modification.
> If they coexisted, expect trouble when the two slightly incompatible
> encodings meet.
Or, expect trouble when dealing with data that is not guaranteed to be
UTF-8. Or hope that there will be no such data in the near future, and I
mean none.
> > Using my conversion, Windows can access any file on UNIX, because my
> > conversion guarantees roundtrip UX=>Win=>UX
>
> Well, with or without your conversion it's not true, because there
> are various characters which are valid in Unix filenames but not in
> Windows (e.g. ? * : \ and control characters). So if all filenames are
> to be accessible, they have to introduce some escaping. And as soon
> as an escaping scheme is used, it can be extended to encode isolated
> bytes with high bit set.
Good point. But you are assuming I copy the files to a Windows filesystem. I
don't. I have no problem if you specify your filename with any of the above
characters, even from Windows.
And, BTW, suppose UTF-8 validation is introduced (as an option) on UNIX
filesystems. The characters you mention (and some others; I can tell you
exactly which don't work on Windows) could again be (optionally) rejected on
UNIX filesystems.
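Such an optional check could be sketched like this (the reserved set below is the one standard Windows file APIs reject; names ending in a dot or space, and device names like CON or NUL, have extra rules I leave out):

```python
# Sketch of an optional UNIX-side check rejecting names that standard
# Windows file APIs cannot express: the nine reserved punctuation
# characters plus the control characters 0x00..0x1F.
WINDOWS_RESERVED = set('<>:"/\\|?*') | {chr(c) for c in range(0x20)}

def acceptable_on_windows(name: str) -> bool:
    """True if no character of `name` is reserved on Windows."""
    return all(ch not in WINDOWS_RESERVED for ch in name)

assert not acceptable_on_windows('what?.txt')   # '?' is reserved
assert acceptable_on_windows('plain_name.txt')
```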
> > Win=>UX=>Win roundtrip is not guaranteed.
>
> Currently it breaks only for isolated surrogates (assuming the Unix
> is configured to use UTF-8). If Windows filenames are specified to be
> UTF-16, the error is clearly on the Windows side and this side should
> be fixed.
And in my case, it would break for some malicious sequences of the 128
codepoints. Equally rare, and with equally minor consequences. Ummmm, and it
can be fixed, too: such malicious sequences could be forbidden in contexts
where we fear they might cause problems.
Lars
This archive was generated by hypermail 2.1.5 : Wed Dec 15 2004 - 10:32:12 CST