From: Marcin 'Qrczak' Kowalczyk (email@example.com)
Date: Wed Dec 15 2004 - 07:33:34 CST
Lars Kristan <firstname.lastname@example.org> writes:
> Now, it is true that data from two applications using this technique can
> become intermixed. But this is not something we should fear. On the
> contrary, this is why I do want to standardize the approach. Because in most
> cases what will happen is exactly what one expects. If each of the two
> applications chose an arbitrary escaping technique to solve the problem,
> then you get a bigger mess.
If one application switches from standard UTF-8 to your modification,
and another application continues to use standard UTF-8, then the
ability to pass arbitrary Unicode strings between them by serializing
them to UTF-8 is lost. So you can't claim that it does not affect
programs which don't adopt it. It would have to be adopted by all
programs which currently use UTF-8, or data exchange would break.
But it's not a viable replacement for UTF-8. Even if both applications
use your modification, the ability to serialize arbitrary sequences
of valid code points (i.e. not surrogates) through UTF-8 is lost: the
mapping to modified UTF-8 is not injective.
Which means that UTF-8 can't be replaced with your modification.
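
The loss of injectivity can be illustrated with a small sketch. It assumes
the modification maps each undecodable byte 0x80..0xFF to a reserved code
point and writes that code point back as the raw byte. The escape range
U+EE80..U+EEFF and the names ESC_BASE, decode_modified and encode_modified
are hypothetical, chosen only for this illustration; they are not taken
from the proposal itself.

```python
# Hypothetical escape range: U+EE80..U+EEFF stand for raw bytes 0x80..0xFF.
ESC_BASE = 0xEE00


def decode_modified(data: bytes) -> str:
    """Decode UTF-8, turning each undecodable byte into an escape code point."""
    # Python's surrogateescape handler marks bad bytes as U+DC80..U+DCFF;
    # remap those markers into the hypothetical escape range.
    text = data.decode("utf-8", "surrogateescape")
    return "".join(
        chr(ESC_BASE + ord(c) - 0xDC00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


def encode_modified(text: str) -> bytes:
    """Encode to UTF-8, writing escape code points back as single raw bytes."""
    out = bytearray()
    for c in text:
        cp = ord(c)
        if ESC_BASE + 0x80 <= cp <= ESC_BASE + 0xFF:
            out.append(cp - ESC_BASE)
        else:
            out += c.encode("utf-8")
    return bytes(out)


# Two different sequences of valid code points...
a = "\u00e9"                                        # U+00E9
b = chr(ESC_BASE + 0xC3) + chr(ESC_BASE + 0xA9)     # escapes for bytes C3 A9

# ...serialize to the same bytes, so the receiver cannot tell them apart.
assert a != b
assert encode_modified(a) == encode_modified(b) == b"\xc3\xa9"
assert decode_modified(encode_modified(b)) == a
```

Under these assumptions the bytes => string => bytes direction still
roundtrips, but string => bytes => string does not, which is the
direction that matters when UTF-8 is used to carry Unicode strings.
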
If they coexisted, expect trouble when the two slightly incompatible
The GNU implementation of Java treats filenames as Java-modified
UTF-8. This is broken in two ways. First, it's not usable in an
environment where filenames use e.g. ISO-8859-x. Next, it's not
correct even in a purely UTF-8 environment, because it encodes
characters above U+FFFF differently - it uses a *non-standard*
modification of UTF-8. Balkanization of UTF-8 is bad.
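
As a rough illustration of that difference, the sketch below encodes one
character above U+FFFF both ways. The function java_modified_utf8 is a
hypothetical helper written for this example, not part of any library; it
follows the documented Java convention of encoding each UTF-16 surrogate
separately as three bytes and U+0000 as C0 80.

```python
def java_modified_utf8(text: str) -> bytes:
    """Encode text the way Java's modified UTF-8 does (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 is written as an overlong two-byte form.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as its own three-byte sequence.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010400"
standard = ch.encode("utf-8")         # four bytes: F0 90 90 80
modified = java_modified_utf8(ch)     # six bytes:  ED A0 81 ED B0 80

assert standard == b"\xf0\x90\x90\x80"
assert modified == b"\xed\xa0\x81\xed\xb0\x80"

# A strict UTF-8 decoder rejects the modified form outright.
try:
    modified.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False
assert not strict_decoder_accepts
```

A filename containing such a character therefore gets a different byte
sequence than the one every other UTF-8 program on the system would use.
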
> Using my conversion, Windows can access any file on UNIX, because my
> conversion guarantees roundtrip UX=>Win=>UX
Well, with or without your conversion it's not true, because there
are various characters which are valid in Unix filenames but not in
Windows (e.g. ? * : \ and control characters). So if all filenames are
to be accessible, they have to introduce some escaping. And as soon
as an escaping scheme is used, it can be extended to encode isolated
bytes with high bit set.
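
A minimal sketch of such an escaping scheme follows. Everything in it is
an assumption made for illustration: the choice of '%' as the escape
character, the %XX hex form, and the names FORBIDDEN, escape_name and
unescape_name. The forbidden set also includes a few characters beyond
the ones listed above that Windows reserves as well.

```python
# Characters not allowed in Windows filenames, plus '%' itself so that
# the escape character stays unambiguous (hypothetical choice).
FORBIDDEN = set('?*:\\"<>|%')


def escape_name(raw: bytes) -> str:
    """Map an arbitrary Unix filename (bytes) to a Windows-safe string."""
    # surrogateescape marks each byte that is not valid UTF-8
    # as a code point in U+DC80..U+DCFF.
    text = raw.decode("utf-8", "surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("%%%02X" % (cp - 0xDC00))   # isolated high byte
        elif ch in FORBIDDEN or cp < 0x20:
            out.append("%%%02X" % cp)              # forbidden or control
        else:
            out.append(ch)
    return "".join(out)


def unescape_name(name: str) -> bytes:
    """Invert escape_name, recovering the original bytes."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == "%":
            out.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            out += name[i].encode("utf-8")
            i += 1
    return bytes(out)


raw = b"a?b\xffc%"
escaped = escape_name(raw)
assert escaped == "a%3Fb%FFc%25"
assert unescape_name(escaped) == raw
```

The same mechanism that handles '?' and '*' also carries the invalid
byte, so no change to UTF-8 itself is needed for the UX=>Win=>UX
roundtrip in this sketch.
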
> Win=>UX=>Win roundtrip is not guaranteed.
Currently it breaks only for isolated surrogates (assuming the Unix
is configured to use UTF-8). If Windows filenames are specified to be
UTF-16, the error is clearly on the Windows side and this side should
-- 
   __("<         Marcin Kowalczyk
   \__/       email@example.com
    ^^     http://qrnik.knm.org.pl/~qrczak/