Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 15 2004 - 07:33:34 CST

Next message: Lars Kristan: "RE: Roundtripping in Unicode"

Previous message: Marcin 'Qrczak' Kowalczyk: "Re: Roundtripping Solved"
In reply to: Lars Kristan: "RE: RE: RE: Roundtripping in Unicode"
Next in thread: Philippe VERDY: "Re: RE: Roundtripping in Unicode"

Lars Kristan <lars.kristan@hermes.si> writes:

> Now, it is true that data from two applications using this technique can
> become intermixed. But this is not something we should fear. On the
> contrary, this is why I do want to standardize the approach. Because in most
> cases what will happen is exactly what one expects. If each of the two
> applications chose an arbitrary escaping technique to solve the problem,
> then you get a bigger mess.

If one application switches from standard UTF-8 to your modification,
and another application continues to use standard UTF-8, then the
ability to pass arbitrary Unicode strings between them by serializing
them to UTF-8 is lost. So you can't claim that it does not affect
programs which don't adopt it. It would have to be adopted by all
programs which currently use UTF-8, or data exchange would break.

But it's not a viable replacement for UTF-8. Even if both applications
use your modification, the ability to serialize arbitrary sequences
of valid code points (i.e. not surrogates) through UTF-8 is lost: the
mapping to modified UTF-8 is not injective.

Which means that UTF-8 can't be replaced with your modification.
If they coexisted, expect trouble when the two slightly incompatible
encodings meet.
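
A minimal Python sketch of the two problems described above. It assumes
the modification reserves one code point per byte 0x80..0xFF and writes
such a code point back out as the raw byte. ESCAPE_BASE and
encode_modified are hypothetical names, and the private-use range is
chosen only for illustration; it is not taken from the proposal.

```python
# Hypothetical: code point ESCAPE_BASE + b stands for the raw byte b (0x80..0xFF).
ESCAPE_BASE = 0xEE00


def encode_modified(text: str) -> bytes:
    """Serialize like UTF-8, except escape code points become single raw bytes."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - ESCAPE_BASE
        if 0x80 <= offset <= 0xFF:
            out.append(offset)          # escape code point -> one raw byte
        else:
            out += ch.encode("utf-8")   # everything else -> standard UTF-8
    return bytes(out)


# Two different strings of valid, non-surrogate code points...
composed = "\u00e9"                                           # U+00E9
escaped = chr(ESCAPE_BASE + 0xC3) + chr(ESCAPE_BASE + 0xA9)   # escapes for C3, A9
assert composed != escaped
# ...serialize to the same bytes, so the mapping is not injective.
assert encode_modified(composed) == encode_modified(escaped) == b"\xc3\xa9"

# A receiver using standard UTF-8 rejects a lone escaped byte.
lone = encode_modified("a" + chr(ESCAPE_BASE + 0xC3))
try:
    lone.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

Whichever string the sender had, the receiver can recover only one of
them from the bytes C3 A9.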

The GNU implementation of Java treats filenames as Java-modified
UTF-8. This is broken in two ways. First, it's not usable in an
environment where filenames use e.g. ISO-8859-x. Second, it's not
correct even in a purely UTF-8 environment, because it encodes
characters above U+FFFF differently - it uses a *non-standard*
modification of UTF-8. Balkanization of UTF-8 is bad.
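
To make the difference concrete, here is a rough Python sketch of how
Java's modified UTF-8 treats a character above U+FFFF: it is split into
a surrogate pair and each surrogate is written as a three-byte sequence,
where standard UTF-8 writes one four-byte sequence. The function name
java_modified_utf8 is made up for this example, and the sketch is not
meant to reproduce any particular Java implementation.

```python
def java_modified_utf8(text: str) -> bytes:
    """Illustrative encoder for the Java-style modification of UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"          # U+0000 is written as two bytes
        elif cp > 0xFFFF:
            v = cp - 0x10000
            # Split into a UTF-16 surrogate pair, three bytes per surrogate.
            for unit in (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010400"
standard = ch.encode("utf-8")           # F0 90 90 80
modified = java_modified_utf8(ch)       # ED A0 81 ED B0 80

# A standard UTF-8 decoder does not accept the modified form.
try:
    modified.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
```

So a filename containing such a character gets different bytes on disk
depending on which encoder wrote it.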

> Using my conversion, Windows can access any file on UNIX, because my
> conversion guarantees roundtrip UX=>Win=>UX

Well, with or without your conversion it's not true, because there
are various characters which are valid in Unix filenames but not in
Windows (e.g. ? * : \ and control characters). So if all filenames are
to be accessible, they have to introduce some escaping. And as soon
as an escaping scheme is used, it can be extended to encode isolated
bytes with high bit set.
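
As an illustration only, one such escaping scheme could look like the
Python sketch below. The "%XX" syntax, the FORBIDDEN set and both
function names are assumptions made for this example; no existing tool
is claimed to work this way, and other Windows restrictions (reserved
names, trailing dots and spaces) are ignored.

```python
# Hypothetical subset of characters not allowed in Windows filenames.
FORBIDDEN = set('?*:\\"<>|')
ESC = "%"


def unix_to_windows(name: bytes) -> str:
    """Map arbitrary Unix filename bytes to a string safe on Windows."""
    # surrogateescape turns each undecodable byte b into U+DC00 + b.
    text = name.decode("utf-8", "surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(f"%{cp - 0xDC00:02X}")   # isolated high-bit byte
        elif ch in FORBIDDEN or ch == ESC or cp < 0x20:
            out.append(f"%{cp:02X}")            # forbidden ASCII character
        else:
            out.append(ch)
    return "".join(out)


def windows_to_unix(name: str) -> bytes:
    """Inverse of unix_to_windows for names it produced."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == ESC:
            out.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            out += name[i].encode("utf-8")
            i += 1
    return bytes(out)


raw = b"a?b\xffc%d:e"
win = unix_to_windows(raw)
```

The same escape that handles "?" and ":" also carries the stray byte
0xFF, so no change to UTF-8 itself is needed for the UX=>Win=>UX
roundtrip.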

> Win=>UX=>Win roundtrip is not guaranteed.

Currently it breaks only for isolated surrogates (assuming the Unix
system is configured to use UTF-8). If Windows filenames are specified to be
UTF-16, the error is clearly on the Windows side and this side should
be fixed.
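
A small Python sketch of that failure case, assuming a Windows filename
is modelled as a list of 16-bit code units and the Unix side stores
standard UTF-8. The helper name win_ux_win_roundtrips is hypothetical.

```python
import struct


def win_ux_win_roundtrips(units: list) -> bool:
    """Check whether 16-bit code units survive conversion to UTF-8 and back."""
    raw16 = struct.pack("<%dH" % len(units), *units)
    # surrogatepass lets unpaired surrogates through as lone code points;
    # well-formed pairs are combined into one code point as usual.
    name = raw16.decode("utf-16-le", "surrogatepass")
    try:
        stored = name.encode("utf-8")   # standard UTF-8 on the Unix side
    except UnicodeEncodeError:
        return False                    # isolated surrogate: not encodable
    return stored.decode("utf-8").encode("utf-16-le") == raw16


well_formed = [0x0061, 0xD801, 0xDC00, 0x0062]   # "a", U+10400, "b"
ill_formed = [0x0061, 0xD800, 0x0062]            # "a", isolated surrogate, "b"
```

Only the second name fails, and it is not valid UTF-16 in the first
place.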

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


This archive was generated by hypermail 2.1.5 : Wed Dec 15 2004 - 07:43:10 CST