RE: Roundtripping in Unicode

From: Lars Kristan
Date: Wed Dec 15 2004 - 10:24:13 CST

    Marcin 'Qrczak' Kowalczyk wrote:
    > If one application switches from standard UTF-8 to your modification,
    > and another application continues to use standard UTF-8, then the
    > ability to pass arbitrary Unicode strings between them by serializing
    > them to UTF-8 is lost. So you can't claim that does not affect
    > programs which don't adopt it. It would have to be adopted by all
    > programs which currently use UTF-8, or data exchange would break.

    I don't think so. If I produce UTF-8 data from filenames, and give it to a
    UTF-8 application, nothing can be lost in the portion of this architecture
    that deals with Unicode data. Now, if you expect that you can give me
    Unicode data and I should store it in a filesystem (as a filename), then
    you're in error. It is definitely true that you can create a sequence of
    valid Unicode characters from my range and I will not be able to give it
    back. But I will also have to reject any '/' characters you feed me. You are
    misusing my application.

    If some application chooses to use my conversion and loses or
    misinterprets your data, then it is broken and shouldn't use that
    conversion, or should not declare that particular interface as Unicode.
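
    A minimal Python sketch of this kind of conversion, under stated
    assumptions: the escape range starting at ESC_BASE is a placeholder picked
    for illustration (the message does not name the actual 128 codepoints), and
    the names to_text and to_bytes are hypothetical. The sketch borrows Python's
    existing "surrogateescape" error handler, which is a different,
    surrogate-based mechanism, and only remaps its output.

```python
# Hypothetical escape range: 128 private-use code points, one per byte 0x80..0xFF.
# The message does not say which code points are actually proposed.
ESC_BASE = 0xE000

# Python's "surrogateescape" handler maps an undecodable byte b to U+DC00+b.
# These tables move that result into the hypothetical range and back.
_TO_ESC = {0xDC00 + b: ESC_BASE + b for b in range(0x80, 0x100)}
_FROM_ESC = {ESC_BASE + b: 0xDC00 + b for b in range(0x80, 0x100)}


def to_text(raw: bytes) -> str:
    """Decode a filename: valid UTF-8 decodes normally, stray bytes are escaped."""
    return raw.decode("utf-8", "surrogateescape").translate(_TO_ESC)


def to_bytes(text: str) -> bytes:
    """Encode back: escape code points turn into the original raw bytes."""
    return text.translate(_FROM_ESC).encode("utf-8", "surrogateescape")


legacy_name = b"caf\xe9.txt"             # Latin-1 bytes, not valid UTF-8
text = to_text(legacy_name)              # a string of valid code points
utf8_for_others = text.encode("utf-8")   # plain standard UTF-8 for other programs
restored = to_bytes(text)                # the original filename bytes again
```

    The string handed to other programs is ordinary Unicode, so a standard
    UTF-8 application can carry it without loss; only to_bytes needs to know
    about the escape range.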

    > But it's not a viable replacement of UTF-8. Even if both applications
    > use your modification, the ability to serialize arbitrary sequences
    > of valid code points (i.e. not surrogates) through UTF-8 is lost: the
    > mapping to modified UTF-8 is not injective.
    Yes, that is true. But there are people who would be willing to accept that
    since it only happens if those 128 codepoints are used. Those can use the
    conversion, others needn't.
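
    The non-injectivity can be shown with a small Python sketch. As before,
    ESC_BASE is a placeholder range and to_text / to_bytes are hypothetical
    helpers built on Python's "surrogateescape" handler; they are an
    approximation of the conversion, not its definition.

```python
ESC_BASE = 0xE000  # hypothetical escape range, not taken from the message

_TO_ESC = {0xDC00 + b: ESC_BASE + b for b in range(0x80, 0x100)}
_FROM_ESC = {ESC_BASE + b: 0xDC00 + b for b in range(0x80, 0x100)}


def to_text(raw: bytes) -> str:
    return raw.decode("utf-8", "surrogateescape").translate(_TO_ESC)


def to_bytes(text: str) -> bytes:
    return text.translate(_FROM_ESC).encode("utf-8", "surrogateescape")


e_acute = "\u00e9"                                      # ordinary U+00E9
crafted = chr(ESC_BASE + 0xC3) + chr(ESC_BASE + 0xA9)   # escapes for bytes C3 A9

# Two different code point sequences serialize to the same bytes ...
same_bytes = to_bytes(e_acute) == to_bytes(crafted)
# ... so the crafted sequence cannot be given back after a round trip.
came_back = to_text(to_bytes(crafted))

# An isolated escape code point, by contrast, survives.
lone = chr(ESC_BASE + 0xE9)
lone_survives = to_text(to_bytes(lone)) == lone
```

    Only sequences of escape codepoints whose bytes happen to form valid UTF-8
    collide; strings that never use the 128 codepoints are unaffected.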

    OK, there is one problem that I *do* see with the use of my conversion. I
    map a file from UX to Win. You then use not my application, but another one,
    which copies the file back from Win to UX (and that is easier, so you *can*
    use this application). Now the invalid sequence is already escaped. If I map
    this new file to Win again, I need to escape the escape. They can start
    piling up.
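
    A rough Python sketch of how the escapes could pile up. It assumes that
    "escaping the escape" means treating the UTF-8 bytes of an already encoded
    escape codepoint as if they were invalid bytes; that reading, the ESC_BASE
    range, and the name to_text_strict are all assumptions made for this
    example.

```python
ESC_BASE = 0xE000  # hypothetical escape range
ESC_FIRST, ESC_LAST = ESC_BASE + 0x80, ESC_BASE + 0xFF


def to_text_strict(raw: bytes) -> str:
    """UX => Win flavor that also escapes already-escaped input."""
    out = []
    for ch in raw.decode("utf-8", "surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Undecodable byte: one escape code point.
            out.append(chr(ESC_BASE + cp - 0xDC00))
        elif ESC_FIRST <= cp <= ESC_LAST:
            # A validly encoded escape code point: escape each of its bytes.
            out.extend(chr(ESC_BASE + b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def count_escapes(text: str) -> int:
    return sum(ESC_FIRST <= ord(c) <= ESC_LAST for c in text)


name = b"caf\xe9"                     # one invalid byte on the UNIX side
counts = []
for _ in range(3):
    win_name = to_text_strict(name)   # my application: UX => Win
    counts.append(count_escapes(win_name))
    name = win_name.encode("utf-8")   # other application: Win => UX, standard UTF-8

print(counts)  # prints [1, 3, 9]
```

    Under these assumptions each trip through the other application triples
    the number of escapes, since every escape codepoint occupies three bytes in
    standard UTF-8.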

    Of course, once you notice the problem you can simply rename the file: you
    can undo the over-escaping (no data is ever lost!), and probably rename the
    file to valid UTF-8, which is what you want anyway. And you can do it even
    from the Windows system. If you prevent my solution, you will not have my
    program in the first place, meaning you will need to go to the UNIX system
    to rename the file, or even just to access it at all.

    Actually, there are two subflavors of my conversion possible (I can hear you
    say "oh, noooo"). One does escape the escapes, the other doesn't. This
    second flavor can be used by applications that need to make UTF-8 from an
    arbitrary input, but do not need to re-create the original byte sequence.
    Basically, they are preserving all the data, except for the information about
    how many times the original invalid sequences were escaped. There may be a need
    for such applications and they would in fact reduce the re-escaping problem.
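
    The second flavor can be sketched in Python as a conversion that leaves
    already escaped input alone. The ESC_BASE range and the name to_text are
    placeholders for illustration, and the code relies on Python's
    "surrogateescape" handler rather than on any actual implementation of the
    conversion.

```python
ESC_BASE = 0xE000  # hypothetical escape range

_TO_ESC = {0xDC00 + b: ESC_BASE + b for b in range(0x80, 0x100)}


def to_text(raw: bytes) -> str:
    """Non-re-escaping flavor: invalid bytes are escaped, existing escapes pass through."""
    return raw.decode("utf-8", "surrogateescape").translate(_TO_ESC)


once = to_text(b"caf\xe9")                 # raw invalid byte gets escaped
twice = to_text(once.encode("utf-8"))      # escaped name re-read after a standard UTF-8 copy
thrice = to_text(twice.encode("utf-8"))    # and again

stable = once == twice == thrice           # no pile-up
```

    The price is visible in the same example: the byte strings b"caf\xe9" and
    once.encode("utf-8") both map to the same text, so the original byte
    sequence cannot always be re-created.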

    > Which means that UTF-8 can't be replaced with your modification.
    > If they coexisted, expect trouble when the two slightly incompatible
    > encodings meet.
    Or, expect trouble when dealing with data that is not guaranteed to be
    UTF-8. Or hope that there will be no such data, in the near future, and I mean

    > > Using my conversion, Windows can access any file on UNIX, because my
    > > conversion guarantees roundtrip UX=>Win=>UX
    > Well, with or without your conversion it's not true, because there
    > are various characters which are valid in Unix filenames but not in
    > Windows (e.g. ? * : \ and control characters). So if all filenames are
    > to be accessible, they have to introduce some escaping. And as soon
    > as an escaping scheme is used, it can be extended to encode isolated
    > bytes with high bit set.
    Good point. But you are assuming I copy the files to a Windows filesystem. I
    don't. I have no problems if you specify your filename with any of the above
    characters, even from Windows.

    And, BTW, suppose UTF-8 validation is introduced (as an option) on UNIX
    filesystems. The characters you mention (and some others, I can tell you
    exactly which don't work on Windows) could again be (optionally) rejected on
    UNIX filesystems.
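
    Such an optional check might look like the following Python sketch. The
    function name is_portable_name is hypothetical, and the reserved set covers
    only the punctuation characters and control characters that Windows
    disallows in filenames; it does not handle reserved device names or
    trailing dots and spaces.

```python
# Characters Windows does not allow in a filename component.
WINDOWS_RESERVED = set('<>:"/\\|?*')


def is_portable_name(raw: bytes) -> bool:
    """Accept a filename only if it is valid UTF-8 and usable on Windows too."""
    try:
        name = raw.decode("utf-8")     # strict: any invalid sequence is rejected
    except UnicodeDecodeError:
        return False
    if not name:
        return False
    return not any(ch in WINDOWS_RESERVED or ord(ch) < 0x20 for ch in name)


checks = {
    "plain": is_portable_name(b"report.txt"),
    "question_mark": is_portable_name(b"what?.txt"),
    "control_char": is_portable_name(b"tab\there"),
    "invalid_utf8": is_portable_name(b"caf\xe9.txt"),
}
```

    With a check like this enabled on the UNIX side, neither the invalid byte
    sequences nor the Windows-incompatible characters would ever reach the
    conversion.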

    > > Win=>UX=>Win roundtrip is not guaranteed.
    > Currently it breaks only for isolated surrogates (assuming the Unix
    > is configured to use UTF-8). If Windows filenames are specified to be
    > UTF-16, the error is clearly on the Windows side and this side should
    > be fixed.
    And in my case, it would break for some malicious sequences of the 128
    codepoints. Equally rare, and with equally minor consequences. Ummmm, and it
    can be fixed, too. Such malicious sequences could be forbidden in contexts
    where we fear they might cause problems.

