Re: Roundtripping in Unicode

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Dec 15 2004 - 07:33:34 CST

    Lars Kristan <lars.kristan@hermes.si> writes:

    > Now, it is true that data from two applications using this technique can
    > become intermixed. But this is not something we should fear. On the
    > contrary, this is why I do want to standardize the approach. Because in most
    > cases what will happen is exactly what one expects. If each of the two
    > applications chose an arbitrary escaping technique to solve the problem,
    > then you get a bigger mess.

    If one application switches from standard UTF-8 to your modification,
    and another application continues to use standard UTF-8, then the
    ability to pass arbitrary Unicode strings between them by serializing
    them to UTF-8 is lost. So you can't claim that it does not affect
    programs which don't adopt it. It would have to be adopted by all
    programs which currently use UTF-8, or data exchange would break.

    But it's not a viable replacement for UTF-8. Even if both applications
    use your modification, the ability to serialize arbitrary sequences
    of valid code points (i.e. not surrogates) through UTF-8 is lost: the
    mapping to modified UTF-8 is not injective.
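
To make the injectivity point concrete, here is a minimal Python sketch. It assumes a scheme in which every undecodable byte 0xNN is mapped to a reserved code point ESC_BASE + 0xNN and mapped back on output. The value of ESC_BASE, the handler name and the function names are hypothetical, chosen only for illustration; the details of the actual proposal may differ.

```python
import codecs

# Assumption: byte 0xNN (0x80..0xFF) <-> code point ESC_BASE + 0xNN.
# U+F780..U+F7FF lies in the Private Use Area; the choice is illustrative only.
ESC_BASE = 0xF700


def _escape_invalid(exc):
    """Error handler: turn each undecodable byte into its escape code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(ESC_BASE + b) for b in bad), exc.end


codecs.register_error("hypothetical_escape", _escape_invalid)


def decode_modified(data: bytes) -> str:
    """Bytes to string: valid UTF-8 decodes normally, stray bytes are escaped."""
    return data.decode("utf-8", errors="hypothetical_escape")


def encode_modified(text: str) -> bytes:
    """String to bytes: escape code points become raw bytes again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_BASE + 0x80 <= cp <= ESC_BASE + 0xFF:
            out.append(cp - ESC_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


genuine = "\u00e9"                                      # a real e-acute
escaped = chr(ESC_BASE + 0xC3) + chr(ESC_BASE + 0xA9)   # two escape code points

# Two different strings of valid code points produce the same bytes.
assert genuine != escaped
assert encode_modified(genuine) == encode_modified(escaped) == b"\xc3\xa9"
```

Bytes survive the trip through a string under this sketch, but strings do not survive the trip through bytes: after decoding b"\xc3\xa9" only the genuine character comes back, so the escaped string is lost.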

    Which means that UTF-8 can't be replaced with your modification.
    If they coexisted, expect trouble when the two slightly incompatible
    encodings meet.

    The GNU implementation of Java treats filenames as Java-modified
    UTF-8. This is broken in two ways. First, it's not usable in an
    environment where filenames use e.g. ISO-8859-x. Second, it's not
    correct even in a purely UTF-8 environment, because it encodes
    characters above U+FFFF differently - it uses a *non-standard*
    modification of UTF-8. Balkanization of UTF-8 is bad.
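
As an illustration of the difference, the following Python sketch encodes a string the way Java's modified UTF-8 is generally documented to work: each UTF-16 code unit is encoded separately, so a character above U+FFFF becomes two 3-byte sequences, and U+0000 becomes C0 80. The function name is made up for this example, and it is a sketch rather than a reference implementation.

```python
import struct


def java_modified_utf8(text: str) -> bytes:
    """Encode text one UTF-16 code unit at a time (illustrative sketch)."""
    raw = text.encode("utf-16-be")
    units = struct.unpack(f">{len(raw) // 2}H", raw)
    out = bytearray()
    for u in units:
        if 0x01 <= u <= 0x7F:
            out.append(u)
        elif u <= 0x7FF:
            # Two bytes; this branch also turns U+0000 into C0 80.
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            # Three bytes, used for surrogate code units as well.
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, above U+FFFF

standard = clef.encode("utf-8")       # 4 bytes: f0 9d 84 9e
modified = java_modified_utf8(clef)   # 6 bytes: ed a0 b4 ed b4 9e

# A strict UTF-8 decoder rejects the modified form.
try:
    modified.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

For characters up to U+FFFF other than U+0000 the two encodings agree, which is why the incompatibility tends to go unnoticed until a supplementary character shows up in a filename.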

    > Using my conversion, Windows can access any file on UNIX, because my
    > conversion guarantees roundtrip UX=>Win=>UX

    Well, with or without your conversion it's not true, because there
    are various characters which are valid in Unix filenames but not in
    Windows (e.g. ? * : \ and control characters). So if all filenames are
    to be accessible, they have to introduce some escaping. And as soon
    as an escaping scheme is used, it can be extended to encode isolated
    bytes with high bit set.
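
One way such an escaping scheme could look is sketched below in Python. The percent-style syntax, the handler name and the function names are assumptions made for this example, not something taken from this thread or from any existing tool. Characters that Windows rejects, control characters, and bytes that are not valid UTF-8 are all written as %XX, and a literal % is escaped too so that the mapping stays reversible.

```python
import codecs

# Characters Windows does not allow in file names ('/' cannot occur in a
# Unix file name in the first place).
FORBIDDEN = set('?*:\\"<>|')


def _percent_escape_invalid(exc):
    """Error handler: write each undecodable byte as %XX."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(f"%{b:02X}" for b in bad), exc.end


codecs.register_error("hypothetical_percent", _percent_escape_invalid)


def unix_to_windows(name: bytes) -> str:
    """Unix file name (arbitrary bytes) to a Windows-safe string."""
    # Escape literal '%' first so every '%' in the result starts an escape.
    text = name.replace(b"%", b"%25").decode(
        "utf-8", errors="hypothetical_percent")
    return "".join(
        f"%{ord(ch):02X}" if ch in FORBIDDEN or ord(ch) < 0x20 else ch
        for ch in text
    )


def windows_to_unix(name: str) -> bytes:
    """Inverse of unix_to_windows for strings that function produced."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == "%":
            out.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            out += name[i].encode("utf-8")
            i += 1
    return bytes(out)


original = b"a?b\xffc%d\xc3\xa9"   # '?', a stray 0xFF byte, '%', valid UTF-8
safe = unix_to_windows(original)
assert windows_to_unix(safe) == original
```

The sketch ignores other Windows restrictions such as reserved device names and trailing dots or spaces; the point is only that the same escape mechanism covers both forbidden characters and isolated high-bit bytes.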

    > Win=>UX=>Win roundtrip is not guaranteed.

    Currently it breaks only for isolated surrogates (assuming the Unix
    system is configured to use UTF-8). If Windows filenames are specified to be
    UTF-16, the error is clearly on the Windows side and this side should
    be fixed.
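
The isolated-surrogate case can be checked with a few lines of Python. The helper below is a hypothetical name for this example: it takes a file name as a sequence of 16-bit code units, as Windows stores it, and returns its UTF-8 form, or None when the units are not well-formed UTF-16.

```python
import struct
from typing import Optional, Sequence


def utf16_units_to_utf8(units: Sequence[int]) -> Optional[bytes]:
    """Convert UTF-16 code units to UTF-8, or None if they are ill-formed."""
    raw = struct.pack(f"<{len(units)}H", *units)
    try:
        return raw.decode("utf-16-le").encode("utf-8")
    except UnicodeDecodeError:
        # An isolated surrogate has no representation in standard UTF-8.
        return None


paired = utf16_units_to_utf8([0xD834, 0xDD1E])   # a proper surrogate pair
lone = utf16_units_to_utf8([0x0041, 0xD834])     # 'A' plus an isolated surrogate
```

Here paired is the ordinary 4-byte UTF-8 sequence for U+1D11E, while lone is None: such a name is not valid UTF-16, so it has no UTF-8 counterpart to round-trip through.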

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

