Re: Roundtripping Solved

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 16 2004 - 06:08:48 CST

    On 16/12/2004 11:36, Lars Kristan wrote:

    > ...
    >
    > > can use either U+FFFE or U+FFFF, which "are
    > > intended for process internal uses, but are not permitted for
    > > interchange." Let's call the one non-character chosen INVALID.
    > Can't. I DO want the resulting UTF-16 to be valid for interchange.
    > This is the whole purpose. And increasing the overhead is also not
    > desired.
    >
    >
    But this last requirement provides the proof that you can't have what
    you want.

    The current situation is:

    1. for all valid UTF-8 strings s8, f(s8) is a valid UTF-16 string and
    g(f(s8)) = s8
    2. for all valid UTF-16 strings s16, g(s16) is a valid UTF-8 string and
    f(g(s16)) = s16

    Your additional requirement is apparently:

    3. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and
    g(f(t8)) = t8

    But if f(t8) is a valid UTF-16 string, then by rule 2 g(f(t8)) is a
    valid UTF-8 string, while by rule 3 g(f(t8)) = t8. We have already
    stated that t8 is an INVALID UTF-8 string, so g(f(t8)) would have to be
    both valid and invalid. Your requirements are therefore provably
    inconsistent.
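
    The argument above can be checked on a small sample. The following is
    only a sketch: it assumes f and g are the standard strict conversions
    (here Python's built-in codecs, big-endian UTF-16 without a BOM), and
    the helper is_valid_utf8 and the candidate strings are made up for
    illustration. It shows that whatever valid UTF-16 string is chosen as
    f(t8), g maps it to valid UTF-8, which therefore cannot equal t8.

```python
def f(s8: bytes) -> bytes:
    """Strict conversion: valid UTF-8 bytes -> UTF-16 (big-endian, no BOM)."""
    return s8.decode("utf-8").encode("utf-16-be")


def g(s16: bytes) -> bytes:
    """Strict conversion: valid UTF-16 (big-endian, no BOM) -> UTF-8 bytes."""
    return s16.decode("utf-16-be").encode("utf-8")


def is_valid_utf8(b: bytes) -> bool:
    """Hypothetical helper: True if b decodes as strict UTF-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Rules 1 and 2 hold on a sample containing BMP and supplementary characters.
s8 = "na\u00efve \U0001D11E".encode("utf-8")
s16 = "na\u00efve \U0001D11E".encode("utf-16-be")
assert g(f(s8)) == s8
assert f(g(s16)) == s16

# An invalid UTF-8 string: 0xFF can never appear in well-formed UTF-8.
t8 = b"abc\xff"
assert not is_valid_utf8(t8)

# A few illustrative choices of a valid UTF-16 string to stand in for f(t8):
# a replacement character, a PUA character, and U+00FF.
candidates = [s.encode("utf-16-be") for s in ("abc\ufffd", "abc\uf8ff", "abc\u00ff")]

# By rule 2, g of each candidate is valid UTF-8, so none of them can be t8.
results = [g(c) for c in candidates]
assert all(is_valid_utf8(r) and r != t8 for r in results)
```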

    The only way round this is to break the functionality of g so that it
    does not correctly convert all valid UTF-16 strings to UTF-8. That will
    certainly be unacceptable to the UTC. The most you might get away with
    is a private function which does some non-standard conversion of PUA
    characters, but then you risk messing up PUA characters used by
    agreement between end users, or stored as UTF-8 in filenames.

    Alternatively, you need to relax your requirement that f(t8) is a valid
    UTF-16 string, and instead allow that it can be a UTF-16-like string but
    including something invalid like a noncharacter or an unpaired
    surrogate. This will not be technically valid for interchange, of
    course. But my suggestion of using a noncharacter as an escape is a way
    in which this could be done.
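
    As an illustration of the relaxed requirement, here is a sketch of the
    unpaired-surrogate variant (not the noncharacter escape suggested
    above). It relies on Python's "surrogateescape" error handler, which
    maps each undecodable byte 0xNN to the lone surrogate U+DCNN; the
    function names f_escaped and g_escaped are hypothetical. The result
    roundtrips, but strict converters reject it, so it is not valid for
    interchange.

```python
def f_escaped(t8: bytes) -> str:
    """Decode UTF-8, turning each invalid byte 0xNN into lone surrogate U+DCNN."""
    return t8.decode("utf-8", errors="surrogateescape")


def g_escaped(s: str) -> bytes:
    """Encode to UTF-8, turning lone surrogates U+DC80..U+DCFF back into bytes."""
    return s.encode("utf-8", errors="surrogateescape")


t8 = b"abc\xff"                 # invalid UTF-8
escaped = f_escaped(t8)         # 'abc' followed by the lone surrogate U+DCFF

# The invalid input roundtrips exactly.
assert escaped == "abc\udcff"
assert g_escaped(escaped) == t8

# Valid input is unaffected by the escaping.
s8 = "na\u00efve".encode("utf-8")
assert g_escaped(f_escaped(s8)) == s8

# A strict UTF-16 encoder refuses the unpaired surrogate: the string is
# UTF-16-like, but not valid UTF-16.
try:
    escaped.encode("utf-16-be")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok

# The code units can still be written out explicitly for internal use.
units = escaped.encode("utf-16-be", errors="surrogatepass")
assert units == b"\x00a\x00b\x00c\xdc\xff"
```

    The price is the one described above: the intermediate string contains
    an unpaired surrogate, so it is suitable for process-internal use only.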

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

