Re: Roundtripping Solved

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 16 2004 - 08:25:37 CST

    On 16/12/2004 13:20, Lars Kristan wrote:

    > ...
    > > ... So there is a
    > > mathematically proved
    > > inconsistency in your requirements.
    >
    > This only proves that requirements cannot be met by a single
    > conversion pair. If they could be met, then such a conversion could be
    > used immediately for converting to and from UTF-8.
    >
    > However, requirements 1 and 2 are actually taken from Unicode
    > standard, they are not my requirements.
    >

    Well, let's clarify. The existing situation is:

    1. for all valid UTF-8 strings s8, UTF-16(s8) is a valid UTF-16 string
    and UTF-8(UTF-16(s8)) = s8
    2. for all valid UTF-16 strings s16, UTF-8(s16) is a valid UTF-8 string
    and UTF-16(UTF-8(s16)) = s16

    These standard definitions of UTF-8 and UTF-16 will not be changed, so
    don't even think about asking for this.
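
    As a rough illustration of properties 1 and 2, here is a minimal Python
    sketch. The helper names utf16_from_utf8 and utf8_from_utf16 are made up
    for this example, and little-endian UTF-16 without a BOM is an assumption
    chosen only so that the byte strings compare cleanly.

```python
def utf16_from_utf8(s8: bytes) -> bytes:
    """The standard conversion UTF-16(s8); raises on invalid UTF-8."""
    return s8.decode("utf-8").encode("utf-16-le")


def utf8_from_utf16(s16: bytes) -> bytes:
    """The standard conversion UTF-8(s16); raises on invalid UTF-16."""
    return s16.decode("utf-16-le").encode("utf-8")


# A few valid strings, including one outside the BMP (a surrogate pair in UTF-16).
samples = ["plain ASCII", "caf\u00e9", "\u65e5\u672c\u8a9e", "\U0001d11e"]

for text in samples:
    s8 = text.encode("utf-8")
    s16 = text.encode("utf-16-le")
    # Property 1: UTF-8(UTF-16(s8)) = s8
    assert utf8_from_utf16(utf16_from_utf8(s8)) == s8
    # Property 2: UTF-16(UTF-8(s16)) = s16
    assert utf16_from_utf8(utf8_from_utf16(s16)) == s16
```

    Both helpers use Python's strict codecs, so they are only defined on
    valid input, which matches the "for all valid ..." wording above.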

    Your requirement is a pair of functions f and g, such that:

    3. for all valid UTF-8 strings s8, f(s8) = UTF-16(s8)
    4. for all valid UTF-8 strings s8, g(f(s8)) = s8
    5. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and
    g(f(t8)) = t8

    The following is apparently NOT a requirement:

    6. for all valid UTF-16 strings s16, g(s16) = UTF-8(s16)

    But note the following logical chain, which holds for all valid UTF-16
    strings s16:

    2 => s16 = UTF-16(UTF-8(s16))
    3 => s16 = f(UTF-8(s16))
    2 => UTF-8(s16) is a valid UTF-8 string, hence by 4 f(UTF-8(s16)) can be
    operated on by g
      => g(s16) = g(f(UTF-8(s16)))
    substituting UTF-8(s16) for s8:
    4 => g(s16) = UTF-8(s16)
    which proves 6.

    Hence the non-requirement is in fact a logical consequence of the
    requirements, and that is without even looking at requirement 5.

    Therefore 5 implies a contradiction. For any invalid UTF-8 string t8:

    5 => f(t8) is a valid UTF-16 string
    2 => UTF-8(f(t8)) is a valid UTF-8 string
    6 => g(f(t8)) (= UTF-8(f(t8)) ) is a valid UTF-8 string
    5 => t8 (= g(f(t8)) ) is a valid UTF-8 string

    But this contradicts the premise that t8 is an invalid UTF-8 string.
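
    The contradiction can be seen concretely in a small Python sketch. It is
    only an illustration under assumptions: a Python str stands in for a
    UTF-16 string, the two candidate pairs below (Python's "surrogateescape"
    and "replace" error handlers) are not the scheme proposed in this thread,
    and all function names are made up for the example.

```python
def is_valid_utf16(s: str) -> bool:
    """True if s can be encoded as strict UTF-16 (no unpaired surrogates)."""
    try:
        s.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


# Candidate A: escape each invalid byte as a lone surrogate.
def f_escape(b: bytes) -> str:
    return b.decode("utf-8", errors="surrogateescape")


def g_escape(s: str) -> bytes:
    return s.encode("utf-8", errors="surrogateescape")


# Candidate B: replace each invalid byte with U+FFFD, then use standard UTF-8.
def f_replace(b: bytes) -> str:
    return b.decode("utf-8", errors="replace")


def g_strict(s: str) -> bytes:
    return s.encode("utf-8")


t8 = b"abc\x80"  # invalid UTF-8: a stray continuation byte

# Candidate A round-trips, but f(t8) is not a valid UTF-16 string.
a_round_trips = g_escape(f_escape(t8)) == t8
a_valid = is_valid_utf16(f_escape(t8))

# Candidate B yields a valid UTF-16 string, but the round trip is lost.
b_round_trips = g_strict(f_replace(t8)) == t8
b_valid = is_valid_utf16(f_replace(t8))

print("escape :", "round-trips" if a_round_trips else "lossy",
      "/", "valid" if a_valid else "invalid UTF-16")
print("replace:", "round-trips" if b_round_trips else "lossy",
      "/", "valid" if b_valid else "invalid UTF-16")
```

    Each candidate satisfies one half of requirement 5 and fails the other,
    which is what the argument above says must happen for any pair f, g.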

    > How's that? Well, they are my requirements also, but instead of "for
    > all valid UTF-x strings", in my case the requirement is relaxed to
    > "for all valid UTF-8 strings that do not contain the 128 replacement
    > codepoints".
    >

    So do you mean to relax the requirement "for all valid UTF-8 strings s8,
    f(s8) = UTF-16(s8)"? The problem with this is that it is broken by
    existing filenames which (probably by chance) form the UTF-8 for one of
    your 128 replacement codepoints. Well, there are not 128 replacement
    codepoints, and never will be, certainly not in the BMP - unless you are
    talking about unpaired surrogates or the PUA.
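
    To make the filename collision concrete, here is a sketch in Python of a
    hypothetical PUA-based escape. The base U+F700, the handler name
    "pua_escape" and the functions f_pua and g_pua are all assumptions made
    for this example; the thread does not say which 128 codepoints would be
    used.

```python
import codecs

PUA_BASE = 0xF700  # hypothetical: invalid byte 0x80..0xFF -> U+F780..U+F7FF


def _pua_escape(exc):
    """Decode error handler: map each undecodable byte to a PUA codepoint."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape", _pua_escape)


def f_pua(b: bytes) -> str:
    """Hypothetical f: standard UTF-8 decoding, invalid bytes escaped to the PUA."""
    return b.decode("utf-8", errors="pua_escape")


def g_pua(s: str) -> bytes:
    """Hypothetical g: turn escape codepoints back into raw bytes."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


invalid_name = b"\x80"                   # an invalid UTF-8 filename
valid_name = "\uf780".encode("utf-8")    # a valid filename that happens to use U+F780

# Both names map to the same string, so g cannot recover both of them.
collision = f_pua(invalid_name) == f_pua(valid_name)
invalid_ok = g_pua(f_pua(invalid_name)) == invalid_name
valid_ok = g_pua(f_pua(valid_name)) == valid_name
```

    Here the invalid name survives the round trip but the valid one does not,
    so requirement 4 only holds if valid strings containing the escape
    codepoints are excluded, which is the relaxation being questioned.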
    ...

    > No, this is the most important requirement. The idea is to obtain a
    > VALID UTF-16 string. ...
    >

    Well, your requirements are logically contradictory. Sorry.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

