# Re: Roundtripping Solved

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 16 2004 - 08:25:37 CST

On 16/12/2004 13:20, Lars Kristan wrote:

> ...
> > ... So there is a
> > mathematically proved
> > inconsistency in your requirements.
>
> This only proves that requirements cannot be met by a single
> conversion pair. If they could be met, then such a conversion could be
> used immediately for converting to and from UTF-8.
>
> However, requirements 1 and 2 are actually taken from Unicode
> standard, they are not my requirements.
>

Well, let's clarify. The existing situation is:

1. for all valid UTF-8 strings s8, UTF-16(s8) is a valid UTF-16 string
and UTF-8(UTF-16(s8)) = s8
2. for all valid UTF-16 strings s16, UTF-8(s16) is a valid UTF-8 string
and UTF-16(UTF-8(s16)) = s16
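
Properties 1 and 2 can be spot-checked with a few lines of Python. This is only a sketch: the helper names `utf16_from_utf8` and `utf8_from_utf16` are made up for illustration, byte strings stand in for s8 and s16, and little-endian UTF-16 without a BOM is assumed.

```python
def utf16_from_utf8(s8: bytes) -> bytes:
    """Standard conversion UTF-16(s8); raises UnicodeDecodeError if s8 is invalid."""
    return s8.decode("utf-8").encode("utf-16-le")


def utf8_from_utf16(s16: bytes) -> bytes:
    """Standard conversion UTF-8(s16); raises UnicodeDecodeError if s16 is invalid."""
    return s16.decode("utf-16-le").encode("utf-8")


# A valid string with a non-ASCII BMP character and a supplementary one.
s8 = "na\u00efve \U0001D11E".encode("utf-8")
s16 = utf16_from_utf8(s8)

# Property 1: UTF-8(UTF-16(s8)) = s8
assert utf8_from_utf16(s16) == s8
# Property 2: UTF-16(UTF-8(s16)) = s16
assert utf16_from_utf8(utf8_from_utf16(s16)) == s16
```

Both helpers use strict decoding, so an invalid input is rejected rather than converted, which is exactly the gap that f and g below are meant to fill.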

These standard definitions of UTF-8 and UTF-16 will not be changed.

Your requirement is a pair of functions f and g, such that:

3. for all valid UTF-8 strings s8, f(s8) = UTF-16(s8)
4. for all valid UTF-8 strings s8, g(f(s8)) = s8
5. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and
g(f(t8)) = t8

The following is apparently NOT a requirement:

6. for all valid UTF-16 strings s16, g(s16) = UTF-8(s16)

But note the following logical chain, which holds for all valid UTF-16
strings s16:

2 => s16 = UTF-16(UTF-8(s16))
3 => s16 = f(UTF-8(s16))
2 => UTF-8(s16) is a valid UTF-8 string, hence by 4 f(UTF-8(s16)) can be
operated on by g
=> g(s16) = g(f(UTF-8(s16)))
substituting UTF-8(s16) for s8:
4 => g(s16) = UTF-8(s16)
which proves 6.

Hence the non-requirement is in fact a logical consequence of the
requirements, and that is without even looking at requirement 5.

Therefore 5 implies a contradiction. For any invalid UTF-8 string t8:

5 => f(t8) is a valid UTF-16 string
2 => UTF-8(f(t8)) is a valid UTF-8 string
6 => g(f(t8)) (= UTF-8(f(t8)) ) is a valid UTF-8 string
5 => t8 (= g(f(t8)) ) is a valid UTF-8 string

But this contradicts the premise that t8 is an invalid UTF-8 string.
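
One way to see what has to give is to look at a scheme that does roundtrip invalid input. The sketch below uses Python's `surrogateescape` error handler, which is not the scheme proposed in this thread; it is only used here as a readily available example. The names `f`, `g` and `is_valid_utf16` are illustrative, and a Python `str` stands in for the 16-bit side.

```python
def f(raw: bytes) -> str:
    # Each undecodable byte 0x80..0xFF becomes a lone surrogate U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def g(text: str) -> bytes:
    # Lone surrogates U+DC80..U+DCFF are turned back into the original bytes.
    return text.encode("utf-8", errors="surrogateescape")


def is_valid_utf16(text: str) -> bool:
    # Python's strict UTF-16 encoder refuses unpaired surrogates.
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


t8 = b"abc\xff"                      # invalid UTF-8: 0xFF never occurs in UTF-8
assert g(f(t8)) == t8                # the roundtrip half of requirement 5 holds
assert not is_valid_utf16(f(t8))     # but f(t8) is not a valid UTF-16 string
```

So this scheme keeps 3 and 4 and the roundtrip, and pays for it by giving up the "valid UTF-16" half of requirement 5, which is consistent with the argument above.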

> How's that? Well, they are my requirements also, but instead of "for
> all valid UTF-x strings", in my case the requirement is relaxed to
> "for all valid UTF-8 strings that do not contain the 128 replacement
> codepoints".
>

So do you mean to relax the requirement "for all valid UTF-8 strings s8,
f(s8) = UTF-16(s8)"? The problem with this is that it is broken by
existing filenames which (probably by chance) form the UTF-8 for one of
your 128 replacement codepoints. Well, there are not 128 replacement
codepoints, and never will be, certainly not in the BMP - unless you are
talking about unpaired surrogates or the PUA.
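
The filename problem can be made concrete. The following Python sketch assumes, purely for illustration, that the replacement codepoints are U+E080..U+E0FF in the PUA (that range is my invention, not something taken from the proposal), and it uses Python's `surrogateescape` handler only as a convenient way to find the undecodable bytes.

```python
PUA_BASE = 0xE000  # hypothetical: bad byte b maps to U+E000 + b


def f(raw: bytes) -> bytes:
    """Convert to valid UTF-16, mapping each bad byte to a replacement codepoint."""
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # An escaped bad byte: remap it into the hypothetical PUA range.
            out.append(chr(PUA_BASE + (cp - 0xDC00)))
        else:
            out.append(ch)
    return "".join(out).encode("utf-16-le")


invalid_name = b"\x80"                    # invalid UTF-8 (stray continuation byte)
valid_name = "\ue080".encode("utf-8")     # valid UTF-8: EE 82 80

assert invalid_name != valid_name
assert f(valid_name) == "\ue080".encode("utf-16-le")   # requirement 3 holds here
assert f(invalid_name) == f(valid_name)                # two inputs, one output
```

Since both names map to the same UTF-16 string, no g can return the valid name (requirement 4) and the invalid name (requirement 5) at once. Keeping 5 therefore means giving up 3 or 4 for any existing name that happens to contain a replacement codepoint.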
...

> No, this is the most important requirement. The idea is to obtain a
> VALID UTF-16 string. ...
>
