Re: Roundtripping Solved

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 16 2004 - 06:08:48 CST

Next message: Doug Ewell: "Re: Roundtripping Solved"

Previous message: Lars Kristan: "RE: Roundtripping Solved"
In reply to: Lars Kristan: "RE: Roundtripping Solved"
Next in thread: Lars Kristan: "RE: Roundtripping Solved"

On 16/12/2004 11:36, Lars Kristan wrote:

> ...
>
> > can use either U+FFFE or U+FFFF, which "are
> > intended for process internal uses, but are not permitted for
> > interchange." Let's call the one non-character chosen INVALID.
> Can't. I DO want the resulting UTF-16 to be valid for interchange.
> This is the whole purpose. And increasing the overhead is also not
> desired.
>
>
But your requirement that the result be valid UTF-16 is exactly what
proves that you can't have what you want.

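The current situation, writing f for the standard UTF-8 to UTF-16
conversion and g for the standard UTF-16 to UTF-8 conversion, is: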

1. for all valid UTF-8 strings s8, f(s8) is a valid UTF-16 string and
g(f(s8)) = s8
2. for all valid UTF-16 strings s16, g(s16) is a valid UTF-8 string and
f(g(s16)) = s16

Your requirements are apparently:

3. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and
g(f(t8)) = t8

But if f(t8) is a valid UTF-16 string, then by rule 2 g(f(t8)) is a valid
UTF-8 string, while by rule 3 g(f(t8)) = t8, which we have already stated
is an INVALID UTF-8 string. The same string cannot be both valid and
invalid, so your requirements are mathematically inconsistent.
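
To make the argument concrete, here is a minimal Python sketch. It
assumes that f and g are the strict standard conversions, and it uses a
hypothetical extension f_ext (my name, not anything from the thread)
that maps invalid bytes to U+FFFD so that the result is always valid
UTF-16. Any other mapping into valid UTF-16 collides with the image of
some valid UTF-8 string in the same way.

```python
def f(s8: bytes) -> bytes:
    """Strict UTF-8 -> UTF-16 (big-endian, no BOM); rejects invalid input."""
    return s8.decode("utf-8").encode("utf-16-be")


def g(s16: bytes) -> bytes:
    """Strict UTF-16 (big-endian, no BOM) -> UTF-8; rejects invalid input."""
    return s16.decode("utf-16-be").encode("utf-8")


def f_ext(b8: bytes) -> bytes:
    """Hypothetical extension of f: always yields valid UTF-16 by
    replacing undecodable bytes with U+FFFD (an assumption for illustration)."""
    return b8.decode("utf-8", errors="replace").encode("utf-16-be")


# Rule 1: valid UTF-8 round-trips through f and g.
s8 = "caf\u00e9".encode("utf-8")
assert g(f(s8)) == s8

# An invalid UTF-8 string t8 (0xFF never occurs in UTF-8).
t8 = b"abc\xff"
u16 = f_ext(t8)          # valid UTF-16, as requirement 3 demands

# Rule 2 forces g(u16) to be VALID UTF-8, so it cannot equal t8.
collision = g(u16)
assert f(collision) == u16   # a valid string already owns this UTF-16 value
assert collision != t8       # so the round trip for t8 is lost
```

The two final assertions are the contradiction in miniature: the valid
UTF-16 value produced for t8 is already the image of a valid UTF-8
string, so g must return that string rather than t8.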

The only way round this, while keeping f(t8) a valid UTF-16 string, is to
break g so that it no longer correctly converts all valid UTF-16 strings
to UTF-8. That will certainly be unacceptable to the UTC. The most you
might get away with is a private function which does some non-standard
conversion of PUA characters, but then you risk corrupting PUA characters
that end users have agreed on among themselves, or that appear in UTF-8
filenames.

Alternatively, you need to relax your requirement that f(t8) is a valid
UTF-16 string, and instead allow that it can be a UTF-16-like string but
including something invalid like a noncharacter or an unpaired
surrogate. This will not be technically valid for interchange, of
course. But my suggestion of using a noncharacter as an escape is a way
in which this could be done.
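
As an illustration of this relaxed approach, here is a minimal Python
sketch. Note that it is not my noncharacter escape scheme: it uses the
unpaired-surrogate variant, via Python's built-in "surrogateescape" and
"surrogatepass" error handlers. The names f_escaped, g_escaped and
is_valid_utf16 are hypothetical, chosen only for this example.

```python
def f_escaped(b8: bytes) -> bytes:
    """UTF-8-ish bytes -> UTF-16-like bytes. Each undecodable byte 0xNN
    becomes the lone low surrogate U+DCNN, written out unpaired."""
    text = b8.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-16-be", errors="surrogatepass")


def g_escaped(b16: bytes) -> bytes:
    """Inverse of f_escaped: lone surrogates U+DC80..U+DCFF are turned
    back into the original raw bytes."""
    text = b16.decode("utf-16-be", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")


def is_valid_utf16(b16: bytes) -> bool:
    """True if the bytes decode as strict big-endian UTF-16."""
    try:
        b16.decode("utf-16-be")
    except UnicodeDecodeError:
        return False
    return True


# Invalid UTF-8 now round-trips ...
t8 = b"abc\xff"
escaped = f_escaped(t8)
assert g_escaped(escaped) == t8

# ... but only because the intermediate form is NOT valid UTF-16.
assert not is_valid_utf16(escaped)

# Valid UTF-8 is unaffected and still yields valid UTF-16.
s8 = "caf\u00e9".encode("utf-8")
assert is_valid_utf16(f_escaped(s8))
assert g_escaped(f_escaped(s8)) == s8
```

The round trip for t8 succeeds precisely because the intermediate string
falls outside the set of valid UTF-16 strings, so rule 2 no longer
applies to it.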

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Thu Dec 16 2004 - 10:53:19 CST