From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 16 2004 - 06:08:48 CST
On 16/12/2004 11:36, Lars Kristan wrote:
> ...
>
> > can use either U+FFFE or U+FFFF, which "are
> > intended for process internal uses, but are not permitted for
> > interchange." Let's call the one non-character chosen INVALID.
> Can't. I DO want the resulting UTF-16 to be valid for interchange.
> This is the whole purpose. And increasing the overhead is also not
> desired.
>
>
But this last requirement provides the proof that you can't have what
you want.
The current situation, where f converts UTF-8 to UTF-16 and g converts
UTF-16 back to UTF-8, is:
1. for all valid UTF-8 strings s8, f(s8) is a valid UTF-16 string and
g(f(s8)) = s8
2. for all valid UTF-16 strings s16, g(s16) is a valid UTF-8 string and
f(g(s16)) = s16
Your requirements are apparently:
3. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and
g(f(t8)) = t8
But if f(t8) is a valid UTF-16 string, then by rule 2 g(f(t8)) is a valid
UTF-8 string, and by rule 3 g(f(t8)) = t8. So t8 would have to be a valid
UTF-8 string, but we have already stated that t8 is an INVALID UTF-8
string. That is a contradiction, so your requirements are provably
inconsistent.
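
The argument above can be checked with a minimal sketch, assuming that f
and g behave like Python's strict UTF-8 and UTF-16 codecs (little-endian,
no BOM). The helper is_valid_utf8 and the sample strings are illustrative
assumptions, not part of the original discussion.

```python
def f(s8: bytes) -> bytes:
    """Strict UTF-8 -> UTF-16LE conversion (assumed stand-in for f)."""
    return s8.decode("utf-8").encode("utf-16-le")


def g(s16: bytes) -> bytes:
    """Strict UTF-16LE -> UTF-8 conversion (assumed stand-in for g)."""
    return s16.decode("utf-16-le").encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Rule 1: valid UTF-8 survives the round trip through f and g.
s8 = "na\u00efve \U0001d11e".encode("utf-8")
assert g(f(s8)) == s8

# Rule 2: valid UTF-16 survives the round trip through g and f,
# and g's output is valid UTF-8.
s16 = "na\u00efve \U0001d11e".encode("utf-16-le")
assert f(g(s16)) == s16
assert is_valid_utf8(g(s16))

# Requirement 3 would need g(f(t8)) == t8 for an invalid t8. The strict f
# refuses t8 outright, and g can never return it, because everything g
# returns is valid UTF-8.
t8 = b"abc\xff"
assert not is_valid_utf8(t8)
try:
    f(t8)
    strict_f_accepts_t8 = True
except UnicodeDecodeError:
    strict_f_accepts_t8 = False
assert not strict_f_accepts_t8
```

Because every value g returns passes is_valid_utf8, no choice of f can make
g(f(t8)) equal to an invalid t8 while rule 2 still holds.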
The only way round this is to break the functionality of g so that it
does not correctly convert all valid UTF-16 strings to UTF-8. That will
certainly be unacceptable to the UTC. The most you might get away with
is a private function which does some non-standard conversion of PUA
characters, but then you risk corrupting PUA characters that are used by
agreement between end users, or that appear in UTF-8 filenames.
Alternatively, you need to relax your requirement that f(t8) is a valid
UTF-16 string, and instead allow it to be a UTF-16-like string that
includes something invalid, such as a noncharacter or an unpaired
surrogate. This will not be technically valid for interchange, of
course. But my suggestion of using a noncharacter as an escape is one way
in which this could be done.
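
As an illustration of the relaxed requirement, here is a minimal sketch of
the unpaired-surrogate variant (not the noncharacter escape), assuming
Python's built-in "surrogateescape" and "surrogatepass" error handlers. The
names f_relaxed, g_relaxed and is_valid_utf16 are hypothetical, chosen only
for this example.

```python
def f_relaxed(t8: bytes) -> bytes:
    """UTF-8-like bytes -> UTF-16-like bytes (little-endian, no BOM).

    Each undecodable byte 0xNN becomes the unpaired surrogate U+DCNN.
    """
    text = t8.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-16-le", errors="surrogatepass")


def g_relaxed(s16: bytes) -> bytes:
    """Inverse of f_relaxed: escaped surrogates become the original bytes."""
    text = s16.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")


def is_valid_utf16(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-16LE."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


t8 = b"abc\xff"                  # invalid UTF-8: 0xFF never occurs in UTF-8
s16 = f_relaxed(t8)

assert g_relaxed(s16) == t8      # the round trip now preserves t8
assert not is_valid_utf16(s16)   # but the intermediate form is not valid UTF-16

# Valid input is converted exactly as the strict codecs would convert it.
s8 = "na\u00efve".encode("utf-8")
assert f_relaxed(s8) == "na\u00efve".encode("utf-16-le")
assert g_relaxed(f_relaxed(s8)) == s8
```

The round trip succeeds only because the intermediate string contains an
unpaired surrogate, so it is UTF-16-like but not valid for interchange,
which is the trade-off described above.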
-- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/