From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 16 2004 - 08:25:37 CST
On 16/12/2004 13:20, Lars Kristan wrote:
> ...
> > ... So there is a
> > mathematically proved
> > inconsistency in your requirements.
>
> This only proves that requirements cannot be met by a single
> conversion pair. If they could be met, then such a conversion could be
> used immediately for converting to and from UTF-8.
>
> However, requirements 1 and 2 are actually taken from Unicode
> standard, they are not my requirements.
>
Well, let's clarify. The existing situation is:
1. for all valid UTF-8 strings s8, UTF-16(s8) is a valid UTF-16 string
and UTF-8(UTF-16(s8)) = s8
2. for all valid UTF-16 strings s16, UTF-8(s16) is a valid UTF-8 string
and UTF-16(UTF-8(s16)) = s16
These standard definitions of UTF-8 and UTF-16 will not be changed, so
don't even think about asking for this.
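To make statements 1 and 2 concrete, here is a minimal sketch in Python. It assumes Python's built-in strict "utf-8" and "utf-16-le" codecs stand in for the standard conversions; the helper names utf16 and utf8 are only illustrative, and a handful of samples is a spot check, not a proof.

```python
def utf16(s8: bytes) -> bytes:
    """UTF-8 -> UTF-16 (little-endian, no BOM). Raises on invalid UTF-8."""
    return s8.decode("utf-8").encode("utf-16-le")


def utf8(s16: bytes) -> bytes:
    """UTF-16 (little-endian, no BOM) -> UTF-8. Raises on invalid UTF-16."""
    return s16.decode("utf-16-le").encode("utf-8")


# Sample texts: ASCII, Latin-1 range, CJK, and a non-BMP character.
samples = ["plain ASCII", "caf\u00e9", "\u65e5\u672c\u8a9e", "\U0001F600"]

for text in samples:
    s8 = text.encode("utf-8")
    s16 = text.encode("utf-16-le")
    assert utf8(utf16(s8)) == s8      # statement 1
    assert utf16(utf8(s16)) == s16    # statement 2
```

Note that both helpers raise an exception on invalid input, which is exactly the case that requirement 5 below tries to cover.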
Your requirement is a pair of functions f and g, such that:
3. for all valid UTF-8 strings s8, f(s8) = UTF-16(s8)
4. for all valid UTF-8 strings s8, g(f(s8)) = s8
5. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and
g(f(t8)) = t8
The following is apparently NOT a requirement:
6. for all valid UTF-16 strings s16, g(s16) = UTF-8(s16)
But note the following logical chain, which holds for all valid UTF-16
strings s16:
2 => s16 = UTF-16(UTF-8(s16))
3 => s16 = f(UTF-8(s16))
2 => UTF-8(s16) is a valid UTF-8 string, hence by 4 f(UTF-8(s16)) can be
operated on by g
=> g(s16) = g(f(UTF-8(s16)))
substituting UTF-8(s16) for s8:
4 => g(s16) = UTF-8(s16)
which proves 6.
Hence the non-requirement is in fact a logical consequence of the
requirements, and that is without even looking at requirement 5.
Therefore 5 implies a contradiction. For any invalid UTF-8 string t8:
5 => f(t8) is a valid UTF-16 string
2 => UTF-8(f(t8)) is a valid UTF-8 string
6 => g(f(t8)) (= UTF-8(f(t8)) ) is a valid UTF-8 string
5 => t8 (= g(f(t8)) ) is a valid UTF-8 string
But this contradicts the premise that t8 is an invalid UTF-8 string.
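The contradiction can also be seen by experiment. The sketch below assumes Python's strict codecs as the standard conversions, and takes g to be the function forced by statement 6. The candidate values for f(t8) are arbitrary examples of my own choosing; the point is that any valid UTF-16 choice fails in the same way.

```python
def utf8(s16: bytes) -> bytes:
    """UTF-16 (little-endian, no BOM) -> UTF-8. Raises on invalid UTF-16."""
    return s16.decode("utf-16-le").encode("utf-8")


def is_valid(data: bytes, codec: str) -> bool:
    """True if data decodes without error under the given strict codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


def forced_g(s16: bytes) -> bytes:
    """By statement 6, g must agree with UTF-8() on every valid UTF-16 string."""
    return utf8(s16)


t8 = b"\x80abc"  # an invalid UTF-8 string (stray continuation byte)
assert not is_valid(t8, "utf-8")

# Hypothetical choices for f(t8); each is a valid UTF-16 string.
candidates = [s.encode("utf-16-le") for s in ("\ufffdabc", "\ue080abc", "?abc")]

for s16 in candidates:
    assert is_valid(s16, "utf-16-le")
    out = forced_g(s16)
    assert is_valid(out, "utf-8")   # g's output is always valid UTF-8
    assert out != t8                # so it can never equal the invalid t8
```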
> How's that? Well, they are my requirements also, but instead of "for
> all valid UTF-x strings", in my case the requirement is relaxed to
> "for all valid UTF-8 strings that do not contain the 128 replacement
> codepoints".
>
So do you mean to relax the requirement "for all valid UTF-8 strings s8,
f(s8) = UTF-16(s8)"? The problem with this is that it is broken by
existing filenames which (probably by chance) form the UTF-8 for one of
your 128 replacement codepoints. Well, there are not 128 replacement
codepoints, and never will be, certainly not in the BMP - unless you are
talking about unpaired surrogates or the PUA.
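Here is a sketch of how such a filename breaks the scheme. The mapping is purely hypothetical: I assume the 128 replacement codepoints are U+E080..U+E0FF in the PUA, one per invalid byte 0x80..0xFF, which is my own choice for illustration and not something proposed in this thread. Python's "surrogateescape" error handler is used only as a convenient way to locate the invalid bytes.

```python
def f(s8: bytes) -> bytes:
    """Hypothetical f: standard conversion for valid UTF-8, and each invalid
    byte 0x80..0xFF replaced by the assumed codepoint U+E000 + byte."""
    # surrogateescape turns each undecodable byte b into U+DC00 + b.
    text = s8.decode("utf-8", "surrogateescape")
    mapped = "".join(
        chr(0xE000 + (ord(ch) - 0xDC00)) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )
    return mapped.encode("utf-16-le")


invalid_name = b"\x80"                  # invalid UTF-8: stray continuation byte
valid_name = "\ue080".encode("utf-8")   # valid UTF-8 for U+E080: b"\xee\x82\x80"

# Both names map to the same UTF-16 string, so no function g can return
# invalid_name for one and valid_name for the other.
assert f(invalid_name) == f(valid_name)
```

Since f is not one-to-one here, either requirement 4 fails for the valid name or requirement 5 fails for the invalid one.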
...
> No, this is the most important requirement. The idea is to obtain a
> VALID UTF-16 string. ...
>
Well, your requirements are logically contradictory. Sorry.
-- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/