From: Lars Kristan (email@example.com)
Date: Thu Dec 16 2004 - 05:36:37 CST
Peter Kirk wrote in response to Arcane Jill:
> > 3) There exists an inverse function, g(), such that g(a) == b if and
> > only if f(b) == a.
> Lars seems to have extended the requirement here such that a can be any
> sequence of 16-bit words, just as b can be any sequence of octets, i.e.
> he requires not only that g(f(b)) == b for all b, but also that f(g(a))
> == a for all a. That may make things much harder! There is at least a
> need to deal with unpaired surrogates.
I should have analyzed Jill's mail more carefully. This must be a
My requirement is that g(f(b))=b, which is NON-UTF-8 => UTF-16 => NON-UTF-8.
However, f(g(a))=a was not my requirement. I even assert the two cannot be
achieved at the same time.
If the two requirements could be met at the same time, there would be no
problem and everybody would accept the solution since meeting f(g(a))=a
keeps all Unicoders happy.
There are other requirements, or at least wishes. One is that f(b) for a
single byte b should be a single BMP code point.
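
To make the asymmetry between the two requirements concrete, here is a
minimal Python sketch of one possible escaping scheme. It is an illustration
under stated assumptions, not necessarily the algorithm discussed in this
thread: the escape range U+EE80..U+EEFF (private use) is a hypothetical
choice, a Python str stands in for the UTF-16 sequence, and f() and g() follow
the naming used above. Each byte that is not part of valid UTF-8 becomes one
BMP code point, and any escape code point that already appears in the input is
itself escaped byte by byte so that g(f(b)) == b still holds.

```python
# Hypothetical escape range: byte 0xNN (0x80..0xFF) <-> U+EENN (private use).
ESC_BASE = 0xEE00


def _is_escape(ch: str) -> bool:
    return 0xEE80 <= ord(ch) <= 0xEEFF


def f(b: bytes) -> str:
    """Arbitrary octets -> Unicode string (stand-in for UTF-16)."""
    # surrogateescape turns each undecodable byte 0xNN into lone U+DCNN.
    s = b.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in s:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Invalid byte: replace the lone surrogate with one BMP escape.
            out.append(chr(ESC_BASE + (cp - 0xDC00)))
        elif _is_escape(ch):
            # Escape code point that was validly encoded in the input:
            # escape each of its UTF-8 bytes so g() can restore them.
            out.extend(chr(ESC_BASE + byte) for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def g(a: str) -> bytes:
    """Unicode string -> octets; assumes a contains no lone surrogates."""
    out = bytearray()
    for ch in a:
        if _is_escape(ch):
            out.append(ord(ch) - ESC_BASE)
        else:
            out.extend(ch.encode("utf-8"))
    return bytes(out)


# g(f(b)) == b holds, even for invalid UTF-8 and for input that
# already contains an escape code point.
b = b"caf\xc3\xa9 \xff\x80" + "\uee80".encode("utf-8")
assert g(f(b)) == b

# f(g(a)) == a does not hold in general: two escapes that happen to
# spell a valid UTF-8 sequence collapse into a single character.
a = "\ueec3\ueea9"
assert g(a) == b"\xc3\xa9"
assert f(g(a)) == "\u00e9" and f(g(a)) != a
```

In this sketch the failure of f(g(a)) == a comes from escape sequences that
spell out valid UTF-8 when converted back to bytes, so one of the two
round trips has to give.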
I think devising new algorithms will not help. What would be useful would be
a proof that my algorithm doesn't break the rules of Unicode. OK, it does.
So, try again: What would be useful would be a proof that my algorithm
doesn't break the *functionality* that Unicode rules provide.
> can use either U+FFFE or U+FFFF, which "are
> intended for process internal uses, but are not permitted for
> interchange." Let's call the one non-character chosen INVALID.
Can't. I DO want the resulting UTF-16 to be valid for interchange. This is
the whole purpose. And increasing the overhead is also not desired.