From: Arcane Jill (email@example.com)
Date: Thu Dec 16 2004 - 07:12:17 CST
I don't think that that last sentence is true. f(), and its near-inverse,
g(), do not claim to be UTFs, and are functions intended to be used only by
one particular suite of applications. They therefore have nothing to do with
Unicode or the UTC (... or even this list!). The fact that I defined f
such that f(s) == utf8decode(s) for all valid UTF-8 streams s does not
change the status of f() as a purely private-use function.
These are the steps I see happening:
(1) start with an arbitrary octet stream
(2) "escape" it, using some function (which I have called f), to yield a
valid UTF-8 stream.
(3) allow normal Unicode functions to round-trip this UTF-8 string through
UTF-16 (one of Lars's requirements)
(4) finally, "unescape" the UTF-8 using f's inverse function (which I called
g) to restore the original octet stream
The escape and unescape functions don't need to be approved by anyone. I'm
not suggesting they should be part of any standard - they are merely a
mechanism to ensure that step (3) will hold true.
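
A minimal sketch of steps (1) to (4) in Python, offered only as an illustration. It assumes the escape that this message attributes to Lars (an undecodable byte x becomes the character U+EE00 + x); the handler name "octet-escape" and the helper via_utf16 are invented for the sketch and are not part of anyone's actual implementation.

```python
import codecs

ESCAPE_BASE = 0xEE00  # byte x -> U+EE00 + x (assumed escape, per the message)

def _escape_invalid(exc):
    """Decode-error handler: map each undecodable byte to one private-use character."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(ESCAPE_BASE + b) for b in bad), exc.end

# "octet-escape" is a made-up handler name used only in this sketch.
codecs.register_error("octet-escape", _escape_invalid)

def f(octets: bytes) -> bytes:
    """Step (2): escape arbitrary octets into a valid UTF-8 stream."""
    return octets.decode("utf-8", errors="octet-escape").encode("utf-8")

def via_utf16(utf8: bytes) -> bytes:
    """Step (3): an ordinary UTF-8 -> UTF-16 -> UTF-8 round trip."""
    utf16 = utf8.decode("utf-8").encode("utf-16-le")
    return utf16.decode("utf-16-le").encode("utf-8")

def g(utf8: bytes) -> bytes:
    """Step (4): unescape, turning U+EE80..U+EEFF back into single bytes."""
    out = bytearray()
    for ch in utf8.decode("utf-8"):
        cp = ord(ch)
        if ESCAPE_BASE + 0x80 <= cp <= ESCAPE_BASE + 0xFF:
            out.append(cp - ESCAPE_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

# Step (1): valid UTF-8 mixed with two stray bytes.
raw = b"caf\xc3\xa9 \xff\x80 ok"
assert g(via_utf16(f(raw))) == raw
```

Note that g here is only a near-inverse: input that already contains a genuine character in U+EE80..U+EEFF would not survive, which is why a rarer escape sequence is being sought.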
Lars's current implementation of this scheme is that his "f" "escapes" the
binary octet 1bbbbbbb to 11101110 1011101b 10bbbbbb (or equivalently, byte x
becomes the character U+EE00 + x). He is unhappy with this because
characters in the range U+EE80 to U+EEFF might be found in real text. So you
and I have, between us, suggested three alternative escaping functions, in
an attempt to find an escape sequence with a vanishingly small probability
of being found in real text. I'm not quite sure why Lars isn't happy with
these suggestions - maybe his goal has still not been clearly stated - but
either way, since nobody is proposing an amendment to UTFs, it surely isn't
the business of the UTC.
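
As a quick check of the bit pattern quoted above, the following Python sketch encodes U+EE00 + x for every high byte x and compares the result with 11101110 1011101b 10bbbbbb. The function name escape_char is hypothetical. The last assertion illustrates the stated worry: an escaped byte is indistinguishable from a genuine private-use character in real text.

```python
def escape_char(x: int) -> str:
    """Hypothetical helper: the character that stands in for high byte x."""
    return chr(0xEE00 + x)

for x in range(0x80, 0x100):
    b0, b1, b2 = escape_char(x).encode("utf-8")
    assert b0 == 0b11101110                    # fixed lead byte
    assert b1 == 0b10111010 | ((x >> 6) & 1)   # 1011101b: carries bit 6 of x
    assert b2 == 0b10000000 | (x & 0x3F)       # 10bbbbbb: carries the low six bits of x

# Escaped byte 0x85 and a genuine U+EE85 in real text encode identically.
genuine = "\uee85"
assert escape_char(0x85).encode("utf-8") == genuine.encode("utf-8")
```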
Hope I haven't misunderstood things completely. That would be /so/
From: Peter Kirk [mailto:firstname.lastname@example.org]
Sent: 16 December 2004 12:09
To: Lars Kristan
Cc: Arcane Jill; Unicode
Subject: Re: Roundtripping Solved
The only way round this is to break the functionality of g so that it
does not correctly convert all valid UTF-16 strings to UTF-8. That will
certainly be unacceptable to the UTC.