Re: Roundtripping Solved

From: Arcane Jill
Date: Thu Dec 16 2004 - 07:12:17 CST

    I don't think that that last sentence is true. f(), and its near-inverse,
    g(), do not claim to be UTFs, and are functions intended to be used only by
    one particular suite of applications. They therefore have nothing to do with
    Unicode or the UTC (... or even this list!). The fact that I defined f
    such that f(s) == utf8decode(s) for all valid UTF-8 streams s does not
    change the status of f() as a purely private-use function.

    These are the steps I see happening:
    (1) start with an arbitrary octet stream
    (2) "escape" it, using some function (which I have called f), to yield a
    valid UTF-8 stream.
    (3) allow normal Unicode functions to round-trip this UTF-8 string through
    UTF-16 (one of Lars's requirements)
    (4) finally, "unescape" the UTF-8 using f's inverse function (which I called
    g) to restore the original octet stream
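
    As a rough illustration of steps (1) to (4), here is a minimal Python
    sketch. It assumes one particular escape, namely that every octet x which
    is not part of a valid UTF-8 sequence becomes the character U+EE00 + x.
    The names ESCAPE_BASE and roundtrip are mine, purely for illustration,
    and this is not anybody's actual implementation.

```python
ESCAPE_BASE = 0xEE00  # assumption: escape byte x as the character U+EE00 + x


def f(octets: bytes) -> str:
    """Escape: decode valid UTF-8 as usual, map every other byte x to U+EE00 + x."""
    out = []
    i = 0
    while i < len(octets):
        # Try to decode exactly one character of 1 to 4 bytes starting at i.
        for width in (1, 2, 3, 4):
            try:
                ch = octets[i:i + width].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(ch)
            i += width
            break
        else:
            # No valid sequence starts here, so escape this single byte.
            out.append(chr(ESCAPE_BASE + octets[i]))
            i += 1
    return "".join(out)


def g(text: str) -> bytes:
    """Unescape: turn U+EE80..U+EEFF back into raw bytes, encode the rest as UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - ESCAPE_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def roundtrip(octets: bytes) -> bytes:
    utf8 = f(octets).encode("utf-8")                   # step (2): valid UTF-8
    utf16 = utf8.decode("utf-8").encode("utf-16-le")   # step (3): UTF-8 -> UTF-16
    utf8_again = utf16.decode("utf-16-le").encode("utf-8")  # ... and back
    return g(utf8_again.decode("utf-8"))               # step (4): unescape


raw = b"abc\xff\xc3\xa9\x80"   # step (1): arbitrary octets, partly invalid UTF-8
assert roundtrip(raw) == raw
assert f(b"caf\xc3\xa9") == "caf\u00e9"   # f agrees with a plain UTF-8 decode
```

    Step (3) uses only the standard codecs. Everything private to the
    application lives in f and g.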

    The escape and unescape functions don't need to be approved by anyone. I'm
    not suggesting they should be part of any standard - they are merely a
    mechanism to ensure that step (3) will hold true.

    Lars's current implementation of this scheme is that his "f" "escapes" the
    binary octet 1bbbbbbb to 11101110 1011101b 10bbbbbb (or equivalently, byte x
    becomes the character U+EE00 + x). He is unhappy with this because
    characters in the range U+EE80 to U+EEFF might be found in real text. So you
    and I have, between us, suggested three alternative escaping functions, in
    an attempt to find an escape sequence with a vanishingly small probability
    of being found in real text. I'm not quite sure why Lars isn't happy with
    these suggestions - maybe his goal has still not been clearly stated - but
    either way, since nobody is proposing an amendment to UTFs, it surely isn't
    the business of the UTC.
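
    To make the bit pattern above concrete, the following Python sketch
    checks that 1bbbbbbb -> 11101110 1011101b 10bbbbbb really is the UTF-8
    encoding of U+EE00 + x, and shows the ambiguity Lars is worried about.
    The function name escape_byte_bits is hypothetical, chosen only for this
    example.

```python
def escape_byte_bits(x: int) -> bytes:
    """Build 11101110 1011101b 10bbbbbb from the octet x = 1bbbbbbb."""
    if not 0x80 <= x <= 0xFF:
        raise ValueError("only octets with the high bit set are escaped")
    b6 = (x >> 6) & 1   # the single 'b' that lands in the second byte
    low6 = x & 0x3F     # the six 'b's that land in the third byte
    return bytes([0b11101110, 0b10111010 | b6, 0b10000000 | low6])


# The bit pattern and "U+EE00 + x" describe the same three bytes.
for x in range(0x80, 0x100):
    assert escape_byte_bits(x) == chr(0xEE00 + x).encode("utf-8")

# The ambiguity: genuine text that happens to contain U+EE80 is
# byte-for-byte identical to the escaped form of the stray octet 0x80.
real_text = "\uEE80".encode("utf-8")
assert real_text == escape_byte_bits(0x80)
```

    An unescape function cannot tell these two cases apart, which is why
    the alternative suggestions aim for an escape sequence that is very
    unlikely to occur in real text.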

    Hope I haven't misunderstood things completely. That would be /so/

    -----Original Message-----
    From: Peter Kirk
    Sent: 16 December 2004 12:09
    To: Lars Kristan
    Cc: Arcane Jill; Unicode
    Subject: Re: Roundtripping Solved

    The only way round this is to break the functionality of g so that it
    does not correctly convert all valid UTF-16 strings to UTF-8. That will
    certainly be unacceptable to the UTC.
