Re: Roundtripping Solved

From: Peter Kirk
Date: Thu Dec 16 2004 - 09:24:38 CST

    On 16/12/2004 13:12, Arcane Jill wrote:

    > I don't think that that last sentence is true. f(), and its
    > near-inverse, g(), do not claim to be UTFs, and are functions intended
    > to be used only by one particular suite of applications. They are
    > therefore nothing to do with Unicode or the UTC (... or even this list
    > ! ). ...

    But Lars is continuing to insist on 128 reserved characters in the BMP.
    That is relevant to the UTC.

    He now seems to want to take them from the Yi Extensions block, and
    seems to be prepared to take the risk of being assassinated by the Yi,
    although not by other nations. Well, I don't know much about the Yi, but
    I did find "The Yi have long been known as fierce warriors." They are
    not a dead people who can't fight back against being pushed out of the
    BMP. And no doubt Michael Everson will also fight fiercely for the Yi
    Extensions block. So, be careful, Lars!

    > ...The fact that I defined f such that f(s) == utf8decode(s) for all
    > valid UTF-8 streams s does not change the status of f() as a purely
    > private-use function.
    > These are the steps I see happening:
    > (1) start with an arbitrary octet stream
    > (2) "escape" it, using some function (which I have called f), to yield
    > a valid UTF-8 stream.
    > (3) allow normal Unicode functions to round-trip this UTF-8 string
    > through UTF-16 (one of Lars' requirements)
    > (4) finally, "unescape" the UTF-8 using f's inverse function (which I
    > called g) to restore the original octet stream
    > The escape and unescape functions don't need to be approved by anyone.
    > I'm not suggesting they should be part of any standard - they are
    > merely a mechanism to ensure that step (3) will hold true.

    These mechanisms, and any escape mechanism, do not meet the requirement
    which I codified as "for all valid UTF-8 strings s8, f(s8) =
    UTF-16(s8)". If this is not in fact a requirement, your mechanism can be
    made to work, and my logical proof against it fails. But perhaps this is
    what Lars means by "They don't translate as UTF-8 would to UTF-16": his
    reserved characters would be an exception to "for all valid UTF-8
    strings s8, f(s8) = UTF-16(s8)". In principle this is a way ahead.

    In what follows, I presume that this is still a requirement.

    > Lars's current implementation of this scheme is that his "f" "escapes"
    > the binary octet 1bbbbbbb to 11101110 1011101b 10bbbbbb (or
    > equivalently, byte x becomes the character U+EE00 + x). He is unhappy
    > with this because characters in the range U+EE80 to U+EEFF might be
    > found in real text. So you and I have, between us, suggested three
    > alternative escaping functions, in an attempt to find an escape
    > sequence with a vanishingly small probability of being found in real
    > text. I'm not quite sure why Lars isn't happy with these suggestions -
    > maybe his goal has still not been clearly stated - but either way,
    > since nobody is proposing an amendment to UTFs, it surely isn't the
    > business of the UTC.
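
    The four steps quoted earlier, combined with the escape described in the
    quote above (byte x becomes U+EE00 + x), can be sketched in Python as
    follows. This is only an illustration under stated assumptions: the names
    f and g follow the thread's notation, while roundtrip, the helper
    function, and the error-handler name "ee-escape" are made up here.
    Nothing below is Lars's actual code.

```python
import codecs

BASE = 0xEE00  # from the quoted description: stray byte x -> U+EE00 + x


def _escape_stray_bytes(err):
    """Error handler: replace each undecodable byte with chr(BASE + byte)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(BASE + b) for b in bad), err.end


# "ee-escape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("ee-escape", _escape_stray_bytes)


def f(octets: bytes) -> str:
    """Step 2: escape. Agrees with plain UTF-8 decoding on valid input."""
    return octets.decode("utf-8", errors="ee-escape")


def g(text: str) -> bytes:
    """Step 4: unescape. U+EE80..U+EEFF become single bytes again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if BASE + 0x80 <= cp <= BASE + 0xFF:
            out.append(cp - BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def roundtrip(octets: bytes) -> bytes:
    """Steps 1-4, passing through UTF-16 in the middle (step 3)."""
    escaped = f(octets)
    utf16 = escaped.encode("utf-16-le")
    return g(utf16.decode("utf-16-le"))


raw = b"abc\xffdef\x80"          # not valid UTF-8
assert roundtrip(raw) == raw     # stray bytes survive the trip
assert f("h\u00e9llo".encode("utf-8")) == "h\u00e9llo"  # valid input unchanged
```

    The round trip holds only while the input contains no valid UTF-8
    encoding of U+EE80 to U+EEFF, which is the worry about real text raised
    in the quoted paragraph.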

    The problem can be restated quite simply. Valid UTF-8 has a reversible
    one-to-one mapping to valid Unicode character sequences, and to valid
    UTF-16. If "f" maps an "invalid UTF-8" string to a valid Unicode
    character sequence, then that sequence is also the image of a valid
    UTF-8 string, namely its own UTF-8 encoding. The mapping "f" is then no
    longer one-to-one but many-to-one, which implies that there cannot be a
    reverse mapping "g" that recovers every input.
    Lars is rightly dissatisfied with any solution which does not guarantee

    I note that this argument applies equally to Lars' favoured solution of
    128 special characters. If these are valid Unicode characters, they have
    a valid UTF-8 representation. Both this representation and the isolated
    bytes will be converted by "f" to the same Unicode characters. This
    means that "f" is still not one-to-one and so irreversible. That is,
    unless Lars is actually proposing a change to the standard UTF-8 mapping
    for these characters. And if he is, that is certainly a matter for the
    UTC. Or of course if he is abandoning "for all valid UTF-8 strings s8,
    f(s8) = UTF-16(s8)".
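
    The collision argument can be checked mechanically. The sketch below
    builds an "f" for any block of 128 assigned code points and shows that a
    stray byte and the valid UTF-8 encoding of its escape character produce
    the same output. The names make_f and collision, the handler names, and
    the second base value 0xA500 are hypothetical choices for illustration,
    not anyone's proposal.

```python
import codecs


def make_f(base: int):
    """Return an f that maps each stray byte x to chr(base + x)."""
    name = f"escape-{base:04x}"  # hypothetical handler name for this sketch

    def handler(err):
        if not isinstance(err, UnicodeDecodeError):
            raise err
        bad = err.object[err.start:err.end]
        return "".join(chr(base + b) for b in bad), err.end

    codecs.register_error(name, handler)
    return lambda octets: octets.decode("utf-8", errors=name)


def collision(base: int, x: int = 0xFF):
    """Return two different octet strings that f sends to the same text."""
    f = make_f(base)
    stray = bytes([x])                       # invalid UTF-8: an isolated byte
    valid = chr(base + x).encode("utf-8")    # valid UTF-8 for the escape char
    assert f(stray) == f(valid)              # same image under f
    return stray, valid


# 0xEE00 is the base from the thread; 0xA500 is an arbitrary stand-in for
# "some other block of 128 BMP characters".
for base in (0xEE00, 0xA500):
    stray, valid = collision(base)
    assert stray != valid   # two distinct inputs, one output: no inverse g
```

    Changing the base only moves the collision. It goes away only if valid
    UTF-8 for those code points is decoded differently from the standard
    mapping, which is the exception to the requirement discussed above.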

    Peter Kirk
