Re: Roundtripping Solved

From: Peter Kirk
Date: Wed Dec 15 2004 - 06:54:04 CST

    On 15/12/2004 11:11, Arcane Jill wrote:

    > I followed (and understood) Lars' explanation as to why the NOT-xxxx
    > solution wouldn't work for him. Shame really - but here's another bash
    > at a solution, again without breaking the Unicode model. If I have
    > understood this correctly, these are Lars' requirements:
    > 1) There exists a function, f(), which maps an arbitrary octet stream
    > to a sequence of Unicode characters
    > 2) A required property of f() is that, if any substring of its input
    > is valid UTF-8, then f() must convert that substring to the sequence
    > of Unicode characters which would have been obtained by UTF-8 itself.
    > 3) There exists an inverse function, g(), such that g(a) == b if and
    > only if f(b) == a.

    Lars seems to have extended the requirement here such that a can be any
    sequence of 16-bit words, just as b can be any sequence of octets, i.e.
    he requires not only that g(f(b)) == b for all b, but also that f(g(a))
    == a for all a. That may make things much harder! There is at least a
    need to deal with unpaired surrogates.
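
To make the difference between the two round-trip directions concrete, here is a small sketch in Python. It does not use anything proposed in this thread: it borrows Python's built-in `surrogateescape` error handler (which maps each undecodable byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF) purely as one example of an f()/g() pair. The names `f` and `g` follow the quoted requirements; everything else is an assumption made for illustration.

```python
def f(b: bytes) -> str:
    # Valid UTF-8 decodes as usual; each undecodable byte 0x80..0xFF
    # becomes a lone surrogate U+DC80..U+DCFF.
    return b.decode("utf-8", "surrogateescape")


def g(a: str) -> bytes:
    # Reverse direction: lone surrogates U+DC80..U+DCFF become raw bytes again.
    return a.encode("utf-8", "surrogateescape")


raw = b"caf\xe9"            # Latin-1 bytes, not valid UTF-8
assert g(f(raw)) == raw     # g(f(b)) == b holds for this input

a = "\udcc3\udca9"          # two escape characters that f() never emits side by side
assert g(a) == b"\xc3\xa9"  # the two bytes together are valid UTF-8 for U+00E9
assert f(g(a)) != a         # so f(g(a)) == a fails for this a
```

With this pair, g(f(b)) == b can hold for every octet string, yet f(g(a)) == a breaks as soon as a contains escape characters whose bytes happen to join into valid UTF-8. That is the harder half of the extended requirement.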

    > As Unicoders have pointed out, these goals appear to be mutually
    > contradictory, unless we assume the following corollary, which I
    > shall call "requirement 4".
    > 4) A second required property of f() is that, if any octet of its
    > input is not part of a valid UTF-8 substring, then f() must convert
    > that octet to a Unicode character string /which cannot possibly appear
    > in Unicode plain text/.
    > It is for reasons of requirement (4) that Lars proposes the
    > introduction of 128 BMP codepoints. His intention is that they be
    > marked as "reserved - do not use", so that requirement 4 is met.
    > Naturally, this proposal has met with a lot of resistance, and almost
    > certainly would never get approved by the UC. Therefore, I propose an
    > alternative solution, as follows:
    > ...
    > Now everything will work. Unicode is not broken. All UTFs are
    > interchangeable as before; Lars's "escape aware" applications can use
    > the functions f() and g() instead of UTF-8 transformations; all other
    > Unicode applications will retain Lars's data uncorrupted, and he can
    > "unescape" it (that is, apply function g()) at the appropriate time to
    > recover the original data.
    > That do?
    > Jill
    Jill, again your solution is ingenious. But would it not work just as
    well for Lars' purposes to use, instead of your string of random
    characters, just ONE reserved code point followed by U+0xx? Instead of
    asking the UTC to allocate a specific code point for this (which it
    probably will not do), he can use either U+FFFE or U+FFFF, which "are
    intended for process internal uses, but are not permitted for
    interchange." Let's call the one non-character chosen INVALID.

    Of course a problem arises if the original filename consists of a string
    which is the UTF-8 representation of INVALID. Does this in fact count as
    valid UTF-8? (If it does, an alternative might be to use an unpaired
    surrogate for INVALID, because the UTF-8 representation of a surrogate
    is invalid UTF-8.) Even if it does, it does not represent valid Unicode,
    and so the conversion routine can convert the UTF-8 for INVALID as if it
    was three invalid bytes.
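
The scheme described above can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not a tested design: it assumes U+FFFF is the non-character chosen as INVALID, that each invalid byte 0xNN is written as INVALID followed by the character U+00NN, and that the UTF-8 form of INVALID itself (EF BF BF) is escaped as three invalid bytes. The helper `_decode_one` is a hypothetical name introduced here.

```python
INVALID = "\uffff"  # assumption: U+FFFF is the non-character used as the marker


def _decode_one(data: bytes, i: int):
    """Return (char, length) for a valid UTF-8 sequence starting at i, else None."""
    for n in (1, 2, 3, 4):  # a UTF-8 sequence is 1 to 4 bytes long
        try:
            return data[i:i + n].decode("utf-8"), n
        except UnicodeDecodeError:
            pass
    return None


def f(data: bytes) -> str:
    """Decode valid UTF-8; escape every other byte as INVALID + U+00xx."""
    out = []
    i = 0
    while i < len(data):
        hit = _decode_one(data, i)
        if hit is not None and hit[0] != INVALID:
            out.append(hit[0])
            i += hit[1]
        else:
            # Invalid byte, or the UTF-8 form of INVALID: escape a single
            # byte. Escaped bytes are always 0x80..0xFF, since ASCII is valid.
            out.append(INVALID + chr(data[i]))
            i += 1
    return "".join(out)


def g(text: str) -> bytes:
    """Inverse of f(): turn INVALID + U+00xx back into the raw byte 0xxx."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == INVALID:
            if i + 1 >= len(text) or not 0x80 <= ord(text[i + 1]) <= 0xFF:
                raise ValueError("INVALID not followed by U+0080..U+00FF")
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


name = b"caf\xe9.txt"          # a filename in Latin-1, not valid UTF-8
assert g(f(name)) == name      # the original octets are recovered
```

Note that g() here rejects strings that f() could never have produced, such as INVALID followed by an ASCII letter, so this sketch only addresses the g(f(b)) == b direction.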

    Peter Kirk