Roundtripping Solved

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Wed Dec 15 2004 - 05:11:50 CST


    I followed (and understood) Lars's explanation as to why the NOT-xxxx
    solution wouldn't work for him. Shame really - but here's another bash at a
    solution, again without breaking the Unicode model. If I have understood
    this correctly, these are Lars' requirements:

    1) There exists a function, f(), which maps an arbitrary octet stream to a
    sequence of Unicode characters
    2) A required property of f() is that, if any substring of its input is
    valid UTF-8, then f() must convert that substring to the sequence of Unicode
    characters which would have been obtained by UTF-8 itself.
    3) There exists an inverse function, g(), such that g(a) == b if and only if
    f(b) == a.

    As Unicoders have pointed out, these goals appear to be mutually
    contradictory, unless we assume the following corollary, which I shall call
    "requirement 4".

    4) A second required property of f() is that, if any octet of its input is
    not part of a valid UTF-8 substring, then f() must convert that octet to a
    Unicode character string /which cannot possibly appear in Unicode plain
    text/.

    It is for reasons of requirement (4) that Lars proposes the introduction of
    128 BMP codepoints. His intention is that they be marked as "reserved - do
    not use", so that requirement 4 is met. Naturally, this proposal has met
    with a lot of resistance, and almost certainly would never get approved by
    the UC. Therefore, I propose an alternative solution, as follows:

    DEFINITION - "f" is a function which maps an arbitrary octet stream to a
    sequence of Unicode characters, such that (1) any substring which happens to
    be valid UTF-8 is mapped to the sequence of Unicode characters which would
    have been produced by UTF-8, and (2) all remaining single octets, xx (with xx
    necessarily such that 0x80 <= xx <= 0xFF) are each mapped to the sequence:
    { U+0C55E3, U+01ED7A, U+05FDCB, U+09C351, U+07E168, U+0BBC80, U+107C09,
    U+0BA458, U+064188, U+048375, U+08ACE0, U+031DEF, U+00xx } (I got those
    numbers from a true random number generator).
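
    A minimal Python sketch of f as defined above, for illustration only. The
    names f, ESCAPE and the handler name "jill-escape" are assumptions made
    for this example. It relies on Python's strict UTF-8 decoder, which
    rejects overlong forms and encoded surrogates, and escapes one octet at a
    time before resuming normal decoding.

```python
import codecs

# The twelve fixed code points from the definition; the octet value follows
# them as a thirteenth code point U+00xx.
ESCAPE = "".join(map(chr, (
    0x0C55E3, 0x01ED7A, 0x05FDCB, 0x09C351, 0x07E168, 0x0BBC80,
    0x107C09, 0x0BA458, 0x064188, 0x048375, 0x08ACE0, 0x031DEF,
)))


def _escape_invalid_octet(err):
    """Replace exactly one undecodable octet, then resume right after it."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    octet = err.object[err.start]  # always in 0x80..0xFF for UTF-8 errors
    return ESCAPE + chr(octet), err.start + 1


# Hypothetical handler name, registered once at import time.
codecs.register_error("jill-escape", _escape_invalid_octet)


def f(octets: bytes) -> str:
    """Decode valid UTF-8 normally; escape every other octet."""
    return octets.decode("utf-8", errors="jill-escape")
```

    Each ESCAPE code point lies above U+FFFF, so the twelve of them take 48
    octets when encoded as UTF-8.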

    OBSERVATION - Requirement (4) is not met absolutely; however, the
    probability of the UTF-8 encoding of this sequence occurring "accidentally"
    at an arbitrary offset in an arbitrary octet stream is approximately one in
    2^384 (the twelve fixed code points encode to 48 octets); the probability
    of its occurring in /plain text/ is even smaller. This means that if your
    application were capable of processing one terabyte of data per second,
    you would expect to encounter this sequence by accident about once every
    2^319 years. (For reference, the Universe is somewhere around 2^34 years
    old.) This means that requirement 4 is "effectively met", even if not
    actually met.
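
    The arithmetic behind this estimate can be checked with a few lines of
    Python. This is a rough sketch: it assumes uniformly random octets, a
    terabyte of 10^12 bytes, a year of 365.25 days, and a Universe age of
    about 1.38 * 10^10 years, none of which is stated in the proposal itself.

```python
import math

# Twelve fixed code points, each above U+FFFF, so four UTF-8 octets apiece.
ESCAPE_CODE_POINTS = 12
OCTETS_PER_CODE_POINT = 4
prefix_bits = ESCAPE_CODE_POINTS * OCTETS_PER_CODE_POINT * 8

# Assumed throughput: one terabyte (10**12 octets) per second, with one
# candidate offset per octet.
offsets_per_second = 10**12
seconds_per_year = 365.25 * 24 * 3600
log2_offsets_per_year = math.log2(offsets_per_second * seconds_per_year)

# Expected waiting time, in years, for one accidental match of the prefix.
log2_years_between_hits = prefix_bits - log2_offsets_per_year

# Assumed age of the Universe, in years.
log2_universe_age = math.log2(1.38e10)

print(f"prefix is {prefix_bits} bits")
print(f"one accidental hit every 2^{log2_years_between_hits:.0f} years")
print(f"Universe is about 2^{log2_universe_age:.0f} years old")
```

    With these inputs the waiting time comes out near 2^319 years, against a
    Universe age near 2^34 years.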

    DEFINITION - "g" is the inverse function of f. By the observation above, f
    is not strictly injective (an octet stream containing the UTF-8 encoding
    of the escape sequence yields the same output as the single invalid octet
    does), so in the event of ambiguity, the sequence {
    U+0C55E3, U+01ED7A, U+05FDCB, U+09C351, U+07E168, U+0BBC80, U+107C09,
    U+0BA458, U+064188, U+048375, U+08ACE0, U+031DEF, U+00xx } is /always/
    assumed to map to the single octet xx. The probability of this choice being
    wrong is as stated above.
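
    A matching Python sketch of g, again only as an illustration. The names
    g and ESCAPE are assumptions made for this example. A run of the twelve
    fixed code points followed by a code point in U+0080..U+00FF becomes the
    single octet xx; everything else is encoded as ordinary UTF-8.

```python
# The twelve fixed code points from the definition of f.
ESCAPE = "".join(map(chr, (
    0x0C55E3, 0x01ED7A, 0x05FDCB, 0x09C351, 0x07E168, 0x0BBC80,
    0x107C09, 0x0BA458, 0x064188, 0x048375, 0x08ACE0, 0x031DEF,
)))


def g(text: str) -> bytes:
    """Undo f: turn each complete escape back into its octet."""
    out = bytearray()
    pos = 0
    while True:
        hit = text.find(ESCAPE, pos)
        if hit == -1:
            break
        tail = text[hit + len(ESCAPE):hit + len(ESCAPE) + 1]
        if tail and 0x80 <= ord(tail) <= 0xFF:
            # Complete escape: emit the text before it, then the raw octet.
            out += text[pos:hit].encode("utf-8")
            out.append(ord(tail))
            pos = hit + len(ESCAPE) + 1
        else:
            # Not followed by U+0080..U+00FF: pass it through as plain text.
            out += text[pos:hit + 1].encode("utf-8")
            pos = hit + 1
    out += text[pos:].encode("utf-8")
    return bytes(out)


# An invalid octet 0x80 escaped between two ordinary characters.
print(g("\u00e9" + ESCAPE + "\x80" + "z"))  # b'\xc3\xa9\x80z'
```

    Text that never contains the escape sequence passes through g as plain
    UTF-8 encoding, so "escape unaware" data is unaffected.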

    Now everything will work. Unicode is not broken. All UTFs are
    interchangeable as before; Lars's "escape aware" applications can use the
    functions f() and g() instead of UTF-8 transformations; all other Unicode
    applications will retain Lars's data uncorrupted, and he can "unescape" it
    (that is, apply function g()) at the appropriate time to recover the
    original data.

    That do?
    Jill



    This archive was generated by hypermail 2.1.5 : Wed Dec 15 2004 - 05:19:23 CST