RE: Roundtripping Solved

From: Lars Kristan
Date: Thu Dec 16 2004 - 07:20:24 CST


    Peter Kirk wrote:
    > But this last requirement provides the proof that you can't have what
    > you want.
    > The current situation is:
    > 1. for all valid UTF-8 strings s8, f(s8) is a valid UTF-16 string and
    > g(f(s8)) = s8
    > 2. for all valid UTF-16 strings s16, g(s16) is a valid UTF-8 string and
    > f(g(s16)) = s16
    > Your requirements are apparently:
    > 3. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and
    > g(f(t8)) = t8
    > But if f(t8) is a valid UTF-16 string, by rule 2 g(f(t8)) is a valid
    > UTF-8 string, and by rule 3 g(f(t8)) = t8. But we have already stated
    > that t8 is an INVALID UTF-8 string. So there is a mathematically proved
    > inconsistency in your requirements.

    This only proves that the requirements cannot be met by a single
    conversion pair. If they could be met, then such a conversion could be used
    immediately for converting to and from UTF-8.

    However, requirements 1 and 2 are actually taken from the Unicode standard;
    they are not my requirements.

    How's that? Well, they are my requirements also, but instead of "for all
    valid UTF-x strings", in my case the requirement is relaxed to "for all
    valid UTF-8 strings that do not contain the 128 replacement codepoints".

    > The only way round this is to break the functionality of g so that it
    > does not correctly convert all valid UTF-16 strings to UTF-8. That will
    > certainly be unacceptable to the UTC.
    Why not? g(y) does not claim to produce UTF-8 and is not intended to. f(x)
    is used on "unclean,-not-really-binary,-but-mostly-UTF-8" data. And g(y)
    produces such data.

    g(f(x)) is very useful. It preserves all the data and round-trips.
    f(g(y)) is not problematic. It behaves like UTF16(UTF8(s16)) for all
    codepoints except the infamous 128, which is acceptable in my case. Or,
    well, it would be if everyone agreed what those 128 codepoints are and what
    their purpose is.

    Even more, f(x) only produces sequences of the 128 codepoints for which
    f(g(y))=y is actually true.

    Furthermore, today, y should not contain any of the 128 codepoints (assuming
    the UTC takes unassigned codepoints and assigns them today). Any occurrences
    after today shall be interpreted according to their intended meaning.

    Sequences for which f(g(y)) is NOT y can be declared invalid sequences.
    Applications dealing with security could reject them. For the rest, anything
    that happens will only be amusing, rarely confusing, never dangerous: no
    more so than with any other escaping technique, and considerably less so
    than the inability to access files, or even files being displayed with
    missing characters (or no characters at all).
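
    A minimal Python sketch of the f/g pair described above. Everything named
    here is an assumption for illustration only: the 128 replacement codepoints
    are stood in for by the Private Use range U+F780..U+F7FF (byte b maps to
    BASE + b), and the handler name "escape128" is hypothetical. No such block
    has been assigned. f() turns mostly-UTF-8 bytes into a string that contains
    no surrogates and therefore encodes to valid UTF-16; g() reverses it.

```python
import codecs

# ASSUMPTION: placeholder block for the 128 replacement codepoints.
BASE = 0xF700
FIRST, LAST = BASE + 0x80, BASE + 0xFF


def _escape_handler(exc):
    """Map each undecodable byte to its replacement codepoint."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(BASE + b) for b in bad), exc.end


# "escape128" is a hypothetical handler name chosen for this sketch.
codecs.register_error("escape128", _escape_handler)


def f(data: bytes) -> str:
    """Decode mostly-UTF-8 bytes; invalid bytes become replacement codepoints."""
    return data.decode("utf-8", errors="escape128")


def g(text: str) -> bytes:
    """Encode to UTF-8, turning replacement codepoints back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if FIRST <= cp <= LAST:
            out.append(cp - BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def is_roundtrip_safe(y: str) -> bool:
    """True if f(g(y)) == y; a security-minded application could reject the rest."""
    return f(g(y)) == y


dirty = b"abc\xff\xfe\xc3"                 # not valid UTF-8
assert g(f(dirty)) == dirty                # all data preserved
f(dirty).encode("utf-16-le")               # encodes without error

# Two escapes that together spell a valid UTF-8 sequence (C3 A9).
suspicious = chr(BASE + 0xC3) + chr(BASE + 0xA9)
assert not is_roundtrip_safe(suspicious)
```

    Note that in this sketch g(f(x)) = x holds only for input that does not
    already contain the placeholder codepoints as valid UTF-8, which is the
    relaxed form of requirement 1 stated earlier.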

    > Alternatively, you need to relax your requirement that f(t8) is a valid
    > UTF-16 string, and instead allow that it can be a UTF-16-like string but
    > including something invalid like a noncharacter or an unpaired
    > surrogate. This will not be technically valid for interchange, of
    > course. But my suggestion of using a noncharacter as an escape is a way
    > in which this could be done.

    No, this is the most important requirement. The idea is to obtain a VALID
    UTF-16 string. Interchange is vital. Otherwise I cannot even use a Unicode
    database to store them. Obtaining a semi-valid string achieves nothing.
    Might as well stick with the original 'binary' stream (well,
    8-bit-opaque-nul-terminated-string). Which is terribly impractical.


    This archive was generated by hypermail 2.1.5 : Thu Dec 16 2004 - 07:27:19 CST