Re: Roundtripping in Unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 13 2004 - 17:56:21 CST

    That's exactly the same response and idea as Ken's, which I gave to Lars for
    the case where he wants valid code points. But I also argued that this does
    not offer roundtripping, only a better substitution than U+FFFD. The
    conversion is not completely lossless, because the substitutes produced by
    such a private convention become indistinguishable from legal input that
    contained no encoding error:

    If you convert invalid input bytes nn to U+EEnn, then you can't reverse
    U+EEnn back to bytes nn without also converting any correctly encoded
    U+EEnn that was present in the original input stream.
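
    A minimal sketch of that collision, in Python. The handler name
    "eenn_substitute" and both function names are hypothetical, chosen only
    for illustration; the U+EEnn mapping is the convention described above.

```python
import codecs

# Hypothetical convention from the text: invalid byte nn -> U+EEnn.
def _substitute(err):
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(0xEE00 + b) for b in bad), err.end

codecs.register_error("eenn_substitute", _substitute)

def decode_substituting(data):
    """Decode UTF-8, replacing each invalid byte nn with U+EEnn."""
    return data.decode("utf-8", "eenn_substitute")

def encode_reversing(text):
    """Naive reverse: every U+EEnn becomes the raw byte nn."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE00 <= cp <= 0xEEFF:
            out.append(cp - 0xEE00)      # assumed to be a substitute
        else:
            out += ch.encode("utf-8")
    return bytes(out)

invalid_input = b"a\xffb"                    # contains an invalid byte
valid_input = "a\uEEFFb".encode("utf-8")     # correctly encoded U+EEFF

# Both inputs decode to the same string, so the decoder is not injective.
same = decode_substituting(invalid_input) == decode_substituting(valid_input)
# Reversing restores the invalid input but corrupts the valid one.
restored_invalid = encode_reversing(decode_substituting(invalid_input))
restored_valid = encode_reversing(decode_substituting(valid_input))
```

    Here the invalid stream survives the trip, but the valid stream comes
    back as the invalid one, which is why this is only a substitution.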

    So I don't call that "roundtripping" (the conversion is not fully
    bijective), but "substitution" as this conversion CANNOT be safely reversed.
    Such substitution is one-way only.

    The only way to perform roundtripping of invalid input bytes to internal
    code units, is to convert these bytes to invalid sequences of code units for
    internal processing. This way you are certain that internal processing code
    units (even if they are invalid) will not be equal to other valid internal
    code units that could be reversed illegally to invalid output bytes (doing
    so would!

    So if an input can contain invalid bytes in the UTF-8 stream, these bytes
    must be converted (if full roundtripping is needed) to invalid sequences of
    code units (with an extended UTF-16 internal processing, one can use 0xFFFE
    and 0xFFFF as markers before an isolated trailing surrogate; with an
    extended UTF-32 internal processing, one can use code units above 0x10FFFF).
    Doing this does not even require any private agreement.
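
    A minimal sketch of the 32-bit variant, in Python, assuming the internal
    form is a list of integers and that an invalid byte nn is stored as
    0x110000 + nn (that offset and the function names are my assumptions,
    not a defined convention). It relies on Python's standard
    "surrogateescape" error handler only to locate the invalid bytes.

```python
ESCAPE_BASE = 0x110000  # assumed offset; anything above 0x10FFFF is never valid

def utf8_to_units(data):
    """Decode UTF-8 bytes to 32-bit code units, escaping invalid bytes."""
    units = []
    # surrogateescape maps each undecodable byte nn to lone U+DCnn.
    for ch in data.decode("utf-8", "surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            units.append(ESCAPE_BASE + (cp - 0xDC00))  # invalid byte nn
        else:
            units.append(cp)                           # valid scalar value
    return units

def units_to_utf8(units):
    """Inverse of utf8_to_units: escaped units become raw bytes again."""
    out = bytearray()
    for u in units:
        if u > 0x10FFFF:
            out.append(u - ESCAPE_BASE)
        else:
            out += chr(u).encode("utf-8")
    return bytes(out)

# Mixed stream: valid text, an invalid byte, a valid U+EEFF, a truncated lead byte.
stream = b"a\xff" + "\uEEFF".encode("utf-8") + b"\xc3"
units = utf8_to_units(stream)
roundtripped = units_to_utf8(units)
```

    Because no valid input can ever produce a unit above 0x10FFFF, the
    escaped values cannot collide with legal text, and the stream comes back
    byte for byte.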

    Same thing if processing UTF-16BE or UTF-16LE input streams with invalid
    byte sequences: the internal processing can be performed in UTF-8 or UTF-32
    using invalid sequences of 8-bit or 32-bit code units.
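
    A sketch of the UTF-16BE case with 32-bit internal code units, in Python.
    The choices here are assumptions made for illustration: an unpaired
    surrogate is kept as its own value (0xD800..0xDFFF, invalid in UTF-32),
    and an odd trailing byte nn is stored as 0x110000 + nn.

```python
def utf16be_to_units(data):
    """Decode UTF-16BE bytes to 32-bit units, keeping invalid parts invalid."""
    units = []
    n = len(data) - (len(data) % 2)   # length of the whole 16-bit units
    i = 0
    while i < n:
        u = int.from_bytes(data[i:i + 2], "big")
        if 0xD800 <= u <= 0xDBFF and i + 4 <= n:
            lo = int.from_bytes(data[i + 2:i + 4], "big")
            if 0xDC00 <= lo <= 0xDFFF:
                # Well-formed surrogate pair: combine into one scalar value.
                units.append(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
                i += 4
                continue
        units.append(u)               # BMP scalar, or unpaired surrogate as-is
        i += 2
    if len(data) % 2:
        units.append(0x110000 + data[-1])   # dangling odd byte
    return units

def units_to_utf16be(units):
    """Inverse of utf16be_to_units."""
    out = bytearray()
    for u in units:
        if u > 0x10FFFF:
            out.append(u - 0x110000)
        elif u >= 0x10000:
            v = u - 0x10000
            out += (0xD800 + (v >> 10)).to_bytes(2, "big")
            out += (0xDC00 + (v & 0x3FF)).to_bytes(2, "big")
        else:
            out += u.to_bytes(2, "big")
    return bytes(out)

# Valid text, then a lone low surrogate, a lone high surrogate, and an odd byte.
stream = "A\U0001F600".encode("utf-16-be") + b"\xdc\x00" + b"\xd8\x3d" + b"\x41"
units = utf16be_to_units(stream)
roundtripped = units_to_utf16be(units)
```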

    ----- Original Message -----
    From: "Mark Davis" <mark.davis@jtcsv.com>
    To: "Kenneth Whistler" <kenw@sybase.com>; <lars.kristan@hermes.si>
    Cc: <unicode@unicode.org>
    Sent: Monday, December 13, 2004 11:04 PM
    Subject: Re: Roundtripping in Unicode

    > Ken is absolutely right. It would be theoretically possible to add 128
    > code points that would allow one to roundtrip a bytestream after passing
    > through a UTF-8 <=> UTF-32 conversion. (For that matter, it would be
    > possible to add 2048 code points that would allow the same for a 16-bit
    > data stream.)


