Re: Roundtripping in Unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 13 2004 - 17:56:21 CST

Next message: Doug Ewell: "Re: RE: Roundtripping in Unicode"

Previous message: Philippe Verdy: "Re: Nicest UTF"
In reply to: Mark Davis: "Re: Roundtripping in Unicode"
Next in thread: Lars Kristan: "RE: Roundtripping in Unicode"

That's exactly the same response and idea that Ken and I gave to Lars for the
case where he wants valid code points. But I also argued that this does not
offer roundtripping, only a better substitution than U+FFFD: the conversion
is not completely lossless, because the substitutes chosen by private
convention become indistinguishable from legal input that contained no
encoding error.

If you convert invalid input bytes nn to U+EEnn, then you can't reverse
U+EEnn back to bytes nn without also converting correctly encoded U+EEnn
that would have been present on the original input stream.

So I don't call that "roundtripping" (the conversion is not fully
bijective), but "substitution" as this conversion CANNOT be safely reversed.
Such substitution is one-way only.
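
To make the collision concrete, here is a minimal sketch in Python. The handler name "pua_substitute" and the helpers decode_with_substitution and naive_reverse are hypothetical names made up for this illustration; only the U+EEnn mapping comes from the discussion above.

```python
import codecs

def pua_substitute(err):
    # Hypothetical error handler: replace each invalid byte nn with U+EEnn.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(0xEE00 + b) for b in bad), err.end

codecs.register_error("pua_substitute", pua_substitute)

def decode_with_substitution(data):
    # UTF-8 decoding where invalid bytes become private-use characters.
    return data.decode("utf-8", errors="pua_substitute")

def naive_reverse(text):
    # Attempted inverse: turn every U+EEnn back into the single byte nn.
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE00 <= cp <= 0xEEFF:
            out.append(cp - 0xEE00)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

invalid_input = b"a\x80b"                  # lone continuation byte 0x80
valid_input = "a\uee80b".encode("utf-8")   # correctly encoded U+EE80

# Both inputs decode to the same string, so the origin is lost.
print(decode_with_substitution(invalid_input) == decode_with_substitution(valid_input))  # True
# Reversing therefore corrupts the input that was valid to begin with.
print(naive_reverse(decode_with_substitution(valid_input)) == valid_input)  # False
```

The invalid stream does come back intact, but only at the price of damaging any stream that legitimately contained U+EE80.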

The only way to roundtrip invalid input bytes through internal code units is
to convert these bytes to invalid sequences of code units for internal
processing. This way you are certain that the internal code units standing
in for them (even if they are invalid) will not be equal to other valid internal
code units that could be reversed illegally to invalid output bytes (doing
so would!

So if an input can contain invalid bytes in the UTF-8 stream, these bytes
must be converted (if full roundtripping is needed) to invalid sequences of
code units (with an extended UTF-16 internal processing, one can use 0xFFFE
and 0xFFFF as markers before an isolated trailing surrogate; with an
extended UTF-32 internal processing, one can use code units above 0x10FFFF).
Doing this does not even require any private agreement.
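
A rough sketch of the extended UTF-32 variant, under some assumptions of mine: the internal form is a plain list of integers, an invalid byte nn is stored as 0x110000 + nn (the base value ESCAPE_BASE is an arbitrary choice, not something fixed by this thread), and Python's built-in "surrogateescape" error handler is used only as a convenient way to locate the invalid bytes.

```python
ESCAPE_BASE = 0x110000  # assumption: first value above the Unicode code space

def to_extended_utf32(data):
    # Decode UTF-8 into a list of 32-bit code units. Valid text yields
    # ordinary scalar values; each invalid byte nn yields ESCAPE_BASE + nn,
    # which no valid input can ever produce.
    units = []
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:      # byte escaped by surrogateescape
            units.append(ESCAPE_BASE + (cp - 0xDC00))
        else:
            units.append(cp)
    return units

def from_extended_utf32(units):
    # Exact inverse: out-of-range units become their original byte again.
    out = bytearray()
    for u in units:
        if u > 0x10FFFF:
            out.append(u - ESCAPE_BASE)
        else:
            out += chr(u).encode("utf-8")
    return bytes(out)

samples = [b"a\x80b", "a\uee80b".encode("utf-8"), b"\xff\xfe", b"\xed\xb0\x80"]
for raw in samples:
    # Every byte string, valid or not, comes back unchanged.
    print(from_extended_utf32(to_extended_utf32(raw)) == raw)  # True
```

Because the escape values lie outside the range of valid code points, the invalid byte 0x80 and a correctly encoded U+EE80 stay distinct internally, which is what the private-use substitution cannot guarantee.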

Same thing if processing UTF-16BE or UTF-16LE input streams with invalid
byte sequences: the internal processing can be performed in UTF-8 or UTF-32
using invalid sequences of 8-bit or 32-bit code units.
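
The same idea can be sketched for UTF-16BE input processed as 32-bit code units. This is only an illustration under my own assumptions: an unpaired surrogate is kept as its own value (which is already an invalid UTF-32 code unit), and a dangling odd byte at the end of the stream is stored as 0x110000 + nn. The names ODD_BYTE_BASE, utf16be_to_units and units_to_utf16be are hypothetical.

```python
import struct

ODD_BYTE_BASE = 0x110000  # assumption: marker range for a dangling final byte

def utf16be_to_units(data):
    # Convert UTF-16BE bytes to 32-bit code units without losing anything.
    n = len(data) // 2
    words = struct.unpack(">%dH" % n, data[:2 * n])
    units = []
    i = 0
    while i < n:
        w = words[i]
        if (0xD800 <= w <= 0xDBFF and i + 1 < n
                and 0xDC00 <= words[i + 1] <= 0xDFFF):
            # Well-formed surrogate pair: combine into one scalar value.
            units.append(0x10000 + ((w - 0xD800) << 10) + (words[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP scalar value, or an unpaired surrogate kept as-is
            # (a surrogate value is invalid as a 32-bit code unit).
            units.append(w)
            i += 1
    if len(data) % 2:
        units.append(ODD_BYTE_BASE + data[-1])
    return units

def units_to_utf16be(units):
    # Exact inverse of utf16be_to_units.
    out = bytearray()
    for u in units:
        if u > 0x10FFFF:
            out.append(u - ODD_BYTE_BASE)
        elif u >= 0x10000:
            v = u - 0x10000
            out += struct.pack(">HH", 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))
        else:
            out += struct.pack(">H", u)
    return bytes(out)

raw = b"\x00a\xdc\x00\xd8\x3d\xde\x00\x00"  # 'a', lone low surrogate, U+1F600, odd byte
print(units_to_utf16be(utf16be_to_units(raw)) == raw)  # True
```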

----- Original Message -----
From: "Mark Davis" <mark.davis@jtcsv.com>
To: "Kenneth Whistler" <kenw@sybase.com>; <lars.kristan@hermes.si>
Cc: <unicode@unicode.org>
Sent: Monday, December 13, 2004 11:04 PM
Subject: Re: Roundtripping in Unicode

> Ken is absolutely right. It would be theoretically possible to add 128 code
> points that would allow one to roundtrip a bytestream after passing through
> a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add
> 2048 code points that would allow the same for a 16-bit data stream.)

This archive was generated by hypermail 2.1.5 : Mon Dec 13 2004 - 17:58:49 CST