Doug Ewell
Thu Sep 06 2007

    I'll see if I can find the thread where we talked about that, years ago.

    Somebody wanted to build that capability into an extension to UTF-8, so
    it could faithfully represent invalid garbage. We were never able to
    get him to work through what he wanted to do with the garbage thus

    Ccing Unicode in case anyone knows.
    I don't know of any public ones. Years ago in ICU we tossed around the 
    idea of having something like that. It was roughly the following:
    - Reserve 256 code points for "bytes that couldn't be converted".
    - Reserve one code point for a "quote character".
    When converting from a source, say possibly mangled UTF-8, convert all 
    valid sequences normally, except that a quote character is inserted 
    before any of the 256 items above. Any invalid sequence is converted to 
    a sequence of the appropriate ones of the 256 code points. When 
    converting back, the quote character plus the following code point is 
    converted directly, and any other occurrence of the 256 is emitted as 
    the corresponding byte. (The 257 code points could be private use.)
    This would round-trip all bytes in a buffer between any single charset X 
    and Unicode. However, as soon as you get into a situation where you 
    could be outputting the resulting Unicode to a different charset Y, then 
    it looked like it started to break down. So it was little more than 
    lunch conversation.
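
The scheme described above can be sketched roughly as follows. This is only an illustration, not anything ICU shipped: the private-use range U+E000..U+E0FF for the 256 byte code points, U+E100 as the quote character, and the function names are all assumptions made for the example. It also assumes the quote character itself is quoted when it appears in valid input, which the description does not spell out.

```python
# Hypothetical assignments (not from any standard or from ICU):
BYTE_BASE = 0xE000   # U+E000..U+E0FF stand for raw bytes 0x00..0xFF
QUOTE = "\uE100"     # the "quote character"


def _is_reserved(ch: str) -> bool:
    """True for any of the 257 reserved code points."""
    return BYTE_BASE <= ord(ch) <= BYTE_BASE + 0xFF or ch == QUOTE


def decode_lossless(data: bytes) -> str:
    """Possibly mangled UTF-8 -> Unicode, keeping invalid bytes recoverable."""
    out = []
    i = 0
    while i < len(data):
        # Try the shortest slice at position i that is one valid UTF-8 sequence.
        for n in range(1, 5):
            try:
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            if _is_reserved(ch):
                out.append(QUOTE)  # a genuine reserved code point gets quoted
            out.append(ch)
            i += n
            break
        else:
            # No valid sequence starts here: map the single byte to its code point.
            out.append(chr(BYTE_BASE + data[i]))
            i += 1
    return "".join(out)


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == QUOTE and i + 1 < len(text):
            # Quote + following code point: convert the code point directly.
            out += text[i + 1].encode("utf-8")
            i += 2
        elif BYTE_BASE <= ord(ch) <= BYTE_BASE + 0xFF:
            # Unquoted byte code point: emit the original raw byte.
            out.append(ord(ch) - BYTE_BASE)
            i += 1
        else:
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)


raw = b"ok \xff\xfe caf\xc3\xa9 \xee\x80\x81"  # invalid bytes plus a real U+E001
assert encode_lossless(decode_lossless(raw)) == raw
```

Encoding the escaped string to some other charset Y, for example `decode_lossless(b"\xff").encode("latin-1")`, raises `UnicodeEncodeError` because the reserved code points have no mapping there, which is one concrete face of the breakdown mentioned above.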
    Steve Bush
    I read somewhere that there were some proposals to work out a lossless 
    scheme for round-tripping binary data (i.e. all illegal UTF 
    bytes/sequences) to UTF and back again.
    Can anyone point me in the direction of these efforts?
    Steve Bush
    NEOSYS Dubai.
