Re: [icu-support] complete binary/utf mapping

From: Doug Ewell
Date: Thu Sep 06 2007 - 23:20:03 CDT

    I'll see if I can find the thread where we talked about that, years ago.

    Somebody wanted to build that capability into an extension to UTF-8, so
    it could faithfully represent invalid garbage. We were never able to
    get him to work through what he wanted to do with the garbage thus

    Doug Ewell · Fullerton, California, USA · RFC 4645 · UTN #14
    ----- Original Message ----- 
    From: Mark Davis
    To: ; ICU support mailing list ; Unicode
    Sent: Thursday, September 6, 2007 12:34
    Subject: Re: [icu-support] complete binary/utf mapping
    Ccing Unicode in case anyone knows.

    I don't know of any public ones. Years ago in ICU we tossed around the
    idea of having something like that. It was roughly the following:

      - Reserve 256 code points for "bytes that couldn't be converted"
      - Reserve one code point for a "quote character"

    When converting from a source, say possibly mangled UTF-8, convert all
    valid sequences normally, except that a quote character is inserted
    before any of the 256 items above. Any invalid sequence is converted to
    a sequence of the appropriate ones of the 256 code points. When
    converting back, the quote character + following code point is converted
    directly, and any other of the 256 are emitted as bytes. (The 257 code
    points could be private use.)

    This would round-trip all bytes in a buffer between any single charset X
    and Unicode. However, as soon as you get into a situation where you
    could be outputting the resulting Unicode to a different charset Y, then
    it looked like it started to break down. So it was little more than
    lunch conversation.
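
A minimal sketch of the scheme described above, for the UTF-8 case only. Everything concrete here is an assumption made for illustration and not something ICU ever shipped: the private-use range U+E000..U+E0FF for the 256 byte code points, U+E100 for the quote character, and the function names `bytes_to_text` and `text_to_bytes`. The sketch also quotes the quote character itself when it appears in valid input, which the description does not spell out but which a full round trip seems to need.

```python
# Hypothetical private-use assignments (assumptions, not part of any standard).
BYTE_BASE = 0xE000          # U+E000..U+E0FF stand for raw bytes 0x00..0xFF
QUOTE = chr(0xE100)         # marks "the next code point is a genuine character"


def _is_reserved(ch: str) -> bool:
    """True for any of the 257 reserved code points."""
    return BYTE_BASE <= ord(ch) < BYTE_BASE + 256 or ch == QUOTE


def bytes_to_text(data: bytes) -> str:
    """Decode possibly mangled UTF-8 without losing any byte."""
    out = []
    pos = 0
    while pos < len(data):
        rest = data[pos:]
        try:
            good, bad = rest.decode("utf-8"), b""
            pos = len(data)
        except UnicodeDecodeError as err:
            # err.start/err.end index into `rest`: valid prefix, then bad bytes.
            good = rest[:err.start].decode("utf-8")
            bad = rest[err.start:err.end]
            pos += err.end
        for ch in good:
            # Genuine characters that collide with the reserved set get quoted.
            out.append(QUOTE + ch if _is_reserved(ch) else ch)
        # Each unconvertible byte becomes its own reserved code point.
        out.extend(chr(BYTE_BASE + b) for b in bad)
    return "".join(out)


def text_to_bytes(text: str) -> bytes:
    """Invert bytes_to_text, restoring the original bytes exactly."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch == QUOTE:
            quoted = next(chars, None)
            if quoted is None:
                raise ValueError("quote character at end of text")
            out += quoted.encode("utf-8")       # converted directly
        elif BYTE_BASE <= ord(ch) < BYTE_BASE + 256:
            out.append(ord(ch) - BYTE_BASE)     # emitted as a raw byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    mangled = b"ok \xff\xfe caf\xc3\xa9 \xe2\x82"
    assert text_to_bytes(bytes_to_text(mangled)) == mangled
```

This only covers the single-charset case (X = UTF-8 in both directions). It does nothing about the case where the resulting text is later written out in a different charset Y, which is where the idea looked like it started to break down.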
    On 9/6/07, Steve Bush <> wrote:
    I read somewhere that there were some proposals to work out a lossless
    scheme for round-tripping binary (i.e. all illegal UTF bytes/sequences)
    to UTF and back again.

    Can anyone point me in the direction of these efforts?

    Steve Bush
    NEOSYS Dubai.

    This archive was generated by hypermail 2.1.5 : Thu Sep 06 2007 - 23:25:36 CDT