Re: [icu-support] complete binary/utf mapping

From: Mark Davis
Date: Thu Sep 06 2007 - 14:34:13 CDT

    Ccing Unicode in case anyone knows.

    I don't know of any public ones. Years ago in ICU we tossed around the idea
    of having something like that. It was roughly the following:

       - Reserve 256 code points for "bytes that couldn't be converted"
       - Reserve one code point for a "quote character"

    When converting from a source, say possibly mangled UTF-8, convert all valid
    sequences normally, except that the quote character is inserted before any of
    the reserved code points that occurs as real text in the source. Each byte of
    an invalid sequence is converted to the corresponding one of the 256 code
    points. When converting back, the quote character + following code point is
    converted directly, and any other of the 256 is emitted as a single byte.
    (The 257 code points could be private use.)
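
    The scheme above can be sketched in Python as follows. This is only an
    illustration, not anything ICU shipped: the private-use assignments
    (U+F700..U+F7FF for the 256 byte code points, U+F800 for the quote
    character) and the function names to_unicode / from_unicode are
    assumptions made for the example. It also assumes the quote character
    itself is quoted when it appears in the source, which the round trip
    needs. Python's "surrogateescape" error handler is used only as a
    convenient way to find the undecodable bytes.

```python
# Hypothetical private-use assignments; the original message names none.
ESCAPE_BASE = 0xF700   # U+F700..U+F7FF stand for raw bytes 0x00..0xFF
QUOTE = "\uF800"       # the single "quote character"


def is_reserved(ch: str) -> bool:
    """True for any of the 257 reserved code points."""
    return ch == QUOTE or ESCAPE_BASE <= ord(ch) <= ESCAPE_BASE + 0xFF


def to_unicode(data: bytes) -> str:
    """Convert possibly mangled UTF-8 to text without losing any byte."""
    out = []
    # surrogateescape turns each undecodable byte 0xNN into the lone
    # surrogate U+DCNN, which strict UTF-8 decoding can never produce.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Invalid byte: map it to its reserved code point.
            out.append(chr(ESCAPE_BASE + (cp - 0xDC00)))
        elif is_reserved(ch):
            # A reserved code point that really was in the source: quote it.
            out.append(QUOTE + ch)
        else:
            out.append(ch)
    return "".join(out)


def from_unicode(text: str) -> bytes:
    """Invert to_unicode, reproducing the original byte buffer."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch == QUOTE:
            quoted = next(chars, None)
            if quoted is None:
                raise ValueError("dangling quote character at end of text")
            out += quoted.encode("utf-8")      # quoted code point, literally
        elif is_reserved(ch):
            out.append(ord(ch) - ESCAPE_BASE)  # escaped raw byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    mangled = b"ok \xff caf\xc3\xa9 \xe2\x82 end"
    assert from_unicode(to_unicode(mangled)) == mangled
```

    With UTF-8 as the source only the upper 128 of the 256 byte code points
    are ever produced, since bytes below 0x80 are always valid on their own;
    the full range would matter for other source charsets.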

    This would round-trip all bytes in a buffer between any single charset X and
    Unicode. However, as soon as you get into a situation where you could be
    outputting the resulting Unicode to a different charset Y, then it looked
    like it started to break down. So it was little more than lunch


    On 9/6/07, Steve Bush wrote:
    > I read somewhere that there were some proposals to work out a lossless
    > scheme for round-tripping binary (i.e. all illegal UTF bytes/sequences) to UTF
    > and back again.
    > Can anyone point me in the direction of these efforts?
    > Steve Bush
    > NEOSYS Dubai.


    This archive was generated by hypermail 2.1.5 : Thu Sep 06 2007 - 14:38:05 CDT