Re: Undefined code positions in 8-bit character sets

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon May 05 2008 - 21:59:10 CDT


    For encoding converters, in ICU we always go by what people
    actually do, not by what they say. Otherwise you run into trouble
    and end up with incompatibilities. Windows itself maps 0x90 to
    U+0090.
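
    To make the two choices quoted below concrete, here is a minimal
    Python sketch. It is not ICU code. The helper names and the
    UNDEFINED_CP1252 table are illustrative only; the five bytes listed
    are the ones Python's built-in "cp1252" codec leaves unmapped.

```python
# Illustrative sketch only: two policies for bytes that Windows-1252 leaves
# undefined. The names below are hypothetical, not part of ICU or Python.

# Assumption: these are the bytes Python's "cp1252" codec refuses to decode.
UNDEFINED_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def decode_c1_passthrough(data: bytes) -> str:
    """Policy 1: map each undefined byte to the C1 control with the same
    number (0x90 -> U+0090); decode every other byte as Windows-1252."""
    chars = []
    for byte in data:
        if byte in UNDEFINED_CP1252:
            chars.append(chr(byte))
        else:
            chars.append(bytes([byte]).decode("cp1252"))
    return "".join(chars)


def decode_replace(data: bytes) -> str:
    """Policy 2: map each undefined byte to U+FFFD REPLACEMENT CHARACTER."""
    return data.decode("cp1252", errors="replace")


def encode_c1_passthrough(text: str) -> bytes:
    """Inverse of policy 1, so that 0x90 <-> U+0090 round-trips."""
    out = bytearray()
    for ch in text:
        if ord(ch) in UNDEFINED_CP1252:
            out.append(ord(ch))
        else:
            out.extend(ch.encode("cp1252"))
    return bytes(out)
```

    With the first policy the byte survives a round trip through
    Unicode unchanged; with the second, the original byte value is
    lost once it has become U+FFFD.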

    Mark

    On Mon, May 5, 2008 at 7:02 PM, David Starner <prosfilaes@gmail.com> wrote:
    > On Mon, May 5, 2008 at 9:19 PM, Kenneth Whistler <kenw@sybase.com> wrote:
    > > A basic ISO 8859-1 <--> Unicode converter shouldn't be
    > > stopping on an 0x90 byte, saying "hmmm, I wonder what this
    > > is all about?" and flagging some exception for potentially
    > > endless rumination by a heuristic algorithm before returning
    > > a conversion.
    > >
    > > You basically have two choices:
    > >
    > > 0x90 --> U+0090
    > >
    > > or
    > >
    > > 0x90 --> U+FFFD
    > >
    > > and the first is what U+0090 was encoded for in the first place
    > > and is what most commercial converters do, as far as I know.
    >
    > I don't disagree with that. But there's a difference between ISO
    > 8859-1, which has a space between 0x80 and 0x9F basically for the C1
    > controls, and Windows-1252, which has a collection of varied
    > characters in that range. In Windows-1252, the spaces clearly aren't
    > left open for C1 controls and are unusable as such; U+0090, when used
    > as a C1 control, demands that the data following be terminated by a
    > U+009C, which isn't in Windows-1252!
    >
    > Worse, to convert U+0090 to 0x90 is as wrong as converting 0x90 to
    > U+0620; it's undefined what 0x90 means in Windows-1252, and what
    > U+0090 does mean couldn't possibly fit into the Windows-1252 character
    > set. To convert from Windows-1252 0x90 <-> U+0090 doesn't preserve the
    > semantics of that codepoint in either character set.

