Re: Undefined code positions in 8-bit character sets

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon May 05 2008 - 21:59:10 CDT

Next message: Otto Stolz: "(OT) The Importance of Getting the Casing Right (was: Freedom to Normalise)"

Previous message: David Starner: "Re: Undefined code positions in 8-bit character sets"
In reply to: David Starner: "Re: Undefined code positions in 8-bit character sets"
Next in thread: Andreas Prilop: "Re: Undefined code positions in 8-bit character sets"

For encoding converters, in ICU we always go by what people actually
do, and not what they say. Otherwise you always get into trouble and
end up with incompatibilities. Windows itself maps the Windows-1252
byte 0x90 to U+0090.

Mark
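
To make the two choices discussed in this thread concrete, here is a minimal Python sketch. It is not ICU code and not the Windows converter; it only uses Python's built-in `latin-1` and `cp1252` codecs, and the helper name `decode_cp1252_c1_fallback` is hypothetical. The fallback imitates the "0x90 --> U+0090" behaviour described above by mapping any byte that Python's `cp1252` codec leaves undefined to the code point with the same number.

```python
def decode_cp1252_c1_fallback(data: bytes) -> str:
    """Decode bytes as Windows-1252, sending undefined bytes to the
    same-numbered C1 code point (hypothetical helper, for illustration)."""
    out = []
    for b in data:
        try:
            # Defined positions decode normally (e.g. 0x80 -> U+20AC).
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined positions such as 0x90 fall back to U+0090 etc.
            out.append(chr(b))
    return "".join(out)


# ISO 8859-1 maps every byte to the code point with the same value.
latin1_result = bytes([0x90]).decode("latin-1")

# The other choice: substitute U+FFFD for the undefined byte.
replaced_result = bytes([0x90]).decode("cp1252", errors="replace")

# The "what people do" choice: pass the byte through as U+0090.
fallback_result = decode_cp1252_c1_fallback(bytes([0x41, 0x90, 0x80]))
```

Python's strict `cp1252` codec takes neither route by default and raises `UnicodeDecodeError` on 0x90, which is the kind of stop-and-flag behaviour the quoted message argues a basic converter should avoid.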

On Mon, May 5, 2008 at 7:02 PM, David Starner <prosfilaes@gmail.com> wrote:
> On Mon, May 5, 2008 at 9:19 PM, Kenneth Whistler <kenw@sybase.com> wrote:
> > A basic ISO 8859-1 <--> Unicode converter shouldn't be
> > stopping on an 0x90 byte, saying "hmmm, I wonder what this
> > is all about?" and flagging some exception for potentially
> > endless rumination by a heuristic algorithm before returning
> > a conversion.
> >
> > You basically have two choices:
> >
> > 0x90 --> U+0090
> >
> > or
> >
> > 0x90 --> U+FFFD
> >
> > and the first is what U+0090 was encoded for in the first place
> > and is what most commercial converters do, as far as I know.
>
> I don't disagree with that. But there's a difference between ISO
> 8859-1, which has a space between 0x80 and 0x9F basically for the C1
> controls, and Windows-1252, which has a collection of varied
> characters in that range. In Windows-1252, the spaces clearly aren't
> left open for C1 controls and are unusable as such; U+0090, when used
> as a C1 control, demands that the data following be terminated by a
> U+009C, which isn't in Windows-1252!
>
> Worse, to convert U+0090 to 0x90 is as wrong as converting 0x90 to
> U+0620; it's undefined what 0x90 means in Windows-1252, and what
> U+0090 does mean couldn't possibly fit into the Windows-1252 character
> set. To convert from Windows-1252 0x90 <-> U+0090 doesn't preserve the
> semantics of that codepoint in either character set.
>
>

-- 
Mark
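
The round-trip concern raised in the quoted message can be checked against one concrete implementation. The sketch below assumes Python's built-in `cp1252` codec, which is only one converter among many and is stricter than the Windows behaviour described above; the helper name `undefined_cp1252_bytes` is hypothetical. It lists the bytes in the 0x80..0x9F range that the codec leaves undefined and shows that, under the same codec, U+0090 has no Windows-1252 encoding.

```python
def undefined_cp1252_bytes():
    """Return the bytes in 0x80..0x9F that Python's cp1252 codec
    refuses to decode (hypothetical helper, for illustration)."""
    undefined = []
    for b in range(0x80, 0xA0):
        try:
            bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            undefined.append(b)
    return undefined


holes = undefined_cp1252_bytes()

# Going the other way: a strict encoder has no byte for U+0090.
try:
    "\u0090".encode("cp1252")
    u0090_encodable = False if False else True
except UnicodeEncodeError:
    u0090_encodable = False
```

A converter that follows the pass-through convention would instead pair each of these undefined bytes with the same-numbered C1 code point in both directions, so the bytes survive a round trip even though, as the quoted message notes, the C1 semantics do not carry over.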

