Re: Undefined code positions in 8-bit character sets

From: Kenneth Whistler
Date: Mon May 05 2008 - 20:19:03 CDT

    Doug Ewell said:

    > >> On the other hand, Windows-1252 might be extended again and assign a
    > >> meaning to 0x90, so it is probably better not to map any Unicode
    > >> codepoint to that value.
    > >
    > > I disagree. If you do not map U+0090 to 0x90 for Windows-1252, all you
    > > are doing is ensuring an interoperability bug both with Windows and
    > > with other commercial applications doing conversions.
    >
    > If you are working in either ISO 8859-1 or Windows-1252, and encounter
    > the byte 0x90, you've got problems already. You might do well to ask
    > yourself whether your text is even in one of those encodings, or whether
    > it is mislabeled or a bad assumption was made.

    Sure, if you're working in the "wild", so to speak, dealing
    with conversions of mislabelled documents full of potential
    data corruptions, and having to make use of heuristics to
    determine what actual encodings are, and what is good data
    versus bad data.

    But that is another layer up from what I'm talking about.

    A basic ISO 8859-1 <--> Unicode converter shouldn't be
    stopping on an 0x90 byte, saying "hmmm, I wonder what this
    is all about?" and flagging some exception for potentially
    endless rumination by a heuristic algorithm before returning
    a conversion.

    You basically have two choices:

       0x90 --> U+0090

       0x90 --> U+FFFD

    and the first is what U+0090 was encoded for in the first place
    and is what most commercial converters do, as far as I know.
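
    To make the two choices concrete, here is a minimal Python sketch.
    It is only an illustration, not a description of any particular
    commercial converter. The function names are hypothetical. It
    relies on the fact that Python's built-in "cp1252" codec leaves
    0x90 unmapped, so the pass-through variant has to add the
    0x90 --> U+0090 mapping itself.

```python
def decode_latin1(data: bytes) -> str:
    # ISO 8859-1 maps every byte to the code point of the same value,
    # so 0x90 simply becomes U+0090.
    return data.decode("iso-8859-1")


def decode_cp1252_passthrough(data: bytes) -> str:
    # Hypothetical helper: choice 1 for Windows-1252.
    # Bytes the codec leaves undefined (such as 0x90) are mapped to the
    # C1 control of the same value instead of raising an error.
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(chr(b))
    return "".join(out)


def decode_cp1252_replace(data: bytes) -> str:
    # Choice 2: undefined bytes become U+FFFD REPLACEMENT CHARACTER.
    return data.decode("cp1252", errors="replace")
```

    Either function returns a result for every input byte; neither
    stops to ask what the 0x90 was doing there.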

    Then if you want to stop and ask, "Hey! What is this U+0090
    (or substituted U+FFFD) doing in my 8859-1 data?! I bet there
    is an error here I should check into!" well, that is a
    perfectly valid thing to do. But I think it is conceptually
    (and software architecturally) an epiphenomenon on the basic
    conversion definition.
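
    The layering can be sketched the same way: convert first, then run
    a separate check over the result. This is a rough illustration
    under my own assumptions; the helper name suspicious_positions is
    hypothetical, and treating every C1 control (U+0080..U+009F) plus
    U+FFFD as suspect is just one possible policy.

```python
def suspicious_positions(text: str) -> list:
    # Runs after conversion: report the indexes of C1 controls
    # (U+0080..U+009F) and U+FFFD so a higher layer can decide
    # whether the data was mislabeled or corrupted.
    return [
        i
        for i, ch in enumerate(text)
        if "\u0080" <= ch <= "\u009f" or ch == "\ufffd"
    ]


# The basic converter returns without complaint ...
converted = b"caf\xe9 \x90".decode("iso-8859-1")

# ... and the optional check is applied on top of its output.
flagged = suspicious_positions(converted)
```

    The converter itself stays a plain table lookup; any rumination
    about what the 0x90 means lives in the layer above it.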

