Re: Undefined code positions in 8-bit character sets

From: Kenneth Whistler
Date: Mon May 05 2008 - 14:29:32 CDT

    > Andreas Prilop wrote on Monday, May 05, 2008 4:30 PM
    > >I refer to
    > >
    > >
    > >
    > > In ISO-8859-1, code position 0x90 is mapped to U+0090.
    > > In Windows-1252, code position 0x90 is listed as "undefined".
    > >
    > > Why are they treated differently?

    Different theory by the maintainers of the two sets of files.

    I am the most recent maintainer of record for the 8859-X mapping
    files posted on the Unicode website. For those I follow the
    consensus of the UTC that mappings for control code points
    in the 8859-X family of ASCII-derived encodings to/from Unicode
    are least problematical if 0x00 <--> U+0000, 0x01 <--> U+0001,
    etc. This is, in fact, the way that almost all commercial
    conversions handle the control code conversions for 8859-X
    character sets.
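
As a rough illustration of the pass-through convention described above, the following Python sketch checks whether a codec maps every C0 and C1 control byte to the Unicode code point with the same numeric value. The helper name `controls_pass_through` and the `CONTROL_BYTES` list are made up for this example; the only assumption about Python itself is that its built-in 8859-X codecs follow the same convention as the posted mapping files.

```python
# C0 controls (0x00..0x1F) and C1 controls (0x80..0x9F).
CONTROL_BYTES = list(range(0x00, 0x20)) + list(range(0x80, 0xA0))

def controls_pass_through(codec_name: str) -> bool:
    """Return True if every control byte decodes to the code point
    with the same numeric value (0x90 -> U+0090, and so on)."""
    return all(bytes([b]).decode(codec_name) == chr(b) for b in CONTROL_BYTES)

if __name__ == "__main__":
    for name in ("iso-8859-1", "iso-8859-2", "iso-8859-15"):
        print(name, controls_pass_through(name))
```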

    Since 8859-1.TXT and the other mapping tables posted on the
    Unicode website are intended to provide practical *mapping*
    guidelines for implementers, it would be pedantic in
    the extreme (and counterproductive) to post them up as
    documentation of the 8859-X standards *without* the control
    code mappings.

    The Microsoft mapping tables are contributed by and maintained
    by Microsoft, and follow Microsoft standards practice for
    table definition. 0x00..0x1F are mapped through to U+0000..U+001F,
    but because most Microsoft code pages contain graphic characters
    in the 0x80..0x9F range, those characters are mapped, but
    unassigned code points are simply left #UNDEFINED, as is
    also the case for Microsoft double-byte code pages. This allows
    a distinction to be made between that status and #DBCS LEAD BYTE

    In practice, of course, when actually implementing conversion
    tables from Microsoft code pages to/from Unicode, nearly all
    commercial implementations, including Microsoft's, map undefined
    values in the 0x80..0x9F range (for non-DBCS code pages) to
    the corresponding Unicode U+0080..U+009F control code character,
    rather than to U+FFFD.
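
The practice described above can be sketched in Python, whose built-in `cp1252` codec happens to follow the posted table strictly and rejects 0x90. The sketch registers a decode error handler that maps each undefined byte to the control code point with the same value instead of raising or substituting U+FFFD. The handler name `c1_passthrough` and the wrapper `decode_cp1252_lenient` are hypothetical names chosen for this example, and this is one possible way to get the behaviour, not a description of how any commercial converter is built.

```python
import codecs

def c1_passthrough(exc):
    """Error handler: decode each undefined byte as the code point
    with the same numeric value (0x90 -> U+0090)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only covers the decoding direction
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(b) for b in bad), exc.end

codecs.register_error("c1_passthrough", c1_passthrough)

def decode_cp1252_lenient(data: bytes) -> str:
    # Defined positions decode as usual; undefined ones pass through.
    return data.decode("cp1252", errors="c1_passthrough")

if __name__ == "__main__":
    print(repr(decode_cp1252_lenient(b"\x80 \x90")))
```

Positions that the table does define, such as 0x80, still decode to their graphic characters; only the undefined ones fall through to the handler.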

    > > International Standard ISO/IEC 8859-1 does *not* define
    > > code position 0x90. So it might also be listed as "undefined".
    > 0x90 is defined in the IANA version of ISO-8859-1, which calls up the
    > description in RFC1345. In a web context, I believe the IANA definition
    > should take precedence over ISO/IEC.

    While I agree with the conclusion that, for web usage, using mappings
    that map through control codes rather than treating them as undefined
    is the correct thing to do, I do so for different reasons.

    RFC 1345 is *extremely* dated. It is from 1992, and refers to
    prepublication versions of 10646. The first edition of 10646
    wasn't even published until 1993, and at that point we are
    talking about a Unicode 1.1-level repertoire. The character
    mnemonic table in RFC 1345 is full of errors, and the mapping
    tables for various charsets at the end of RFC 1345 have not
    been updated to track the updates of the 8859 standards nor
    the updates in mapping practice for some charsets that resulted
    from extensions to 10646.

    > On the other hand, Windows-1252 might be extended again and assign a meaning
    > to 0x90, so it is probably better not to map any Unicode codepoint to that
    > value.

    I disagree. If you do not map U+0090 to 0x90 for Windows-1252, all
    you are doing is ensuring an interoperability bug both
    with Windows and with other commercial applications doing


    > > Or, for purely practical reasons, 0x90 in Windows-1252 might
    > > also be mapped to U+0090.
    > Which is reported to be what Windows *currently* actually does.
    > Richard.
