Re: ANSI and Unicode for x00 - xFF

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 26 2005 - 13:09:04 CST

    > Are the 7 bit ASCII characters a subset of the 8 bit ANSI character?

    Yes. But there is a problem in referring to "ANSI" as if it were
    a character set. "ANSI" is Windows-ese for an 8-bit Windows code
    page based on ASCII, and usually, specifically, Code Page 1252.
    There are differences between Code Page 1252 and ISO/IEC 8859-1 in
    the range 0x80..0x9F that tend to lead people astray.
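
    A quick way to see the difference, sketched in Python. It assumes
    the standard library codecs "cp1252" and "latin-1" (Python's name
    for ISO/IEC 8859-1) stand in for the two character sets; the three
    sample bytes are chosen only for illustration.

```python
# Three bytes from the 0x80..0x9F range, where the two encodings disagree.
raw = bytes([0x80, 0x95, 0x9F])

as_cp1252 = raw.decode("cp1252")    # Windows "ANSI" interpretation
as_latin1 = raw.decode("latin-1")   # ISO/IEC 8859-1 interpretation

def code_points(text):
    """Format each character of text as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(code_points(as_cp1252))  # ['U+20AC', 'U+2022', 'U+0178']
print(code_points(as_latin1))  # ['U+0080', 'U+0095', 'U+009F']
```

    The same bytes come out as printable characters under CP 1252 and
    as C1 control codes under 8859-1.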

    > I understand that the 7 bit ASCII characters are definitely a
    > subset of the UTF-8 set but am not sure if ANSI is a subset of UTF-8.

    Yes and no. And that is part of why you are confused.

    The repertoire of ISO/IEC 8859-1 is a strict subset of the
    repertoire of the Unicode Standard. It also lines up code
    point by code point, so that the numerical values of the
    code points for 8859-1 are identical to the numerical values
    for the corresponding characters in the Unicode Standard.
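
    That one-to-one alignment can be checked mechanically. A minimal
    sketch, assuming Python's "latin-1" codec is a faithful
    implementation of ISO/IEC 8859-1:

```python
# Decode every possible byte value as ISO/IEC 8859-1 and confirm that the
# resulting Unicode code point has the same numerical value as the byte.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

lines_up = all(ord(ch) == byte for ch, byte in zip(decoded, all_bytes))
print(lines_up)  # True
```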

    The repertoire of "ANSI" (i.e. Windows Code Page 1252) is
    also a strict subset of the repertoire of the Unicode Standard.
    But the numerical values for CP 1252 characters in the range
    0x80..0x9F don't line up directly against the Unicode Standard,
    and have to be mapped one-by-one, instead.
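
    The bytes that need individual mapping can be listed with a short
    Python sketch. The helper name cp1252_mismatches is made up for
    this example, and the result reflects Python's "cp1252" codec,
    which treats a few bytes (0x81 among them) as undefined; other
    implementations may map those bytes differently.

```python
def cp1252_mismatches():
    """Map each CP 1252 byte that does not decode to the same-valued
    Unicode code point to its actual code point (None if undefined)."""
    result = {}
    for byte in range(0x100):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            result[byte] = None        # no character assigned in this codec
            continue
        if ord(ch) != byte:
            result[byte] = ord(ch)     # assigned, but to a different value
    return result

mismatches = cp1252_mismatches()
for byte, cp in sorted(mismatches.items()):
    target = "NOTDEF" if cp is None else f"U+{cp:04X}"
    print(f"0x{byte:02X} -> {target}")
```

    Every entry printed falls in 0x80..0x9F; bytes outside that range
    decode to the code point with the same value.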

    Third issue: UTF-8 is an encoding *form* of the Unicode Standard.
    The actual values of the code units that result from using
    UTF-8 as an encoding form don't line up identically to
    ISO/IEC 8859-1 in the range 0x80..0xFF. Instead, the UTF-8
    encoded values take two bytes for characters encoded in that
    range. Thus the *encoded characters* are all the same, but
    the actual bytes used for the encoding are different.
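
    To make the byte-level difference concrete, here is a small Python
    sketch comparing the two encodings of the same characters. The
    helper name byte_lengths is hypothetical, introduced only for this
    example.

```python
def byte_lengths(code_point):
    """Return (bytes needed in ISO/IEC 8859-1, bytes needed in UTF-8)."""
    ch = chr(code_point)
    return len(ch.encode("latin-1")), len(ch.encode("utf-8"))

for cp in (0x7F, 0x80, 0xA0, 0xFF):
    latin1_len, utf8_len = byte_lengths(cp)
    utf8_hex = " ".join(f"0x{b:02X}" for b in chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: 8859-1 {latin1_len} byte, UTF-8 {utf8_len} byte(s): {utf8_hex}")
```

    Below U+0080 both encodings use the same single byte; from U+0080
    through U+00FF, 8859-1 still uses one byte while UTF-8 uses two.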

    Here is the summary, laid out in terms of a table of
    mappings.

    Unicode   8859-1   CP 1252   UTF-8

    U+0000    0x00     0x00      0x00
    U+0001    0x01     0x01      0x01
    ...
    U+007E    0x7E     0x7E      0x7E
    U+007F    0x7F     0x7F      0x7F
    U+0080    0x80     NOTDEF    0xC2 0x80
    U+0081    0x81     0x81      0xC2 0x81
    U+0082    0x82     NOTDEF    0xC2 0x82
    ...
    U+009F    0x9F     NOTDEF    0xC2 0x9F
    U+00A0    0xA0     0xA0      0xC2 0xA0
    ...
    U+00FF    0xFF     0xFF      0xC3 0xBF

    And for some of the characters in CP 1252 outside the
    range of U+0000..U+00FF:

    U+20AC    NOTDEF   0x80      0xE2 0x82 0xAC  (EURO SIGN)
    ...
    U+2022    NOTDEF   0x95      0xE2 0x80 0xA2  (BULLET)
    ...
    U+0178    NOTDEF   0x9F      0xC5 0xB8       (LATIN CAPITAL LETTER Y WITH DIAERESIS)

    etc.
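
    Rows like these can be regenerated for any code point. The Python
    sketch below uses hypothetical helper names (hex_bytes,
    encode_or_notdef, table_row) and relies on Python's own "latin-1"
    and "cp1252" codecs, so a byte that Python leaves undefined in
    CP 1252 (0x81, for instance) will show as NOTDEF here.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated 0xNN values."""
    return " ".join(f"0x{b:02X}" for b in data)

def encode_or_notdef(ch, codec):
    """Encode ch with codec, or report NOTDEF if the codec lacks it."""
    try:
        return hex_bytes(ch.encode(codec))
    except UnicodeEncodeError:
        return "NOTDEF"

def table_row(code_point):
    """Build one row: Unicode, 8859-1, CP 1252, UTF-8."""
    ch = chr(code_point)
    return (
        f"U+{code_point:04X}",
        encode_or_notdef(ch, "latin-1"),
        encode_or_notdef(ch, "cp1252"),
        hex_bytes(ch.encode("utf-8")),
    )

for cp in (0x0000, 0x007F, 0x0080, 0x00A0, 0x00FF, 0x20AC, 0x2022, 0x0178):
    print("{:<8} {:<8} {:<8} {}".format(*table_row(cp)))
```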

    Take a look at the FAQ:

    http://www.unicode.org/faq/utf_bom.html

    That is a good place to get started on issues related to understanding
    UTF-8 (as well as UTF-16 and UTF-32).

    --Ken


