Re: ANSI and Unicode for x00 - xFF

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 26 2005 - 13:09:04 CST

Maybe in reply to: Velasquez, Carlos: "ANSI and Unicode for x00 - xFF"

> Are the 7 bit ASCII characters a subset of the 8 bit ANSI character?

Yes. But there is a problem in referring to "ANSI" as if it were
a character set. "ANSI" is Windows-ese for an 8-bit Windows code
page based on ASCII, most often Code Page 1252 specifically.
Code Page 1252 and ISO/IEC 8859-1 differ in the range 0x80..0x9F,
and that difference tends to lead people astray.

> I understand that the 7 bit ASCII characters are definitely a
> subset of the UTF-8 set but am not sure if ANSI is a subset of UTF-8.

Yes and no. And that is part of why you are confused.

The repertoire of ISO/IEC 8859-1 is a strict subset of the
repertoire of the Unicode Standard. It also lines up code
point by code point, so that the numerical values of the
code points for 8859-1 are identical to the numerical values
for the corresponding characters in the Unicode Standard.
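
A minimal sketch of that code-point-by-code-point identity, assuming
Python's built-in "latin-1" codec is a faithful stand-in for ISO/IEC 8859-1
(it passes the C0 and C1 control ranges straight through). The variable
names are illustrative only:

```python
# Decode every possible byte value as ISO/IEC 8859-1 ("latin-1" in Python).
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each byte value 0xNN should come back as the character U+00NN.
identity_holds = all(ord(ch) == byte for ch, byte in zip(decoded, all_bytes))
print(identity_holds)
```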

The repertoire of "ANSI" (i.e. Windows Code Page 1252) is
also a strict subset of the repertoire of the Unicode Standard.
But the numerical values for CP 1252 characters in the range
0x80..0x9F don't line up directly against the Unicode Standard,
and have to be mapped one-by-one, instead.
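
The one-by-one mapping can be inspected with a short sketch like the one
below. It assumes Python's built-in "cp1252" codec; the helper name
cp1252_code_point is made up for this example. Note that implementations
differ on the five unassigned byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D):
Python's codec rejects them, which this sketch reports as NOTDEF.

```python
def cp1252_code_point(byte_value):
    """Return the Unicode code point for one CP 1252 byte, or None if
    Python's codec treats that byte as undefined."""
    try:
        return ord(bytes([byte_value]).decode("cp1252"))
    except UnicodeDecodeError:
        return None

# Walk the range where CP 1252 and ISO/IEC 8859-1 disagree.
for byte_value in range(0x80, 0xA0):
    latin1 = ord(bytes([byte_value]).decode("latin-1"))
    cp1252 = cp1252_code_point(byte_value)
    cp1252_text = "NOTDEF" if cp1252 is None else f"U+{cp1252:04X}"
    print(f"0x{byte_value:02X}  8859-1: U+{latin1:04X}  CP 1252: {cp1252_text}")
```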

Third issue: UTF-8 is an encoding *form* of the Unicode Standard.
The actual values of the code units that result from using
UTF-8 as an encoding form don't line up identically to
ISO/IEC 8859-1 in the range 0x80..0xFF. Instead, the UTF-8
encoded values take two bytes for characters encoded in that
range. Thus the *encoded characters* are all the same, but
the actual bytes used for the encoding are different.
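
To see "same characters, different bytes" concretely, here is a small
sketch using Python's built-in "latin-1" and "utf-8" codecs. The helper
hex_bytes and the sample code points are chosen only for illustration:

```python
def hex_bytes(raw):
    """Format a bytes object as space-separated 0xNN values."""
    return " ".join(f"0x{b:02X}" for b in raw)

# Encode the same characters under both encodings and compare the bytes.
for code_point in (0x7F, 0x80, 0xA0, 0xFF):
    ch = chr(code_point)
    print(f"U+{code_point:04X}  8859-1: {hex_bytes(ch.encode('latin-1'))}"
          f"  UTF-8: {hex_bytes(ch.encode('utf-8'))}")

# Every character in U+0080..U+00FF needs exactly two bytes in UTF-8.
two_byte_range = all(len(chr(cp).encode("utf-8")) == 2
                     for cp in range(0x80, 0x100))
print(two_byte_range)
```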

Here is the summary, laid out in terms of a table of
mappings.

Unicode   8859-1   CP 1252   UTF-8

U+0000    0x00     0x00      0x00
U+0001    0x01     0x01      0x01
...
U+007E    0x7E     0x7E      0x7E
U+007F    0x7F     0x7F      0x7F
U+0080    0x80     NOTDEF    0xC2 0x80
U+0081    0x81     0x81      0xC2 0x81
U+0082    0x82     NOTDEF    0xC2 0x82
...
U+009F    0x9F     NOTDEF    0xC2 0x9F
U+00A0    0xA0     0xA0      0xC2 0xA0
...
U+00FF    0xFF     0xFF      0xC3 0xBF

And for some of the characters in CP 1252 outside the
range of U+0000..U+00FF:

U+20AC    NOTDEF   0x80      0xE2 0x82 0xAC  (EURO SIGN)
...
U+2022    NOTDEF   0x95      0xE2 0x80 0xA2  (BULLET)
...
U+0178    NOTDEF   0x9F      0xC5 0xB8       (LATIN CAPITAL LETTER Y WITH DIAERESIS)

etc.
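
Rows like these can be regenerated for any code point with a sketch along
the following lines. It relies on Python's built-in codecs; the helpers
encoded_bytes and table_row are hypothetical names for this example. One
caveat: Python's "cp1252" codec treats the unassigned byte 0x81 as
undefined, so for U+0081 it would report NOTDEF where some Windows
mappings pass the value through.

```python
def encoded_bytes(ch, encoding):
    """Return ch's bytes in the given encoding as text, or 'NOTDEF'
    when the encoding has no representation for the character."""
    try:
        raw = ch.encode(encoding)
    except UnicodeEncodeError:
        return "NOTDEF"
    return " ".join(f"0x{b:02X}" for b in raw)

def table_row(code_point):
    """Build one (Unicode, 8859-1, CP 1252, UTF-8) row."""
    ch = chr(code_point)
    return (f"U+{code_point:04X}",
            encoded_bytes(ch, "latin-1"),
            encoded_bytes(ch, "cp1252"),
            encoded_bytes(ch, "utf-8"))

for cp in (0x007F, 0x0080, 0x00A0, 0x00FF, 0x20AC, 0x2022, 0x0178):
    print("  ".join(table_row(cp)))
```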

Take a look at the FAQ:

http://www.unicode.org/faq/utf_bom.html

That is a good place to get started on issues related to understanding
UTF-8 (as well as UTF-16 and UTF-32).

--Ken

