This message provides a brief description of how the GB2312 encoding
(really EUC-CN, GB2312 is properly a character set, not an encoding)
works, including how to convert between row-cell and hex notation, and
what an octet stream looks like when it contains GB2312 code points.
By way of exposition, I'll use the simplified characters for
Zhong1guo2 (China), U+4E2D U+56FD.
The GB2312 hex values for these characters are 0x5650 0x397A. To
convert these to row-cell, subtract 0x2020 from each and convert each
byte to decimal:
    GB2312
    Hex Value     0x5650     0x397A
                - 0x2020   - 0x2020
                --------   --------
                  0x3630     0x195A
    Row-Cell       54-48      25-90
So the row-cell values for these characters are 54-48 and 25-90.
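The subtraction above can be sketched in a few lines of Python. This
is only an illustration of the arithmetic; the function names are my
own and not part of any library, and the range check assumes the usual
94x94 layout (each byte between 0x21 and 0x7E).

```python
def gb_hex_to_row_cell(code: int) -> tuple:
    """Convert a two-byte GB2312 hex value (e.g. 0x5650) to (row, cell)."""
    high, low = code >> 8, code & 0xFF
    # Assumption: both bytes lie in the 7-bit graphic range 0x21..0x7E.
    if not (0x21 <= high <= 0x7E and 0x21 <= low <= 0x7E):
        raise ValueError(f"not a GB2312 hex value: {code:#06x}")
    # Subtracting 0x20 from each byte is the same as subtracting 0x2020.
    return high - 0x20, low - 0x20


def row_cell_to_gb_hex(row: int, cell: int) -> int:
    """Inverse conversion: (row, cell) back to the two-byte hex value."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError(f"row/cell out of range: {row}-{cell}")
    return ((row + 0x20) << 8) | (cell + 0x20)


print(gb_hex_to_row_cell(0x5650))        # (54, 48)
print(gb_hex_to_row_cell(0x397A))        # (25, 90)
print(hex(row_cell_to_gb_hex(54, 48)))   # 0x5650
```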
In a text stream, GB2312 is encoded using an 8-bit encoding,
EUC-CN. Since GB2312 code values are 7-bit, the high bit of each byte
is set to differentiate the Chinese characters from ASCII, making them
8-bit. To accomplish this, you add 0x80 to each byte of the hex value,
or 0xA0 to each byte of the row-cell value (which makes sense, since
each row-cell byte is 0x20 less than the corresponding hex byte, and
adding 0x80 to the hex byte creates the EUC-CN value). So:
    GB2312
    Hex Value     0x5650     0x397A
                + 0x8080   + 0x8080
                --------   --------
    EUC-CN        0xD6D0     0xB9FA
And indeed, if you create a GB2312-encoded file containing Zhong1guo2
and then look at the hex values, this is what you will see. RFC 1922
(which defines ISO-2022-CN) calls this CN-GB encoding.
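If you want to check this without a hex editor, here is a rough Python
sketch. It assumes Python's built-in "gb2312" codec, which produces
EUC-CN bytes; the helper names are made up for this example, and the
stream splitter assumes well-formed input containing only ASCII and
two-byte GB2312 characters.

```python
def gb_hex_to_euc_cn(code: int) -> bytes:
    """Set the high bit of both bytes of a GB2312 hex value."""
    # OR-ing with 0x8080 equals adding 0x8080 because both bytes are 7-bit.
    return (code | 0x8080).to_bytes(2, "big")


def split_euc_cn(data: bytes):
    """Yield ASCII bytes as ints and GB2312 characters as (row, cell)."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # High bit clear: a plain ASCII octet.
            yield data[i]
            i += 1
        else:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            # High bit set: two octets, each 0xA0 above its row/cell value.
            yield (data[i] - 0xA0, data[i + 1] - 0xA0)
            i += 2


zhongguo = b"".join(gb_hex_to_euc_cn(c) for c in (0x5650, 0x397A))
print(zhongguo.hex())                                # d6d0b9fa
# Cross-check against the codec, using escapes for the two characters.
print(zhongguo == "\u4e2d\u56fd".encode("gb2312"))   # True
print(list(split_euc_cn(b"A" + zhongguo)))           # [65, (54, 48), (25, 90)]
```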
I know this is confusing, but hopefully this has helped a bit.
-- 
Tom Emerson    Basis Technology Corp.
Zenkaku Language Hacker    http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"