RE: Code charts

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Apr 10 2001 - 03:41:27 EDT


Tomás McGuinness wrote:
> I am working on a project that involves converting WML and HTML
> documents from a character set to UCS-2. The problem is that
> the UCS-2 hex representation for say 0x003C (<) is not present
> in GB2312 [the same glyph I mean].

Notice that some characters normally have two versions in GB
and other Far East encodings.

The first one is the "full width" version (a.k.a. "zenkaku"), which
is encoded as two bytes and whose glyph is two cells wide in
fixed-width fonts.

The full width "<" character in GB is the one that you see in
<http://www.unicode.org/Public/MAPPINGS/EASTASIA/GB/GB12345.TXT>:

        GB Unicode Name
        0x233C 0xFF1C # FULLWIDTH LESS-THAN SIGN

0x233C is a shorthand for 0x23,0x3C ("row" 0x23, "column"
0x3C). These are only "logical" codes that may be byte-serialized in
various ways. E.g., the actual bytes that you would find in an
EUC-GB file would be 0xA3,0xBC (i.e., the "row,column" values +
0x80).
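The "+ 0x80" arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name gb_to_euc is made up for illustration, and it assumes that Python's built-in "gb2312" codec is the EUC serialization described here.

```python
def gb_to_euc(code: int) -> bytes:
    """Serialize a logical two-byte GB "row,column" code as EUC bytes.

    Hypothetical helper: each of the two values simply gets 0x80 added.
    """
    row, col = code >> 8, code & 0xFF
    return bytes([row + 0x80, col + 0x80])


euc = gb_to_euc(0x233C)
print(euc.hex())              # a3bc

# Assumption: Python's "gb2312" codec decodes this EUC byte form.
char = euc.decode("gb2312")
print(hex(ord(char)))         # 0xff1c (FULLWIDTH LESS-THAN SIGN)
```

The decoded value matches the 0xFF1C entry in the mapping table quoted above.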

The second one is the "half width" version (a.k.a. "hankaku"). It
is encoded as one single byte and its glyph is one cell wide in
fixed-width fonts.

Hankaku characters in GB always have the same value as in
ASCII, which is why these codes are normally not included in
conversion tables.
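For the conversion in the original question, a minimal sketch in Python might look like the following. The function name gb_to_ucs2 is hypothetical, and two assumptions are made: the input is EUC-encoded (as handled by Python's "gb2312" codec), and "utf-16-be" can stand in for big-endian UCS-2, which holds as long as every character is in the Basic Multilingual Plane.

```python
def gb_to_ucs2(data: bytes) -> bytes:
    """Decode EUC-encoded GB 2312 bytes and re-encode as big-endian UCS-2.

    Hypothetical helper; "utf-16-be" is used as a stand-in for UCS-2,
    which is equivalent for characters in the Basic Multilingual Plane.
    """
    return data.decode("gb2312").encode("utf-16-be")


half = gb_to_ucs2(b"<")         # single byte 0x3C, same value as ASCII
full = gb_to_ucs2(b"\xa3\xbc")  # two-byte full width form

print(half.hex())  # 003c
print(full.hex())  # ff1c
```

So a half width "<" in the source comes out as 0x003C without needing any table entry, while the full width one comes out as 0xFF1C.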

_ Marco


