Big-5+HKSCS => GBK mapping

From: Ken Krugler
Date: Tue Apr 05 2005 - 16:09:49 CST

    I'm trying to generate a fairly complete mapping between these two
    legacy encodings, where fuzzy equivalence is OK (and preferable to no
    mapping).

    I've been using various .ucm files from ICU, as well as the
    Unihan.txt file (for Simplified & Traditional variants).
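
A minimal sketch of this kind of pipeline, for illustration only: it pivots through Unicode and falls back to variant characters when the direct mapping fails. It assumes Python's built-in `big5hkscs` and `gbk` codecs as stand-ins for the ICU .ucm tables (their repertoires may differ from the tables used here), and variant data in the tab-separated `U+XXXX<TAB>field<TAB>U+YYYY ...` layout of the Unihan files. The names `load_variants`, `map_char` and `big5hkscs_to_gbk` are hypothetical.

```python
import re

def load_variants(lines, fields=("kSimplifiedVariant", "kTraditionalVariant")):
    """Build {char: [variant chars]} from Unihan-style tab-separated lines."""
    variants = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3 or parts[1] not in fields:
            continue
        src = chr(int(parts[0][2:], 16))  # drop the leading "U+"
        targets = [chr(int(h, 16))
                   for h in re.findall(r"U\+([0-9A-Fa-f]{4,6})", parts[2])]
        variants.setdefault(src, []).extend(targets)
    return variants

def map_char(ch, variants):
    """Return GBK bytes for ch, trying ch itself and then its variants."""
    for candidate in [ch] + variants.get(ch, []):
        try:
            return candidate.encode("gbk")
        except UnicodeEncodeError:
            continue  # not in GBK; try the next fuzzy equivalent
    return None  # unmapped

def big5hkscs_to_gbk(raw, variants):
    """Map one Big-5+HKSCS byte sequence to GBK bytes, or None if unmapped."""
    return map_char(raw.decode("big5hkscs"), variants)
```

Any mapping found through the variant fallback is a fuzzy equivalence rather than an exact one, so it would still need hand verification.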

    This has worked reasonably well for GBK->Big-5+HKSCS, as expected.
    Out of the 7601 characters in GBK that I've got glyph data for, only
    268 can't be mapped. I could whittle this down a bit by using
    mappings suggested by the cross reference data found in
    NamesList.txt, though each would have to be hand-verified.

    For Big-5+HKSCS->GBK, the situation isn't so great. Out of the 18275
    characters in Big-5+HKSCS that I've got glyph data for, 2162 can't be
    mapped. Most of these (1598) are HKSCS characters that map to U+2xxxx
    code points.
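
A breakdown like this one can be tallied by Unicode plane. The sketch below is an illustration under assumptions: it again uses Python's built-in `gbk` codec in place of the ICU tables, and the function name `classify_unmapped` is hypothetical. Plane 2 corresponds to the U+2xxxx code points.

```python
from collections import Counter

def classify_unmapped(chars):
    """Count strings that GBK cannot encode, keyed by Unicode plane."""
    planes = Counter()
    for ch in chars:
        try:
            ch.encode("gbk")
        except UnicodeEncodeError:
            # Plane number = code point >> 16 (0 = BMP, 2 = U+2xxxx)
            planes[ord(ch[0]) >> 16] += 1
    return planes
```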

    So does anybody know of such a mapping table that already exists, or
    have a suggestion for how to fuzzily resolve a significant number of
    the remaining unmapped HKSCS characters? I'm pretty sure somebody
    else has wrestled with this same problem.

    And yes, I realize this is a bit like trying to park a Cadillac in a closet :)


    -- Ken

    Ken Krugler
    TransPac Software, Inc.
    +1 530-470-9200

