Unihan Database

 


About the Unihan Database

As a handy reference, we provide here the contents of a database of information about the CJK Unified Ideographs (Unihan), which are part of the Unicode Standard.

This database originated with data provided by Research Libraries Group, Xerox, Taligent, and Apple. Chinese and Japanese compound data presented in the on-line database come from the on-line CEDICT and Jim Breen's EDICT projects. These additional data are not available in the text-file version.

There are also two indices for the database: a grid index, which groups the characters in blocks of 256, and a radical-stroke index. A search page is also available. Individual characters can be accessed through the index or via the "Lookup" button and text field above. Enter the four- or five-digit hexadecimal identifier for the character, and click "Lookup." You will be taken to an information page for the character. The "UTF-8" check-box controls whether ideographs are displayed as UTF-8 text or as embedded GIFs. The latter technique depends less on your browser and system support for Unicode but is slower.
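As a rough illustration of the lookup and grid index described above, the following Python sketch validates a four- or five-digit hexadecimal identifier and computes the block of 256 that contains it. The function names `parse_identifier` and `grid_block` are hypothetical and are not part of the Unihan site; the assumption that each grid block starts at a multiple of 256 is ours.

```python
HEX_DIGITS = set("0123456789ABCDEF")


def parse_identifier(text: str) -> int:
    """Convert a four- or five-digit hexadecimal identifier to a code point."""
    digits = text.strip().upper()
    if len(digits) not in (4, 5) or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"expected 4 or 5 hexadecimal digits, got {text!r}")
    return int(digits, 16)


def grid_block(code_point: int) -> tuple[int, int]:
    """Return the first and last code points of the 256-character block
    containing code_point (assumes blocks are aligned on multiples of 256)."""
    start = code_point & ~0xFF  # clear the low eight bits
    return start, start + 0xFF


cp = parse_identifier("4E2D")
first, last = grid_block(cp)
print(f"U+{cp:04X} falls in the block U+{first:04X}..U+{last:04X}")
# prints: U+4E2D falls in the block U+4E00..U+4EFF
```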

The on-line database includes the following information for individual characters. Each item is marked (N) or (I) to indicate whether it is a normative or informative part of the standard:

  • (I) Several different glyphs for the character
    • The "standard" glyph used in producing the Unicode Standard, Version 4.0.
    • A glyph provided by your UTF-8 savvy Web browser.
    • A glyph or glyphs designed for use with traditional Chinese.
    • A glyph or glyphs designed for use with simplified Chinese.
    • A glyph or glyphs designed for use with Japanese.
    • A glyph or glyphs designed for use with Korean.
  • (N) Different representations of the character's scalar value
    • A decimal value (for reference purposes)
    • The UTF-8 encoding form as a series of hexadecimal octets
    • The UTF-16 encoding form as a series of sixteen-bit words
    • The UTF-32 encoding form as a single 32-bit long word
  • (N) Mappings to the IRG sources for the character. An explanation of the sources and how the IRG unifies characters can be found in The Unicode Standard, Chapter 12 (PDF).
  • (I) Mappings to major industrial and national standards and other character collections
    • Chinese standards
      • GB 2312-1980
      • GB 12345-90
      • CNS 11643-1986
      • CNS 11643-1992
      • CCCII, level 1
      • Big Five (without the ETEN extensions)
      • HK SCS
    • Japanese standards
      • JIS X 0208-1990
      • JIS X 0212-1990
      • JIS X 0213-2000
    • Korean standards
      • KS X 1001:1992
      • KS X 1002:1991
    • Other standards or collections
      • ANSI Z39.64-1989 (EACC)
      • Xerox Chinese
      • PRC telegraph codes
      • Taiwan telegraph codes (CCDC)
  • (N) Positions in the four dictionaries used by the IRG
    • KangXi Zidian, 7th edition (1989). Peking: Zhonghua Bookstore.
    • Dai Kan-Wa Jiten, revised edition (1986). Tokyo: Taishuukan Shoten.
    • Hanyu Da Zidian, 1st edition (1986). Chengdu: Sichuan Cishu Publishing.
    • Dae Jaweon, 1st edition (1988). Seoul: Samseoung Publishing Co., Ltd.
  • (I) Positions in other commonly-used dictionaries
    • Nelson
    • Mathews' Chinese-English Dictionary by Robert H. Mathews, Cambridge: Harvard University Press, 1975.
    • The Student's Cantonese-English Dictionary, third edition, by Bernard F. Meyer and Theodore F. Wempe, Hong Kong: the Catholic Truth Society-Hong Kong, 1978.
    • A Practical Cantonese-English Dictionary, by Sidney Lau, Hong Kong: The Government Printer, 1977.
  • (I) Radical-stroke counts as derived from different sources. The radicals are numbered 1 to 214 and correspond to the traditional 214 radicals from the KangXi dictionary. The stroke counts indicate additional strokes.
    • The main radical-stroke count used for printing the Unicode Standard
    • The radical-stroke count for the KangXi dictionary
    • A general Japanese radical-stroke count
    • A general Korean radical-stroke count
    • The radical-stroke count for the Dai Kan-Wa Jiten
  • (I) Phonetic data derived from various sources
    • Chinese pronunciations
      • Cantonese (in jyutping romanization)
      • Mandarin (in a modified pinyin)
      • Tang dynasty pronunciations
    • Japanese pronunciations
      • Japanese On (Sino-Japanese)
      • Japanese Kun (native Japanese)
    • Korean/Sino-Korean
  • (I) Other dictionary data
    • The character's definition
    • The total number of strokes in the character
    • The phonetic for the character keyed to Ten Thousand Characters: An Analytic Dictionary by G. Hugh Casey, S.J. Hong Kong: Kelley and Walsh, 1980
  • (I) Variants (with links to the variant forms)
    • The simplified form(s) for this character
    • The traditional form(s) for this character
    • Semantic variations (i.e., y-variants, other characters with distinct shapes but virtually identical meanings)
    • Specialized semantic variations (i.e., other characters which may substitute for this character only in certain contexts, such as the accounting numerals)
    • Z-variants (shape variants encoded in the standard for historical reasons)
  • Compounds containing the character (not a part of the Unicode database)
    • Chinese compounds, derived largely from the CEDICT project
    • Japanese compounds from the EDICT project
  • (I) Other information contained in the Unihan database. This is largely limited to alternate possible positions for the character in IRG dictionaries and mappings to minor standards. These data may be deprecated in future releases of the Unihan database.
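The representations of a character's scalar value listed above (decimal value, UTF-8 octets, UTF-16 words, and UTF-32 word) can be reproduced with Python's built-in codecs. This is only a sketch; the function name `scalar_representations` and the output layout are our own choices, not the format used by the on-line database.

```python
def scalar_representations(ch: str) -> dict:
    """Return the decimal value and the UTF-8/16/32 forms of one character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    cp = ord(ch)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    words = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    return {
        "decimal": cp,
        "utf8": " ".join(f"{b:02X}" for b in utf8),     # hexadecimal octets
        "utf16": " ".join(f"{w:04X}" for w in words),   # sixteen-bit words
        "utf32": f"{cp:08X}",                           # single 32-bit word
    }


print(scalar_representations(chr(0x4E2D)))
# {'decimal': 20013, 'utf8': 'E4 B8 AD', 'utf16': '4E2D', 'utf32': '00004E2D'}
print(scalar_representations(chr(0x20000)))
# {'decimal': 131072, 'utf8': 'F0 A0 80 80', 'utf16': 'D840 DC00', 'utf32': '00020000'}
```

Characters beyond U+FFFF, such as the second example, need four UTF-8 octets and a surrogate pair in UTF-16.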

Normative data are either algorithmically determined or have been extensively proofed and re-proofed by the IRG. They are not subject to change. Informative data have been proofed but may still contain inaccuracies. Such data are therefore not appropriate for use in commercial products but are provided for public reference. Errors in the informative data should be reported through our online error-reporting page.

Some of the informative fields are still in the process of being populated. Other non-commercial data may be contributed to this database; contact the Unicode Consortium for more information.

The Complete Unihan Database

A complete copy of the Unihan database is available as a (very large) zipped text file on the Unicode Consortium's official ftp site. This text file includes all the data of the on-line database plus additional information. Information on how to parse the file is included in the file itself. For an overview see the description of Unihan fields in the accompanying Unihan.html file.
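The parsing instructions inside the file remain the authoritative description. As a minimal sketch only, the code below assumes the common layout in which each data line holds three tab-separated columns (a code point written as `U+XXXX`, a field name, and a value) and lines starting with `#` are comments. The function name `parse_unihan` and the sample lines are illustrative and are not copied from the database.

```python
from collections import defaultdict


def parse_unihan(lines):
    """Group field values by code point.

    Assumes 'U+XXXX<TAB>field<TAB>value' data lines and '#' comment lines;
    check the header of the actual file before relying on this layout.
    """
    records = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        records[int(code[2:], 16)][field] = value
    return dict(records)


# Illustrative sample lines (hypothetical values)
sample = [
    "# sample header comment",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E2D\tkTotalStrokes\t4",
]
data = parse_unihan(sample)
print(data[0x4E00]["kTotalStrokes"])  # prints: 1
```

With the real file, the same function could be called on an open file object, since iterating over a text file yields its lines.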

Unihan Code Charts and Index

The Unihan Radical-Stroke index is documented in Chapter 18 of The Unicode Standard (PDF). The index itself is available online in two PDF files, the Full RS Index and the IICore RS Index. Code Charts covering all of Unihan are available in PDF format, linked from the main chart index page along with other code charts.

Disclaimers

The Unihan database is provided as a public service by Unicode, Inc. These data are provided as-is by Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided.