Unihan Database
About the Unihan Database
As a handy reference, we provide here the contents of a database
of information regarding the CJK Unified Ideographs (Unihan), which
are part of the Unicode Standard.
This database originated with data provided by Research Libraries
Group, Xerox, Taligent, and Apple. Chinese and Japanese compound
data presented in the on-line database come from the on-line
CEDICT and Jim Breen's EDICT projects. These additional data are not available
in the text-file version.
There are also two indices for the database, a grid index grouping
the characters in blocks of 256 and a radical-stroke index. A search
page is also available. Individual characters can be accessed
through the index or via the "Lookup" button and text field above.
Enter the four- or five-digit hexadecimal identifier for the
character, and click "Lookup." You will be taken to an information
page for the character. The "UTF-8" check-box allows you to control
whether ideographs are displayed using UTF-8 text or embedded GIFs.
The latter technique is less dependent on your browser and system
support for Unicode but is slower.
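As a rough illustration of the lookup described above, the following Python sketch turns a four- or five-digit hexadecimal identifier into a code point and works out which block of 256 it would fall into in a grid index. The function names are hypothetical, the optional "U+" prefix is an added convenience, and the assumption that grid blocks start on multiples of 256 is ours rather than something this page states.

```python
import string
from typing import Tuple


def parse_identifier(hex_id: str) -> int:
    """Turn a four- or five-digit hexadecimal identifier into a code point."""
    text = hex_id.strip()
    # Assumption: tolerate an optional "U+" prefix, which the page does not mention.
    if text.upper().startswith("U+"):
        text = text[2:]
    if not 4 <= len(text) <= 5 or any(c not in string.hexdigits for c in text):
        raise ValueError(f"expected 4 or 5 hexadecimal digits, got {hex_id!r}")
    return int(text, 16)


def grid_block(code_point: int) -> Tuple[int, int]:
    """Return the first and last code point of the enclosing block of 256.

    Assumption: blocks are aligned on multiples of 256.
    """
    start = code_point & ~0xFF  # clear the low eight bits
    return start, start + 0xFF


code_point = parse_identifier("4E2D")
first, last = grid_block(code_point)
print(f"U+{code_point:04X} lies in the block U+{first:04X}..U+{last:04X}")
```

Rejecting anything other than four or five hexadecimal digits mirrors the wording of the lookup instructions; a real lookup form may be more lenient.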
The on-line database includes the following information for
individual glyphs, together with an (N) or (I) to indicate whether
they are normative or informative parts of the standard:
- (I) Several different glyphs for the character
- The "standard" glyph used in producing the Unicode Standard,
Version 4.0.
- A glyph provided by your UTF-8 savvy Web browser.
- A glyph or glyphs designed for use with traditional Chinese.
- A glyph or glyphs designed for use with simplified Chinese.
- A glyph or glyphs designed for use with Japanese.
- A glyph or glyphs designed for use with Korean.
- (N) Different representations of the character's scalar value
- A decimal value (for reference purposes)
- The UTF-8 encoding form as a series of hexadecimal octets
- The UTF-16 encoding form as a series of sixteen-bit words
- The UTF-32 encoding form as a single 32-bit long word
- (N) Mappings to the IRG sources for the character. An
explanation of the sources and how the IRG unifies characters can
be found in The Unicode Standard,
Chapter 12 (PDF).
- (I) Mappings to major industrial and national standards and
other character collections
- Chinese standards
- GB 2312-1980
- GB 12345-90
- CNS 11643-1986
- CNS 11643-1992
- CCCII, level 1
- Big Five (without the ETEN extensions)
- HK SCS
- Japanese standards
- JIS X 0208-1990
- JIS X 0212-1990
- JIS X 0213-2000
- Korean standards
- KS X 1001:1992
- KS X 1002:1991
- Other standards or collections
- ANSI Z39.64-1989 (EACC)
- Xerox Chinese
- PRC telegraph codes
- Taiwan telegraph codes (CCDC)
- (N) Positions in the four dictionaries used by the IRG
- KangXi Zidian, 7th edition (1989). Peking: Zhonghua
Bookstore.
- Dai Kan-Wa Jiten, revised edition (1986). Tokyo:
Taishuukan Shoten.
- Hanyu Da Zidian, 1st edition (1986). Chengdu: Sichuan
Cishu Publishing.
- Dae Jaweon, 1st edition (1988). Seoul: Samseoung
Publishing Co., Ltd.
- (I) Positions in other commonly-used dictionaries
- Nelson
- Mathews' Chinese-English Dictionary by Robert H.
Mathews, Cambridge: Harvard University Press, 1975.
- The Student's Cantonese-English Dictionary, third
edition, by Bernard F. Meyer and Theodore F. Wempe, Hong Kong:
the Catholic Truth Society-Hong Kong, 1978.
- A Practical Cantonese-English Dictionary, by Sidney
Lau, Hong Kong: The Government Printer, 1977.
- (I) Radical-stroke counts as derived from different sources.
The radicals are numbered 1 to 214 and correspond to the
traditional 214 radicals from the KangXi dictionary. The
stroke counts give the number of strokes in addition to those of the radical.
- The main radical-stroke count used for printing the Unicode
Standard
- The radical-stroke count for the KangXi dictionary
- A general Japanese radical-stroke count
- A general Korean radical-stroke count
- The radical-stroke count for the Dai Kan-Wa Jiten
- (I) Phonetic data derived from various sources
- Chinese pronunciations
- Cantonese (in jyutping romanization)
- Mandarin (in a modified pinyin)
- Tang dynasty pronunciations
- Japanese pronunciations
- Japanese On (Sino-Japanese)
- Japanese Kun (native Japanese)
- Korean/Sino-Korean
- (I) Other dictionary data
- The character's definition
- The total number of strokes in the character
- The phonetic for the character keyed to Ten Thousand
Characters: An Analytic Dictionary by G. Hugh Casey, S.J.
Hong Kong: Kelley and Walsh, 1980
- (I) Variants (with links to the variant forms)
- The simplified form(s) for this character
- The traditional form(s) for this character
- Semantic variations (i.e., y-variants, other characters with
distinct shapes but virtually identical meanings)
- Specialized semantic variations (i.e., other characters
which may substitute for this character only in certain
contexts, such as the accounting numerals)
- Z-variants (shape variants encoded in the standard for
historical reasons)
- Compounds containing the character (not a part of the
Unicode database)
- Chinese compounds, derived largely from the CEDICT project
- Japanese compounds from the EDICT project
- (I) Other information contained in the Unihan database. This
is largely limited to alternate possible positions for the
character in IRG dictionaries and mappings to minor standards.
These data may be deprecated in future releases of the Unihan
database.
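The different representations of a character's scalar value listed above (decimal, UTF-8 octets, UTF-16 words, and a single UTF-32 word) can be derived from the code point alone. The Python sketch below shows one way to compute them with the standard codecs; the function name and the dictionary layout are illustrative choices, not the format used by the on-line database.

```python
from typing import Dict, List, Union


def scalar_representations(code_point: int) -> Dict[str, Union[int, str, List[str]]]:
    """Compute decimal, UTF-8, UTF-16 and UTF-32 forms of one scalar value."""
    ch = chr(code_point)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    utf32 = ch.encode("utf-32-be")
    return {
        "decimal": code_point,
        # One two-digit hexadecimal string per octet.
        "utf8": [f"{octet:02X}" for octet in utf8],
        # One four-digit hexadecimal string per sixteen-bit word.
        "utf16": [
            f"{int.from_bytes(utf16[i:i + 2], 'big'):04X}"
            for i in range(0, len(utf16), 2)
        ],
        # A single 32-bit word.
        "utf32": f"{int.from_bytes(utf32, 'big'):08X}",
    }


print(scalar_representations(0x4E2D))
print(scalar_representations(0x20000))
```

For U+4E2D this gives the decimal value 20013, the UTF-8 octets E4 B8 AD, the single UTF-16 word 4E2D, and the UTF-32 word 00004E2D. A five-digit identifier such as U+20000 needs two UTF-16 words (a surrogate pair, D840 DC00).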
Normative data are either algorithmically determined or have been
extensively proofed and re-proofed by the IRG. They may not change.
Informative data has been proofed but may still contain
inaccuracies. Such data are therefore not appropriate for use in
commercial products but are provided for public reference. Errors in
the informative data should be reported through our online
error-reporting page.
Some of the informative fields are still in the process of being
populated. Other non-commercial data may be contributed to this
database; contact the Unicode Consortium for more information.
The Complete Unihan Database
A complete copy of the Unihan database is available as a (very
large) zipped text file on the Unicode Consortium's official ftp site.
Apart from the CEDICT and EDICT compound data, this text file includes
all the data of the on-line database plus additional information.
Information on how to parse the file is included in the file itself.
For an overview, see the description of Unihan fields in the
accompanying Unihan.html file.
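The header of the text file remains the authoritative description of its layout. As a minimal sketch only, the Python below assumes the commonly seen layout of one tab-separated record per line (code point as U+XXXX, field name, value) with comment lines starting with "#". The function name is hypothetical, and the sample lines are invented for illustration rather than quoted from the database.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_unihan(lines: Iterable[str]) -> Dict[int, Dict[str, str]]:
    """Group field values by code point, under the assumed three-column layout."""
    records: Dict[int, Dict[str, str]] = defaultdict(dict)
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3 or not parts[0].startswith("U+"):
            continue  # skip anything that does not match the assumed layout
        code_point = int(parts[0][2:], 16)
        records[code_point][parts[1]] = parts[2]
    return dict(records)


# Invented sample input; real data should be read from the unzipped file.
SAMPLE = [
    "# illustrative lines, not quoted from the database",
    "U+4E2D\tkTotalStrokes\t4",
    "U+4E2D\tkDefinition\tmiddle (illustrative gloss)",
]

for code_point, fields in parse_unihan(SAMPLE).items():
    print(f"U+{code_point:04X}", fields)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which avoids loading the whole (very large) file into memory before parsing.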
Unihan Code Charts and Index
The Unihan Radical-Stroke index is documented in Chapter 18 of
The Unicode Standard (PDF). The index itself is available online in
two PDF files, the Full RS Index and the II Core RS Index. Code Charts
covering all of Unihan are available in PDF format, linked from the
main chart index page along with other code charts.
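A radical-stroke index orders characters first by radical number (1 to 214) and then by the number of additional strokes. The Python sketch below illustrates that ordering. It assumes radical-stroke values are written as "radical.strokes", optionally with an apostrophe after the radical to mark a simplified radical form; the names and sample values are hypothetical, and the exact syntax should be checked against the Unihan field descriptions.

```python
import re
from typing import NamedTuple


class RadicalStroke(NamedTuple):
    radical: int             # KangXi radical number, 1 to 214
    simplified: bool         # True if the radical is marked with an apostrophe
    additional_strokes: int  # strokes in addition to those of the radical


# Assumed syntax: digits, optional apostrophe, a dot, then the stroke count.
_RS_PATTERN = re.compile(r"^(\d{1,3})('?)\.(\d{1,2})$")


def parse_rs(value: str) -> RadicalStroke:
    """Parse one radical-stroke value such as '2.3'."""
    match = _RS_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognized radical-stroke value: {value!r}")
    radical = int(match.group(1))
    if not 1 <= radical <= 214:
        raise ValueError(f"radical out of range 1..214: {radical}")
    return RadicalStroke(radical, bool(match.group(2)), int(match.group(3)))


def index_key(value: str):
    """Sort key: radical number first, then additional strokes."""
    rs = parse_rs(value)
    return (rs.radical, rs.additional_strokes)


entries = ["85.3", "2.3", "9.2", "2.0"]
print(sorted(entries, key=index_key))
```

Sorting on the numeric pair rather than on the raw string matters: a plain string sort would place "85.3" before "9.2".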
Disclaimers
The Unihan database is provided as a public service by Unicode,
Inc. These data are provided as-is by Unicode, Inc. (The Unicode
Consortium). No claims are made as to fitness for any particular
purpose. No warranties of any kind are expressed or implied. The
recipient agrees to determine applicability of information provided.