Chinese, Japanese and Korean
Q: I have heard that UTF-8 does not support
some Japanese characters. Is this correct?
A: There is a lot of misinformation floating around about the
support of Chinese, Japanese and Korean (CJK) characters. The Unicode
Standard supports all of the CJK characters from JIS X 0208, JIS X
0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true
no matter which encoding form of Unicode is used: UTF-8, UTF-16, or
UTF-32.
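As a small illustration (a sketch in Python, not part of the standard), the same CJK text survives a round trip through each of the three encoding forms; only the byte layout differs. The sample string and the helper name round_trips are arbitrary choices made for this example.

```python
# Round-trip a few CJK characters through the three Unicode encoding forms.
# The sample is arbitrary: U+8349, U+76F4, and U+20000 (outside the BMP).
sample = "\u8349\u76F4\U00020000"

CODECS = ("utf-8", "utf-16-le", "utf-32-le")


def round_trips(text, codec):
    """Return True if text survives encode/decode with the given codec."""
    return text.encode(codec).decode(codec) == text


results = {codec: round_trips(sample, codec) for codec in CODECS}
byte_lengths = {codec: len(sample.encode(codec)) for codec in CODECS}

print(results)       # every codec should report True
print(byte_lengths)  # same characters, different number of bytes
```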
Unicode supports over 70,000 CJK characters right now, and work is
underway to encode further additions. The International Standard ISO/IEC
10646 and the Unicode Standard are completely synchronized in repertoire
and content. This means that Unicode has the same repertoire as GB
18030, since that standard is also synchronized with ISO/IEC 10646, although
with a different ordering and byte format.
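A minimal sketch of what "same repertoire, different byte format" means in practice, assuming Python's built-in gb18030 codec is an acceptable stand-in for a GB 18030 implementation. The sample characters are arbitrary.

```python
# The same characters can be carried in GB 18030 or in UTF-8; the bytes
# differ, but decoding either one gives back the same Unicode text.
sample = "\u8349\u55F0\U00020000"

as_gb18030 = sample.encode("gb18030")
as_utf8 = sample.encode("utf-8")

from_gb18030 = as_gb18030.decode("gb18030")
from_utf8 = as_utf8.decode("utf-8")

print(as_gb18030.hex(), as_utf8.hex())
print(from_gb18030 == from_utf8 == sample)
```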
Q: Who is responsible for future CJK
characters?
A: The development and extension of the CJK characters is
being done by the Ideographic Rapporteur Group (IRG), which includes
official representatives of China, Hong Kong (SAR), Macao (SAR),
Singapore, Japan, South Korea, North Korea, Taiwan and Vietnam, plus a
representative from the Unicode Consortium. For more information, see the
IRG home page.
The IRG is very carefully cataloging, reviewing, and
assessing CJK characters for inclusion into the standard. The only real
limitation on the number of CJK characters in the standard is the ability
of this group to process them, because the characters are increasingly
obscure (no person — living or deceased — knows more than a fraction of
the set already encoded).
Q: Does the Unified Han character encoding in
Unicode mean that I only need one CJK font for Asia, or do I have to allow
for choices between different styles of CJK fonts for different countries?
A: Broadly speaking, there are four traditions for character
shapes in East Asia: traditional Chinese (used primarily in Taiwan, Hong
Kong, and overseas Chinese communities), simplified Chinese (used
primarily in mainland China and Singapore), Japanese, and Korean. Using a
single font for all four locales allows the characters to be legible, but
means that some characters may look odd. For optimal results a system
localized for use in Japan, for example, should use a font designed
explicitly for use with Japanese, rather than a generic Unihan font.
[JJ] and [KW]
Q: If the character shapes are different in
different parts of East Asia, why were the characters unified?
A: The Unicode standard is designed to encode characters, not
glyphs. Even where there are substantial variations in the standard way of
writing a character from locale to locale, if the fundamental identity of
the character is not in question, then a single character is encoded in
Unicode.
This principle applies to East Asian scripts as well as to
those of other parts of the world. It is well-recognized that the Han
characters involved are the same, even when used in different
countries to write different languages. In the overwhelming majority of
cases where a Han character is written differently in different locales,
readers from one locale would recognize the form used in another; in all
cases, experts from throughout East Asia would recognize the fundamental
unity of the character.
As a rule, the differences in writing style between the
different East Asian locales are within the general range of allowable
differences within each typographic tradition.
- E.g., the official "Taiwanese" glyph for
草 U+8349 ("grass") per ISO/IEC 10646 uses four strokes for the "grass"
radical, whereas the PRC, Japanese, and Korean glyphs use three. As it
happens, Apple's LiSung Light font for Big Five (which follows the
"Taiwanese" typographic tradition) uses three strokes.
Japanese users prefer to see Japanese text written with
"Japanese" glyphs.
- There are occasional instances of unified characters whose
typical Chinese glyph and typical Japanese glyph are distinct enough
that the Chinese glyph will be unfamiliar to the typical Japanese
reader, e.g., 直 U+76F4. To prevent legibility problems for Japanese
readers, it is advisable to use a Japanese-style font when presenting
Unihan text to Japanese readers.
It is also typical for Japanese users to see Chinese text
written with "Japanese" glyphs. For example:
- A standard Japanese dictionary which quotes Chinese authors
(e.g., Mencius) uses "Japanese" glyphs, not Chinese ones.
- In particular, it is perfectly acceptable within Japanese
typography for stretches of Chinese quoted in a predominantly Japanese
text to be written with "Japanese" glyphs.
Han Unification is designed to preserve legibility. Documents
typically can be simply displayed in the font preferred by the user. Where
a distinction in style needs to be made (for example, Chinese-style vs.
Japanese-style glyphs in the same document), appropriate fonts should be
applied to the specific text as needed.
Because of limitations in existing fonts, it may occasionally happen
that a rare kanji will be displayed using a Chinese-style glyph where
a Japanese-style glyph would be preferred. This is a font issue, not
a character encoding issue, and the same problem can occur with other
character encoding standards.
For more information, see
On the Encoding of Latin, Greek, Cyrillic, and Han.
[JJ]
Q: How can I recognize from the 32-bit value of
a Unicode character whether it is a Chinese, Korean or Japanese character?
A: It's basically impossible and largely meaningless. It's the equivalent
of asking if "a" is an English letter or a French one. There are
some characters where one can guess based on the source information
in Unihan.txt that it's traditional Chinese, simplified Chinese,
Japanese, Korean, or Vietnamese, but there are too many exceptions to
make this really reliable. (For example, one particularly nasty
obscenity in Cantonese would probably have never been encoded for
Cantonese, but has made it in for the sake of Korean, where one hopes
it isn't nearly as obscene.)
The phonetic data in Unihan.txt should not be used for this purpose. A
blank in the phonetic data means that nobody's supplied a reading, not
that a reading doesn't exist. Because updating the Unihan database is an
ongoing process, these fields will be increasingly filled out as time goes on,
but they should never be taken as absolutely complete. In particular, there are
obscure characters where it is known that there is a reading, but since the character does not occur in
standard dictionaries, we are unable to supply it (e.g., 䃟 U+40DF in
Cantonese).
A better solution is to look at the text as a whole: if there's a fair
amount of kana, it's probably Japanese, and if there's a fair amount of
hangul, it's probably Korean.
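The whole-text heuristic can be sketched roughly as follows. This is only an illustration: the function names, the 10% threshold, and the restriction to a handful of blocks are assumptions made for the example, and the result is a guess, not a determination.

```python
def count_scripts(text):
    """Count kana, hangul, and Han characters using a few block ranges."""
    counts = {"kana": 0, "hangul": 0, "han": 0}
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:            # Hiragana and Katakana blocks
            counts["kana"] += 1
        elif (0xAC00 <= cp <= 0xD7A3          # Hangul Syllables
              or 0x1100 <= cp <= 0x11FF       # Hangul Jamo
              or 0x3130 <= cp <= 0x318F):     # Hangul Compatibility Jamo
            counts["hangul"] += 1
        elif (0x4E00 <= cp <= 0x9FFF          # CJK Unified Ideographs
              or 0x3400 <= cp <= 0x4DBF       # Extension A
              or 0x20000 <= cp <= 0x2A6DF):   # Extension B
            counts["han"] += 1
    return counts


def guess_language(text, threshold=0.1):
    """Guess 'ja' or 'ko' from the share of kana or hangul; else undecided."""
    counts = count_scripts(text)
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    if counts["kana"] / total >= threshold:
        return "ja"
    if counts["hangul"] / total >= threshold:
        return "ko"
    # Han characters alone do not identify a language.
    return "zh-or-undetermined"


print(guess_language("これは日本語の文章です"))
print(guess_language("한국어 문장입니다"))
print(guess_language("这是中文"))
```

Text consisting only of Han characters stays undetermined here, which matches the point made above: the code points themselves do not say which language is being written.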
The only proper mechanism, as for determining whether "chat" is
spelled correctly in English or French, is to use a higher-level
protocol.
[JJ]
Q: How does character input on a keyboard work for Chinese characters?
A: This is a complicated question. For answers, see
How are Chinese characters input?
Q: Why is Unicode missing some characters
from the Big Five character set?
A: The "Big Five" character set is an industrial standard
commonly used for traditional Chinese. There are, however, several
versions of the Big Five in common use, generally representing extensions
of the formal standard. There are two main versions, "plain Big Five" and
"ETEN Big Five", as well as numerous vendor- or platform-specific
extensions. In recent years, there have been further extensions such as
the Hong Kong Extension to Big Five and Big Five Plus.
The initial, un-extended Big Five was the standard version of
the character set at the time that the Unicode Standard, Version 1.0, was
under development, and Unicode was designed to cover its ideographic
repertoire completely. This is reflected in the data files supplied by the
Unicode Consortium. Some vendors provide vendor-specific tables showing
mapping data for their custom Big Five extensions and Unicode. The Unicode
Consortium does not, however, provide data on every known dialect of the
Big Five, so it is possible that a particular dialect of the Big Five is
not included in the tables provided by Unicode.
[JJ]
Q: I hear that certain characters from the GB 18030 encoding are not mapped to any code points in Unicode,
and need to be mapped to characters in the Private Use Area instead. Is this true? And if so, is the issue being dealt with in the near future?
A: That used to be true, as of Unicode 4.0. There were in fact a small number of characters in GB 18030 that had not made it into
Unicode (and ISO/IEC 10646). However, to avoid having to map characters to the PUA for support of GB 18030, the missing characters were added as of
Unicode 4.1, so they are also in Unicode 5.0 and later versions.
You can find the characters in question in Annex C (p. 92) of GB 18030-2000. All now have regular Unicode characters.
These can be found in the ranges: U+31C0..U+31CF (for CJK strokes) and U+9FA6..U+9FBB (for various CJK characters and components).
[KW]
Q: Isn't it true that some Japanese can't write their own names in Unicode?
A: There are some situations where an individual prefers their
name be written with a specific glyph, as in the West we have John and Jon,
Mark and Marc, Cathy and Kathy. In most cases, variation sequences in the
UTS #37 Ideographic Variation Database can be used to provide the required
representation in plain text. In other cases, the variant forms have been
encoded in Unicode as distinct characters. The IRG also may consider where
the encoding of new variant characters is justified.
It should be noted that this is not a problem of Han unification per
se, as it is often represented. Unicode is a superset of the major
Japanese character encoding standards. The various JIS standards and
ISO 2022-based encodings have the same limitation. [JJ]
Q: How do Korean letters work in Unicode?
A: There are four main types of encoded Korean letters:
(a) Jamo
(b) Hangul Syllables
(c) compatibility Jamo, and
(d) half-width Jamo.
(c) and (d) are present for compatibility with legacy code pages, and are
not required for the representation of Korean.
[MD]
Q: What are the Hangul Syllables?
A: They can fundamentally be thought of as composite
characters: a compacted representation of certain sequences of Jamo. Of
course in practice, these are the main characters in actual use, but from
a logical point of view they are simply precomposed sequences, and treated
that way in normalization and other processing.
[MD]
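The "precomposed sequence" view can be made concrete with the arithmetic mapping between modern Jamo and Hangul Syllables that the Unicode Standard specifies. The following is a minimal sketch; the function and constant names are chosen for this example, and only modern Jamo (19 leads, 21 vowels, 27 trails) are handled.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28          # T_COUNT includes the "no trail" case
N_COUNT = V_COUNT * T_COUNT        # 588 syllables per leading consonant


def compose(lead, vowel, trail=None):
    """Compose modern conjoining Jamo <L, V> or <L, V, T> into a syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


def decompose(syllable):
    """Decompose a Hangul Syllable into its conjoining Jamo sequence."""
    s_index = ord(syllable) - S_BASE
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    return lead + vowel + (chr(T_BASE + t_index) if t_index else "")


han = "\uD55C"  # HANGUL SYLLABLE HAN
print([hex(ord(c)) for c in decompose(han)])
print(decompose(han) == unicodedata.normalize("NFD", han))
```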
Q: How are the Jamo used?
A: Jamo are divided into three classes: L, V, T (lead, vowel,
trail). A standard syllable consists of L V, or L V T. As long as text is
represented in sequences of these (e.g. L V L V T L V T L V...) there is
no issue. If isolated jamo, such as just an L, are to be
represented, there are two ways to do it:
(a) Simply use L on its own (but this must not be followed by
V).
(b) Use a sequence with a filler, Vf, to make a standard syllable: L Vf
Similarly, for an isolated V, you could use V (if not
preceded by L) or the sequence Lf V, and for isolated T you could use T
(if not preceded by V) or the sequence Lf Vf T.
[MD]
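Option (b) can be sketched as below, taking Lf to be U+115F HANGUL CHOSEONG FILLER and Vf to be U+1160 HANGUL JUNGSEONG FILLER. The helper names are hypothetical, and for brevity only the main Hangul Jamo block (U+1100..U+11FF) is classified; the extended Jamo blocks are ignored.

```python
CHOSEONG_FILLER = "\u115F"   # Lf
JUNGSEONG_FILLER = "\u1160"  # Vf


def jamo_class(ch):
    """Classify a conjoining jamo from the U+1100 block as 'L', 'V' or 'T'."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    return None


def pad_isolated(jamo):
    """Wrap a single isolated jamo with fillers to form a standard syllable."""
    kind = jamo_class(jamo)
    if kind == "L":
        return jamo + JUNGSEONG_FILLER                      # L Vf
    if kind == "V":
        return CHOSEONG_FILLER + jamo                       # Lf V
    if kind == "T":
        return CHOSEONG_FILLER + JUNGSEONG_FILLER + jamo    # Lf Vf T
    raise ValueError("not a conjoining jamo from the U+1100 block")


print([hex(ord(c)) for c in pad_isolated("\u1100")])
print([hex(ord(c)) for c in pad_isolated("\u11A8")])
```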
Q: Do you ever get mixtures of Hangul
Syllables and Jamo?
A: Yes, you could. If the text is in NFD, then it will only
contain Jamo. If it is in NFC (or unnormalized), most text will be Hangul
Syllables. However, Jamo could occur in certain circumstances:
(a) isolated Jamo
(b) pre-1933 orthography Korean text
(c) modern incomplete syllables (e.g. syllables without a leading
consonant as used in dictionaries and grammar books)
(d) syllables used for a more faithful phonetic representation of some
dialects
In the latter case, there are two possibilities. If the L or
V are ancient Jamo, then the entire syllable would be in Jamo. If both are
modern Jamo but the T is ancient, then the syllable would be represented
by a sequence of two characters: a single code point for LV, followed by
the code point for the T: <LV, T>
This is similar to the case of Latin. The NFC form of A +
grave + umlaut is <A-grave, umlaut> : part is precomposed and the
remainder is not. [MD] &
[JS]
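These cases can be observed with any conformant normalizer. The sketch below uses Python's unicodedata module; U+11C3 is used as an example of a trailing consonant outside the modern set, and the variable names are arbitrary.

```python
import unicodedata

# NFD: Hangul Syllables are fully decomposed into conjoining Jamo.
nfd = unicodedata.normalize("NFD", "\uD55C\uAE00")
only_jamo = all(0x1100 <= ord(c) <= 0x11FF for c in nfd)

# NFC with modern L and V but an old T: only <L, V> composes, giving <LV, T>.
mixed = unicodedata.normalize("NFC", "\u1100\u1161\u11C3")

# The Latin parallel: A + grave + diaeresis becomes <A-grave, diaeresis>.
latin = unicodedata.normalize("NFC", "A\u0300\u0308")

print(only_jamo, len(nfd))
print([hex(ord(c)) for c in mixed])
print([hex(ord(c)) for c in latin])
```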
Q: Does this make any difference in how a
syllable should be displayed?
A: No. Whether a syllable is represented in the form <L, V,
T>, <LVT>, or <LV, T>, it should still be displayed in a single 'cell'.
[MD]
Q: But how should non-standard syllables be
displayed?
A: An L that is not followed by a V should be displayed as if
it were the sequence <L, Vf>. A V that is not preceded by an L should
display as if it were the sequence <Lf, V>. A T that is not preceded by <L,V>
or LV, should display as if it were the sequence <Lf, Vf, T>
[MD]
Q: When mapping to KS X 1001 (formerly known
as KS C 5601), how should I handle conjoining Jamo?
A: The easiest approach is to first convert the text using NFC. Then
convert any remaining conjoining jamo to the compatibility jamo
characters. For example, U+1100 (ᄀ) to U+3131 (ㄱ). The conjoining filler
characters can simply be removed. [MD]
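A rough sketch of that procedure follows. The Unicode character database has no direct "conjoining to compatibility" mapping, so this example derives one from character names (for instance HANGUL CHOSEONG KIYEOK to HANGUL LETTER KIYEOK). That name-based lookup is a shortcut assumed for illustration, not an official table, and jamo without a same-named compatibility letter are left unchanged.

```python
import unicodedata

FILLERS = {"\u115F", "\u1160"}  # conjoining choseong and jungseong fillers


def conjoining_to_compat(ch):
    """Map a conjoining jamo to a compatibility jamo via its name, if any."""
    try:
        name = unicodedata.name(ch)
    except ValueError:          # unnamed code point: leave as is
        return ch
    for kind in ("CHOSEONG", "JUNGSEONG", "JONGSEONG"):
        prefix = "HANGUL " + kind + " "
        if name.startswith(prefix):
            try:
                return unicodedata.lookup("HANGUL LETTER " + name[len(prefix):])
            except KeyError:    # no compatibility letter with that name
                return ch
    return ch


def prepare_for_ksx1001(text):
    """NFC first, then drop fillers and map any remaining conjoining jamo."""
    text = unicodedata.normalize("NFC", text)
    return "".join(conjoining_to_compat(ch) for ch in text if ch not in FILLERS)


print(prepare_for_ksx1001("\u1100"))                    # isolated lead
print(prepare_for_ksx1001("\u1112\u1161\u11AB"))        # composes under NFC
print(prepare_for_ksx1001("\u1100").encode("euc_kr"))   # now encodable
```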
Q: When mapping to KS X 1001-based MBCS
character encodings, how should I map the 8,822 Unicode Hangul syllables
not covered by KS X 1001?
A: KS X 1001:1998 covers only 2,350 pre-composed Hangul
syllables. The same is true of the KS X 1001-based EUC-KR and ISO-2022-KR
encodings. The rest of the Hangul syllables in Unicode (8,822 of them)
have to be mapped to 8-byte sequences, as specified in Section 3.3 of the
annotations to KS X 1001:1998 (KS C 5601-1992). This works as follows:
The first two octets (<0x24 0x54> in GL and <0xA4 0xD4> in GR)
signify the beginning of a sequence; they are directly followed by 6 bytes
which represent the initial consonant, the medial vowel, and the final
consonant of a Hangul syllable, each using two bytes. By this mechanism,
full round-trip conversion is possible between Unicode and KS X 1001-based
encodings.
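The 8-byte sequence described above can be sketched as follows for the GR (EUC-KR) form. The function name makeup_sequence and the three lookup strings are constructed for this example. Using the Hangul filler (U+3164, bytes 0xA4 0xD4) in the final-consonant slot when a syllable has no final consonant is an assumption of this sketch and should be checked against the KS X 1001 annotations.

```python
S_BASE = 0xAC00
S_COUNT = 11172

# Compatibility jamo for each lead, vowel, and trail index of a syllable.
L_COMPAT = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
V_COMPAT = "".join(chr(cp) for cp in range(0x314F, 0x3164))
# Index 0 means "no final consonant"; the Hangul filler is assumed there.
T_COMPAT = "\u3164" + "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"


def makeup_sequence(syllable):
    """Build the 8-byte EUC-KR sequence for one precomposed Hangul syllable."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_COMPAT[index // 588]
    vowel = V_COMPAT[(index % 588) // 28]
    trail = T_COMPAT[index % 28]
    # Two marker bytes, then two bytes each for initial, medial, and final.
    return b"\xa4\xd4" + (lead + vowel + trail).encode("euc_kr")


sequence = makeup_sequence("\uB620")  # a syllable outside the 2,350
print(sequence.hex(" "))
```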
Note that both Windows Code Page 949 (Unified Hangul Code),
used in Korean MS-Windows, and JOHAB (specified as a supplementary
encoding in KS C 5601-1992 Annex 3 = KS X 1001:1998 Annex 3, and
equivalent to Windows Code Page 1361) cover the full repertoire of 11,172
Unicode pre-composed Hangul syllables, and thus don't have this mapping
problem. [JS]
Q: Where can I find a Unicode mapping for
EACC?
A: EACC is an American National Standard, East Asian
Character Code for Bibliographic Use (ANSI/NISO Z39.64), developed by the
library community. The Library of Congress specifies use of EACC for CJK
data in MARC 21 records that do not use UTF-8. The Unicode-EACC mapping
approved by the MARBI Committee of the American Library Association is
available on the MARC 21 Web site. [JA]
Q: Why doesn't the Unihan database include
mappings for all EACC characters?
A: The Unihan database covers only the ideographs in the
Unicode Standard. EACC also includes characters such as Japanese kana and
Korean hangul that are outside the scope of the Unihan database.
[JA]
Q: What is JIS X0213?
A: JIS X0213, 7-bit and 8-bit double byte coded extended Kanji
sets for information interchange, is a new Japanese national standard coded
character set established by JISC (Japanese Industrial Standards Committee).
It was established in January 2000, then revised in February 2004. It
enumerates 11,233 characters, which extends the 6,879 characters of the
JIS X0208 standard. It consists of 10,050 Kanji (ideographic) characters
and 1,183 non-Kanji (non-ideographic) characters. These characters are
arranged in two planes of a 94-row-by-94-cell matrix. Also, as an informative
annex, three encoding methods are defined as extensions of existing de facto
encodings, that is, Shift JIS, EUC-JP, and ISO-2022-JP.
[TO]
Q: How is JIS X0213 related to some existing
JIS standards?
A: There are several JIS coded character set standards. JIS
X0201 is the single-byte coded character set which adapts the ISO/IEC 646
standard in Japan. JIS X0208, JIS X0212 and JIS X0213 are the double-byte
coded character sets, and JIS X0221 is the multi-byte coded character set
which corresponds to ISO/IEC 10646. JIS X0208 is the primary double-byte
coded character set used for Japanese. Although both JIS X0212 and JIS
X0213 Kanji standards have been established as the supplement to JIS X0208
standard, the scopes of their source character sets are different.
[TO]
Q: How is it related to Unicode / ISO/IEC
10646?
A: Almost all characters in JIS X0213 have corresponding
characters in Unicode / ISO/IEC 10646. Only a few non-Kanji characters are
represented by composite sequences in Unicode / ISO/IEC 10646. Kanji
characters are mapped to one of the blocks of CJK Unified Ideographs, CJK
Unified Ideographs Extension A, or CJK Unified Ideographs Extension B in
Unicode 4.0 (or later versions) and corresponding versions of ISO/IEC
10646; or are mapped to CJK Compatibility Ideographs. [TO]
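To see which of these blocks a given Kanji falls into, a simple range check is enough. The sketch below covers only the four blocks named in the answer; the table and function name are made up for this example.

```python
# (first, last, block name) for the blocks named in the answer above.
IDEOGRAPH_BLOCKS = [
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
]


def ideograph_block(ch):
    """Return the name of the ideograph block containing ch, or None."""
    cp = ord(ch)
    for first, last, name in IDEOGRAPH_BLOCKS:
        if first <= cp <= last:
            return name
    return None


for ch in ("\u8349", "\u40DF", "\U00020000", "\uFA30", "a"):
    print("U+%04X" % ord(ch), ideograph_block(ch))
```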
Q: Where to get more information about JIS
X0213?
A: For more information about the JIS X0213 standard, contact the
Japanese Standards Association. [TO]
Q: I have heard there are problems in
Japanese and other East Asian mapping tables. Where can I find
information about these problems?
A: There are many well-known mapping problems and
discrepancies. For example:
Shift-JIS byte 0x5C can be mapped to U+005C or U+00A5, which are different,
unrelated characters with unrelated glyphs.
Shift-JIS bytes <0x81 0x5C> can be mapped to U+2014 or U+2015,
which look almost the same.
Shift-JIS bytes <0x87 0x82> and <0xFA 0x59> can
both be mapped to U+2116,
but the primary roundtrip mapping may be different between platforms.
That is, what U+2116 maps back to may be different.
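One way to observe such discrepancies on a given system is to decode the same bytes with several converters and compare. The sketch below uses three codecs that ship with Python (shift_jis, cp932, shift_jis_2004). The helper name is arbitrary, and the results reflect Python's own tables only, not those of any other platform.

```python
def decode_variants(raw, codecs):
    """Decode raw bytes with each codec; None marks an undecodable input."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


CODECS = ("shift_jis", "cp932", "shift_jis_2004")

backslash_or_yen = decode_variants(b"\x5c", CODECS)
numero_nec = decode_variants(b"\x87\x82", CODECS)
numero_ibm = decode_variants(b"\xfa\x59", CODECS)

# Two byte sequences decode to U+2116, but only one can be the way back.
numero_round_trip = "\u2116".encode("cp932")

for label, table in (("0x5C", backslash_or_yen),
                     ("0x87 0x82", numero_nec),
                     ("0xFA 0x59", numero_ibm)):
    print(label, {k: (v and "U+%04X" % ord(v)) for k, v in table.items()})
print("U+2116 ->", numero_round_trip.hex(" "))
```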
Sometimes the standard is ill defined, and each vendor has
a choice in how to implement the Unicode mapping table. Examples include
the Big5-HKSCS and several other codepages. Sometimes the mapping table
varies, even on the same platform. For example, Windows-950 is either
Big5 or Big5-HKSCS, and the latter depends on the user applying a
Windows specific patch. Implementations of ISO 2022 encodings like
ISO-2022-JP differ not only in the mapping tables for the sub-encodings
but also in the supported sets of escape sequences and their invocation
pattern.
The W3C has an extensive technical report "XML
Japanese Profile" which lists a number of known mapping problems. Of
special interest to people with mapping problems are
Appendix C, Ambiguities in conversion from Shift-JIS to Unicode and
Appendix D, Ambiguities in conversion from Japanese EUC to Unicode.
IBM's ICU project contains many mapping tables for a variety of
standards. It is available on SourceForge. See the
ICU User Guide, particularly the section on Conversion Data. The page
Character Set Mapping Tables shows a detailed comparison between a number of
different charsets, based on data collected on different platforms.
The obsolete, unmaintained East Asian Mapping Tables on the Unicode
website also contain some notes about specific discrepancies. There is an
extensive article at Debian by Tomohiro Kubota on these problems:
Conversion tables differ between vendors. The article contains a table of
discrepancies in various Japanese encodings.
For more information on character mappings and
roundtripping issues, see
UTS #22, Character Mapping Markup Language.
[GR]
Q: Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs?
Wouldn't that save a large number of code points?
A: The Han ideographic script is largely compositional in nature. The overwhelming majority of characters created over
the centuries (and still being coined) are made by adjoining two or more old characters in simple geometric relationships. For example,
the Cantonese-specific character U+55F0 嗰 was created by adjoining the two older characters, U+53E3 口 and U+500B 個, one next to the other.
The compositional nature of the script—and, more to the point, the fact that this compositional nature is well-known—means that over time tens of thousands of ideographs have been created, and these are currently encoded in Unicode by using one code
point per ideograph. The result is that some 71,000 code points are consumed by ideographs in Unicode 5.0, nearly three-quarters of
the characters encoded.
The compositional nature of the script makes it attractive to propose a compositional encoding model, such as can be used
for Hangul. Such a mechanism would result in the savings of thousands of code points and relieve the IRG from the burden of having to
examine potential candidates for encoding.
Unfortunately, there are some difficulties involved with a compositional model for Han.
First of all, while the rules for drawing composed jamo as
Hangul syllables are relatively straightforward, those for Han
are surprisingly complex. To use U+55F0 嗰 as an example again, although it is built structurally out of two pieces, the left piece
occupies far less than 50% of the character's horizontal space. This reduction in size is a result of the nature of U+53E3 口 itself and
doesn't apply to other characters. Either the rendering process would have to be sophisticated enough to take such ideographic idiosyncrasies
into account, or the encoding model would have to provide more information than just the geometric relationship between the composing pieces.
(This is the main reason why the existing Ideographic Description Sequence mechanism is inadequate even for drawing described ideographs.)
Even more difficult is the problem of normalization, which would be necessary for operations such as comparison or searching.
A normalization algorithm would first have to parse the sequence of composing Han for validity, and then make sure that all substrings are
normalized. It should also be able to recognize a "canonical" form for a sequence of composing Han. Thus, U+55F0 嗰 could be spelled
using three pieces (U+53E3 口, U+4EBB 亻, U+56FA 固) as well as with two. Indeed, since U+4EBB 亻 is a well-known variant form of U+4EBA
人, it could be spelled using that character, as well. Providing a canonical representation would have to take these multiple spellings into account.
The open-ended nature of the script and possibilities for ambiguous spelling make it virtually impossible to guarantee that two characters
made up by two different people would be treated as equivalent even if they look exactly the same and are intended to be equivalent.
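The multiple-spelling problem can be seen with the existing Ideographic Description Characters. In this sketch, U+2FF0 (the left-to-right description character) is used to write two plausible descriptions of U+55F0; the variable names are arbitrary. Standard normalization leaves all three strings distinct, since no canonical equivalence is defined between them.

```python
import unicodedata

precomposed = "\u55F0"                            # the encoded character
two_piece = "\u2FF0\u53E3\u500B"                  # left-to-right: U+53E3, U+500B
three_piece = "\u2FF0\u53E3\u2FF0\u4EBB\u56FA"    # nested: U+53E3, U+4EBB, U+56FA

forms = [precomposed, two_piece, three_piece]

# Neither NFC nor NFKC relates any of these spellings to another.
nfc_forms = {unicodedata.normalize("NFC", f) for f in forms}
nfkc_forms = {unicodedata.normalize("NFKC", f) for f in forms}

print(len(nfc_forms), len(nfkc_forms))
```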
Other computer processes such as machine-based translation or text-to-speech would probably have to skip such characters when they
occur in plain text, because there is no simple, authoritative way for these processes to be able to determine even approximate definitions or
pronunciations from the visual representation alone. Even if the data are available, the need to parse strings of variable length before looking
them up creates complications.
Finally, East Asian governments, while aware of the compositional nature of the script, do not wish to actively encourage the coining
of new forms because of the practical problems they create. In particular, new coinages are rarely an aid to communication, since they have no obvious
inherent meaning or pronunciation. They are little more than dingbats littering otherwise intelligible text.
While the number of encodable ideographs has proven far greater than Unicode had originally anticipated, the standard is in no danger
of running out of room for them any time soon. 71,000 ideographs encoded in 17 years amounts to just over 4,000 ideographs per year. At this rate,
it would take more than two hundred years to fill up the available space in Unicode with ideographs.
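A rough check of this arithmetic, under a simplifying assumption: the "available space" is taken to be the whole Unicode code space of 17 planes of 65,536 code points, less the ideographs already encoded. Code points that are reserved or assigned to other scripts are ignored, so the result is an upper bound rather than a precise figure.

```python
ideographs_encoded = 71_000
years_elapsed = 17

per_year = ideographs_encoded / years_elapsed

# Assumption: treat the entire code space as available for ideographs.
total_code_points = 17 * 65_536
remaining = total_code_points - ideographs_encoded

years_to_fill = remaining / per_year

print(round(per_year))        # ideographs encoded per year
print(round(years_to_fill))   # years to exhaust the space at that rate
```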
And while the number of unencoded but useful ideographs is larger than originally anticipated, it is also finite and probably smaller
than the number of ideographs already encoded. The bulk of useful unencoded forms is likely to come from placenames, personal names, or characters
needed for Chinese dialects other than Mandarin and Cantonese. Many unencoded forms occurring in existing texts are actually variants of encoded
characters and would best be represented as such.
While it currently takes several years for the IRG to fully process proposed ideographs so that they can be encoded, steps are being
taken to streamline this, and further steps will be possible in the future should they prove necessary. Indeed, the bulk of the work currently
done by the IRG would still have to be done for composed ideographs in order to provide support for them beyond rendering. [JJ]
Q: Why does Unicode use the term "ideograph" when it is linguistically incorrect?
A: The characters used to write Chinese are traditionally called "Chinese characters" in the various East Asian languages
(hanzi in Mandarin, kanji in Japanese and hanja in Korean). In English, they are generally referred to by names such as "ideograph"
or "pictogram," even though these don't accurately reflect what the characters are or how they are used. Indeed, no single linguistic term
adequately describes these characters because they have such varied origins and uses. The only possible exception would be "sinogram,"
which is Latin for "Chinese character" and rarely found.
Unicode originally adopted the word "ideograph" as representing common English usage. The term is now so pervasive in the standard
that it cannot be abandoned. [JJ]