Chinese and Japanese

Q: What does the abbreviation “CJK” mean?

It is a commonly used abbreviation for “Chinese, Japanese, and Korean.” The term “CJK character” generally refers to “Chinese characters,” or more specifically, the Chinese (aka Han) ideographs used in the writing systems of the Chinese and Japanese languages, occasionally for Korean, and historically for Vietnamese. Occasionally, the abbreviation “CJKV” is used to include Vietnamese.

Q: Are Chinese characters used in Korean?

Yes, but mostly for older and traditional literary materials. Modern Korean is written almost entirely with a separate system of Hangul Syllables constructed of consonants and vowels called Jamo.
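As a brief illustration of how a Hangul Syllable is constructed from Jamo, the sketch below composes one syllable arithmetically and checks the result against Python's standard unicodedata module. The constants follow the Hangul syllable composition algorithm described in the Unicode Standard; the helper name compose_hangul is an illustrative assumption, and the function assumes its arguments are valid modern conjoining Jamo.

```python
import unicodedata

# Constants from the Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_hangul(lead, vowel, trail=None):
    """Compose one Hangul syllable from conjoining Jamo (hypothetical helper).

    No range checking is done; inputs are assumed to be valid modern Jamo.
    """
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

jamo = "\u1112\u1161\u11AB"          # leading HIEUH, vowel A, trailing NIEUN
syllable = compose_hangul(*jamo)     # U+D55C

# Normalization performs the same composition and decomposition.
same_as_nfc = unicodedata.normalize("NFC", jamo) == syllable
same_as_nfd = unicodedata.normalize("NFD", syllable) == jamo
print(f"U+{ord(syllable):04X}", same_as_nfc, same_as_nfd)
```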

Q: Where can I find out more about Hangul Syllables and Jamo for Korean?

There is a separate FAQ on Korean dealing with Hangul Syllables and Jamo.

Q: Who is responsible for future additions of CJK characters?

The development of CJK Unified Ideograph extension blocks is managed by the Ideographic Research Group (IRG), which includes experts from China, Hong Kong SAR, Macao SAR, Japan, South Korea, TCA (Taiwan Computer Association), UK, Vietnam, and the Unicode Consortium. For more information, see the IRG home page.

The IRG is very carefully cataloging, reviewing, and assessing CJK characters for inclusion into the standard. The only real limitation on the number of CJK characters in the standard is the ability of the IRG to process them, and newly proposed characters are increasingly obscure.

Q: What is the process for proposing new CJK unified ideographs?

Newly proposed CJK unified ideographs are first submitted to the IRG through national bodies or liaison organizations, and are then assembled into a new “IRG Working Set” that goes through several rounds of detailed review and scrutiny before being approved for standardization as a new CJK Unified Ideographs extension block. Individuals who wish to propose the encoding of new CJK unified ideographs are encouraged to work with their respective country’s national body.

Q: Do the different CJK font styles for different countries require multiple fonts?

Broadly speaking, there are four conventions for character shapes in East Asia: traditional Chinese (used primarily in Taiwan, Hong Kong SAR, Macao SAR, and overseas Chinese communities), simplified Chinese (used primarily in China and Singapore), Japanese, and Korean. Using a font with one glyph per code point for all four conventions allows the characters to be understandable, but some characters may look odd to readers in a particular region. For optimal results, a system localized for use in Japan, for example, should use a font designed explicitly for Japanese rather than a generic Unihan font. [JJ]

It is possible to design and develop Pan-CJK or Pan-Chinese typefaces whose fonts support multiple regional conventions by including more than one glyph per code point, and such fonts already exist. Noto Sans CJK, Noto Serif CJK, Source Han Mono, Source Han Sans, and Source Han Serif are open source Pan-CJK typefaces whose fonts support five regional conventions. PingFang, which is an Apple OS system font, is a Pan-Chinese typeface whose fonts support four regional conventions.

Q: If the character shapes are different in different parts of East Asia, why were the characters unified?

The Unicode Standard is designed to encode scripts and their characters, not their specific shapes, or glyphs. Even where there are substantial variations in the standard way of writing a character from region to region, if the fundamental identity of the character is not in question, then a single character is encoded in the standard.

This principle applies to East Asian scripts as well as to those of other parts of the world. It is well-recognized that the Han characters involved are the same, even when used in different regions to write different languages. In the overwhelming majority of cases where a Han character is written differently in different regions, users from one region would recognize the form used in another. In all cases, East Asian experts would recognize the fundamental identity of the character.

As a rule, the differences in writing style between the different East Asian regions are within the general range of allowable differences within each typographic tradition. For example, the “grass” radical, which serves as a component in thousands of CJK unified ideographs, appears as a four-stroke form (⺿) in Taiwan, Hong Kong SAR, and Macao SAR, but as a three-stroke form (⺾) in other regions.

Japanese users prefer Japanese text written with “Japanese” glyphs. Some Chinese glyphs of CJK unified ideographs are distinct enough from their Japanese equivalents to be somewhat unfamiliar to Japanese users. It is therefore advisable to use a CJK font intended for Japanese when presenting Japanese text to Japanese users. It is also possible for Japanese users to see Chinese text written with “Japanese” glyphs. For example, Japanese references that quote Chinese authors or text typically use “Japanese” glyphs, not Chinese ones.

Han Unification is intended to preserve legibility. Due to limitations in existing fonts, a rare kanji may be displayed using a Chinese glyph where a Japanese glyph would be preferred. This is a font issue, not a character encoding issue.

For more information, see Unicode Technical Note #26, On the Encoding of Latin, Greek, Cyrillic, and Han. [JJ]

Q: How can I determine whether a Unicode character is Chinese, Japanese, or Korean?

It’s largely impossible and the attempt basically meaningless. It’s the equivalent of asking if “a” is an English letter or a French one. There are some characters for which one can guess based on the source information in the Unihan Database whether they are traditional Chinese, simplified Chinese, Japanese, Korean, or even Vietnamese, but there are far too many exceptions to make this reliable.

In particular, the reading data in the Unihan Database should not be used for this purpose. A lack of reading data simply means that nobody supplied a reading, not that a reading doesn’t exist. Because updating the Unihan Database is an ongoing process, these properties will be increasingly populated as time goes on, but they should never be taken as absolutely complete.

A better solution would be to examine the text as a whole: if there’s a fair amount of kana, it’s probably Japanese, and if it’s mostly Hangul Syllables, it’s probably Korean.
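The whole-text approach can be sketched roughly as follows. This is a minimal illustration, not a reliable language detector: the function names, the returned labels, and the 10% kana threshold are assumptions made for the example, and only a few major blocks are checked (Hiragana and Katakana, Hangul Syllables, and two CJK Unified Ideographs blocks).

```python
from collections import Counter

def script_bucket(ch):
    """Classify a character into a coarse bucket by code point range."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:        # Hiragana and Katakana blocks
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3:        # Hangul Syllables
        return "hangul"
    if 0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF:
        return "han"                  # Extension A and the main URO block only
    return "other"

def guess_language(text, kana_threshold=0.1):
    """Guess from the mix of scripts; the threshold is an arbitrary assumption."""
    counts = Counter(script_bucket(ch) for ch in text)
    total = counts["kana"] + counts["hangul"] + counts["han"]
    if total == 0:
        return "unknown"
    if counts["hangul"] / total > 0.5:
        return "probably Korean"
    if counts["kana"] / total >= kana_threshold:
        return "probably Japanese"
    # Han characters alone do not identify a language.
    return "undetermined"

print(guess_language("これは日本語の文章です"))
print(guess_language("한국어 문장입니다"))
print(guess_language("中文句子"))
```

Text consisting only of Han characters is deliberately reported as undetermined, since the characters themselves do not belong to one language.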

The only proper way to infer the intended language from the text itself is to use more complex mechanisms such as language heuristics. [JJ]

Q: What is a “horizontal extension”?

This is a term of art used by the Ideographic Research Group (IRG). It refers to the process of adding a new IRG source reference to an existing CJK unified ideograph, along with a new representative glyph for the code charts that shows how the character appears in its source. This typically occurs when an existing CJK unified ideograph is found to be useful for a region or language beyond those previously reflected in the IRG source data.

Q: How does character input on a keyboard work for Chinese characters?

This is a question with a complicated answer. For answers, see How are Chinese characters input?

Q: Is it true that some characters in the GB 18030 standard must be mapped to code points in the Private Use Area?

This was true for the first two editions of the GB 18030 standard, GB 18030-2000 and GB 18030-2005. The third edition, GB 18030-2022, lifted the PUA requirement. For more information, see the article The GB 18030-2022 Standard.

Q: Are there Japanese names that cannot be expressed in Unicode?

Some Japanese family and given names are expressed using kanji that are considered unifiable variants, and are therefore unified with an existing CJK unified ideograph. A small number of these variants have been separately encoded, such as U+9AD9 髙, which is a variant of U+9AD8 高. Some have been encoded as CJK compatibility ideographs, which are affected by all four normalization forms. For example, the CJK compatibility ideograph U+FA38 器, which is among the kanji in Japan’s Jinmei-yō Kanji (人名用漢字 Chinese characters for use in personal names) list, becomes the CJK unified ideograph U+5668 器 when normalized. A substantial number of these variants can be represented in plain text either as Standardized Variation Sequences (SVSes), in the case of CJK compatibility ideographs, or as Ideographic Variation Sequences (IVSes) that have been registered according to the procedures of UTS #37, Unicode Ideographic Variation Database.
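The normalization behavior described above can be checked with a short sketch using Python's standard unicodedata module. It assumes the Python build ships Unicode data that includes U+FA38, which should hold for any current release.

```python
import unicodedata

compat = "\uFA38"     # CJK COMPATIBILITY IDEOGRAPH-FA38
unified = "\u5668"    # CJK UNIFIED IDEOGRAPH-5668

# Apply each of the four normalization forms to the compatibility ideograph.
results = {
    form: unicodedata.normalize(form, compat)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, out in results.items():
    print(form, f"U+{ord(out):04X}", out == unified)
```

Because the bare compatibility code point does not survive any normalization form, the variation sequences mentioned above are the safer plain-text representation when the distinction matters.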

It should be noted that this is not a problem of Han unification per se, as it is often perceived. The Unicode Standard is a superset of the various JIS character set standards and their legacy encodings, which share these limitations.

Q: Why didn’t the Unicode Standard adopt a compositional model for encoding Han ideographs?

The Han script is indeed compositional in nature. The overwhelming number of characters created over the centuries—and still being coined—are made by joining two or more existing characters in various geometric relationships. For example, the Cantonese-specific character U+55F0 嗰 was created by adjoining the two characters, U+53E3 口 and U+500B 個, next to each other, and U+500B 個 itself was created by similar means.

The compositional nature of the script (and, more to the point, the fact that its compositional nature is well understood) means that over time tens of thousands of ideographs have been created, and these are currently encoded in the Unicode Standard by using one code point per ideograph. The result is that tens of thousands of code points are used for the Han script in the Unicode Standard, which represent over two-thirds of the characters that have been encoded. The compositional nature of the script therefore makes it attractive to propose a compositional encoding model, similar to the one that is used to compose arbitrary Hangul Syllables from sequences of two or three Jamo. Such a mechanism would result in the savings of thousands of code points and relieve the IRG from the burden of having to examine potential candidates for encoding.

Unfortunately, there are several inherent difficulties with a compositional model for the Han script.

First of all, the rules for composing ideographs are surprisingly complex. To use U+55F0 嗰 as an example again, although it is composed of two parts, the left part occupies far less than 50% of the character’s horizontal space. This reduction in size is a result of the nature of U+53E3 口 itself and doesn’t apply to other characters. Either the rendering process would have to be sophisticated enough to take such ideographic idiosyncrasies into account, or the compositional encoding model would have to provide more information than just the geometric relationship between the composing pieces. (This is also why the existing Ideographic Description Sequence mechanism is inadequate for drawing described ideographs.) Other inherent difficulties include normalization of the compositional sequences, the ambiguous nature of the compositional sequences, and the mapping of compositional sequences to meanings or readings.
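To make the limitation of Ideographic Description Sequences concrete, the sketch below parses an IDS for U+55F0 嗰 into a tree. It handles only the operators U+2FF0 through U+2FFB, the function names are illustrative, and the further breakdown of U+500B 個 into U+4EBB 亻 and U+56FA 固 is an assumption made for the example rather than something stated above.

```python
# Ideographic Description Characters U+2FF0..U+2FFB (older operator set only).
TERNARY = {0x2FF2, 0x2FF3}
BINARY = set(range(0x2FF0, 0x2FFC)) - TERNARY

def parse_ids(ids, pos=0):
    """Parse one description starting at pos; return (tree, next_pos)."""
    if pos >= len(ids):
        raise ValueError("truncated IDS")
    ch = ids[pos]
    cp = ord(ch)
    arity = 2 if cp in BINARY else 3 if cp in TERNARY else 0
    if arity == 0:
        return ch, pos + 1            # a plain component character
    parts, pos = [], pos + 1
    for _ in range(arity):
        part, pos = parse_ids(ids, pos)
        parts.append(part)
    return (ch, *parts), pos

def parse_full(ids):
    """Parse a complete IDS and reject trailing characters."""
    tree, end = parse_ids(ids)
    if end != len(ids):
        raise ValueError("trailing characters after IDS")
    return tree

# U+55F0 described as: left-to-right(U+53E3, left-to-right(U+4EBB, U+56FA))
tree = parse_full("\u2FF0\u53E3\u2FF0\u4EBB\u56FA")
print(tree)
```

The resulting tree records only which components appear and in which geometric relationship. Nothing in it says how much horizontal space each component should occupy, which is the information a renderer would need.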

While the number of encodable ideographs has proven far greater than the Unicode Consortium had originally anticipated, the standard is in no danger of running out of unassigned code points for them any time soon. Approximately 100,000 ideographs encoded in approximately 30 years amounts to roughly 3,300 ideographs per year. At this rate, which is actually declining, it would take over 250 years to fill up the unassigned code points in the standard with ideographs.

Although the number of still unencoded but useful ideographs is larger than originally anticipated, the set is also finite and probably smaller than the number of ideographs already encoded. The bulk of them is likely to come from place names, personal names, or characters needed for Chinese dialects other than Mandarin and Cantonese. Many unencoded forms occurring in existing texts are actually variants of encoded characters and would best be represented as such. [JJ]

Q: Why does Unicode use the term “ideograph” when it is linguistically incorrect?

The characters used to write Chinese are traditionally called “Han characters” in the various East Asian languages, such as hànzì in Mandarin, kanji in Japanese, and hanja in Korean. In English, they are generally called “Chinese characters,” or are referred to by the terms “ideograph” or “pictogram,” even though these don’t accurately reflect what the characters are or how they are used. Indeed, no single linguistic term adequately describes these characters, because they have such varied origins and uses. The only possible exception would be “sinogram,” which is Latin for “Chinese character” and rarely used.

The Unicode Standard originally adopted the term “ideograph” as representing common English usage. The term is now so pervasive in the standard that it cannot be abandoned or replaced. [JJ]

Q: What’s the difference between the Unicode character properties “Ideographic” and “Unified_Ideograph”?

The Unified_Ideograph property (short name: UIdeo) is used to specify the exact set of CJK Unified Ideographs in the Unicode Standard. In other words, it applies only to CJK unified ideographs in the Han script, and not to CJK compatibility ideographs nor other characters that behave like ideographs.

The Ideographic property (short name: Ideo), on the other hand, applies to all ideographs—not just CJK unified ones. It also applies to characters, such as U+3007 〇 IDEOGRAPHIC NUMBER ZERO, which behave as though they were ideographs. Furthermore, the Ideographic property is not constrained to apply only to characters of the Han script, and therefore applies to characters of the Khitan Small Script, Nüshu, and Tangut scripts.
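Python's standard unicodedata module does not expose either property, so one way to compare them is to read the ranges from the Unicode Character Database file PropList.txt. The sketch below parses a few lines written in that file's format. The SAMPLE text is an abbreviated, hand-copied excerpt that is assumed to match a recent version of the file, and the function names are illustrative; real code should load the complete file for the Unicode version in use.

```python
# Abbreviated sample in PropList.txt format (assumed, not the full data).
SAMPLE = """\
3006          ; Ideographic
3007          ; Ideographic
4E00..9FFF    ; Ideographic
F900..FA6D    ; Ideographic
1B170..1B2FB  ; Ideographic
4E00..9FFF    ; Unified_Ideograph
FA0E..FA0F    ; Unified_Ideograph
"""

def load_ranges(text):
    """Map each property name to a list of inclusive (first, last) ranges."""
    ranges = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        span, prop = (field.strip() for field in line.split(";"))
        first, _, last = span.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        ranges.setdefault(prop, []).append((lo, hi))
    return ranges

def has_property(ranges, prop, ch):
    """Return True if ch falls in any range listed for prop."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges.get(prop, []))

RANGES = load_ranges(SAMPLE)
for ch in ("\u5668", "\u3007", "\uFA38", "\U0001B170"):
    print(f"U+{ord(ch):04X}",
          has_property(RANGES, "Ideographic", ch),
          has_property(RANGES, "Unified_Ideograph", ch))
```

With this sample, U+5668 has both properties, while U+3007, the compatibility ideograph U+FA38, and the Nüshu character U+1B170 have only Ideographic.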

Q: What’s the best way to determine the Chinese reading of an ideograph?

Most Chinese characters have only one reading in any given variety of spoken Chinese, and roughly half of those with multiple readings have only one in general use. Exceptions are common enough that text-to-speech engines dealing with runs of Chinese text must take semantics and context into account when determining their readings.

When sufficient context is unavailable, the best Mandarin reading is the one specified by the kMandarin property. This is derived algorithmically from the kHanyuPinlu, kXHC1983, and kHanyuPinyin properties, with corrections and additions provided by Chinese experts. When the kMandarin property provides multiple readings, the first one is preferred for zh-Hans (CN), and the second one is preferred for zh-Hant (TW).
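The preference rule for multiple kMandarin readings can be sketched as a small helper. This assumes the property value is a space-separated string, as in the Unihan data files; the function name pick_mandarin is hypothetical, and the readings in the usage lines are placeholders rather than real Unihan data.

```python
def pick_mandarin(kmandarin_value, locale):
    """Choose a reading from a kMandarin value for the given locale tag.

    The first reading is preferred for zh-Hans; the second, when present,
    is preferred for zh-Hant.
    """
    readings = kmandarin_value.split()
    if not readings:
        raise ValueError("empty kMandarin value")
    if locale.startswith("zh-Hant") and len(readings) > 1:
        return readings[1]
    return readings[0]

# Placeholder values, not actual readings from the Unihan Database.
print(pick_mandarin("first second", "zh-Hans-CN"))
print(pick_mandarin("first second", "zh-Hant-TW"))
print(pick_mandarin("only", "zh-Hant-TW"))
```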

In similar contexts, the best Cantonese reading is specified by the kCantonese property, which specifies the most customary Cantonese reading of a character in Jyutping romanization. For information on how this is selected, please refer to Unicode Standard Annex #38.

Q: How does this relate to the pinyin ordering and transliteration in CLDR?

The kMandarin property is used for pinyin ordering and transliteration in CLDR.