Q: How do Korean letters work in Unicode?
There are four main types of encoded Korean letters:
- Jamo (conjoining)
- Hangul Syllables
- compatibility Jamo
- half-width Jamo
Note that the last two types, compatibility Jamo and half-width Jamo, are present for compatibility with legacy code pages, and are not required for the representation of Korean.
Q: What are Hangul Syllables?
There are 11,172 Hangul Syllables that are directly encoded as a compact representation of certain sequences of Jamo. They can be fundamentally thought of as composite characters. In practice, they are the main characters in actual use, but from an implementation point of view, they are simply precomposed sequences, and are treated as such during normalization and other processing.
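As a small illustration of the "precomposed sequence" view, the following sketch uses Python's standard `unicodedata` module to decompose one Hangul Syllable into conjoining Jamo and compose it back. The sample syllable (U+D55C) and the variable names are chosen here for illustration only.

```python
import unicodedata

syllable = "\uD55C"  # one precomposed Hangul Syllable (sample chosen for illustration)

# NFD replaces the syllable with its conjoining Jamo sequence <L, V, T>.
decomposed = unicodedata.normalize("NFD", syllable)
jamo_code_points = [f"U+{ord(ch):04X}" for ch in decomposed]

# NFC recomposes the Jamo sequence into the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)

print(jamo_code_points)        # the L, V, and T code points
print(recomposed == syllable)  # the round trip preserves the text
```

The two forms are canonically equivalent, so comparison and searching would normally be done on text that has been normalized to one form or the other.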
Q: How are Jamo used?
Conjoining Jamo are divided into three classes: L, V, T (Leading consonant, Vowel, Trailing consonant). A Hangul Syllable consists of <LV> or <LVT> sequences. As long as text is represented as sequences, such as < L, V, L, V, T, L, V, T, L, V >, there is no issue. If isolated jamo, such as only L, V, or T, are to be represented, there are two ways to do it:
- Simply use L, V, or T on their own, but L must not be followed by V, V must not be preceded by L, and T must not be preceded by V.
- Use a sequence with explicit filler Jamo, Lf (U+115F HANGUL CHOSEONG FILLER) and/or Vf (U+1160 HANGUL JUNGSEONG FILLER), to form a complete syllable: <L, Vf>, <Lf, V>, or <Lf, Vf, T>.
Jamo sequences are unlike combining sequences: the L are not “base characters” and the V or T are not “combining marks”. For that reason the Jamo are called conjoining.
Q: How many possible Jamo sequences exist?
1,638,750. This figure can be broken down into 11,875 <L, V> (125 L × 95 V) sequences plus 1,626,875 <L, V, T> (125 L × 95 V × 137 T) sequences. In terms of the number of each Jamo, there are 125 L (1100..115F and A960..A97C), 95 V (1160..11A7 and D7B0..D7C6), and 137 T (11A8..11FF and D7CB..D7FB).
The 11,172 Hangul Syllables represent a very small subset of this large figure, and can be broken down into 399 <LV> (19 L × 21 V) sequences plus 10,773 <LVT> (19 L × 21 V × 27 T) sequences. The Hangul Syllables are composed of the following subsets of Jamo: 19 L (1100..1112), 21 V (1161..1175), and 27 T (11A8..11C2). [KL]
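The arithmetic above can be checked from the code point ranges. The sketch below also shows the arithmetic mapping from modern <L, V, T> indices to a Hangul Syllable code point; that mapping comes from the Unicode Standard's Hangul composition algorithm rather than from this FAQ, and the helper names `span` and `compose` are hypothetical.

```python
import unicodedata

def span(lo, hi):
    """Number of code points in the inclusive range lo..hi."""
    return hi - lo + 1

# All conjoining Jamo, using the ranges quoted above.
n_l = span(0x1100, 0x115F) + span(0xA960, 0xA97C)
n_v = span(0x1160, 0x11A7) + span(0xD7B0, 0xD7C6)
n_t = span(0x11A8, 0x11FF) + span(0xD7CB, 0xD7FB)
total_sequences = n_l * n_v + n_l * n_v * n_t

# The modern subset behind the precomposed Hangul Syllables.
m_l, m_v, m_t = span(0x1100, 0x1112), span(0x1161, 0x1175), span(0x11A8, 0x11C2)
total_syllables = m_l * m_v + m_l * m_v * m_t

def compose(l, v, t=None):
    """Map modern Jamo to a syllable arithmetically (assumes modern L, V, T)."""
    l_index = ord(l) - 0x1100
    v_index = ord(v) - 0x1161
    # T index 0 means "no trailing consonant", hence 27 + 1 = 28 slots per <LV>.
    t_index = 0 if t is None else ord(t) - 0x11A7
    return chr(0xAC00 + (l_index * m_v + v_index) * (m_t + 1) + t_index)

print(n_l, n_v, n_t, total_sequences)
print(m_l, m_v, m_t, total_syllables)
```

For modern Jamo, `compose` should agree with NFC normalization of the same sequence, which is one way to sanity-check an implementation.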
Q: How are conjoining Jamo supported in fonts?
While the actual implementation details are beyond the scope of this FAQ, the 'ccmp' (Glyph Composition/Decomposition), 'ljmo' (Leading Jamo Forms), 'vjmo' (Vowel Jamo Forms), and 'tjmo' (Trailing Jamo Forms) OpenType features are used to support conjoining Jamo in fonts. The fonts for the open source Source Han and Noto CJK Pan-CJK typefaces are examples of conjoining Jamo implementations. [KL]
Q: To what extent are conjoining Jamo supported in apps?
While the 'ccmp' (Glyph Composition/Decomposition) OpenType feature is broadly supported in apps, support for the 'ljmo' (Leading Jamo Forms), 'vjmo' (Vowel Jamo Forms), and 'tjmo' (Trailing Jamo Forms) OpenType features is not nearly as broad, although modern browsers generally support them. [KL]
Q: Are Hangul Syllables and Jamo ever mixed?
Text that is normalized to NFD will contain only Jamo. Typically, general text that is either unnormalized or normalized to NFC consists mostly of Hangul Syllables. However, Jamo could occur in certain circumstances:
- isolated Jamo
- pre-1933 orthography Korean text
- incomplete Hangul Syllables (for example, syllables without an L as used in dictionaries and grammar books)
- Jamo used for a more faithful phonetic representation of some Korean dialects
In the fourth case, there are two possibilities. If either the L or the V is a non-modern Jamo, then the entire syllable would consist of Jamo. If both L and V are modern Jamo, but the T is non-modern, then the syllable would be represented by a sequence of two characters: a single code point for <LV>, followed by a T: <LV, T>.
This is similar to Latin. The NFC form of A + grave + umlaut is <A-grave, umlaut>; the first part is precomposed and the remainder is not. [JS]
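Both behaviors can be observed with Python's standard `unicodedata` module. In this sketch, U+11C3 is used as a sample T outside the modern range 11A8..11C2; the variable names are for illustration only.

```python
import unicodedata

# Modern L and V followed by a non-modern T (U+11C3 lies outside 11A8..11C2).
hangul = unicodedata.normalize("NFC", "\u1100\u1161\u11C3")

# Latin parallel: A + combining grave + combining diaeresis (umlaut).
latin = unicodedata.normalize("NFC", "A\u0300\u0308")

# In both cases only the leading part has a precomposed form.
print([f"U+{ord(ch):04X}" for ch in hangul])
print([f"U+{ord(ch):04X}" for ch in latin])
```

In each result the first code point is precomposed (<LV> or A-grave) and the trailing character is left as it was.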
Q: Does this make any difference in how a Jamo sequence should be displayed?
Whether a Jamo sequence is represented in the form <L, V>, <LVT>, or <LV, T>, it should be displayed as though it were a Hangul Syllable.
Q: How should non-standard syllables be displayed?
- An L that is not followed by a V should be displayed as if it were the sequence <L, Vf>
- A V that is not preceded by an L should be displayed as if it were the sequence <Lf, V>
- A T that is not preceded by <L, V> or <LV> should be displayed as if it were the sequence <Lf, Vf, T>
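The three rules above can be sketched as a preprocessing step that inserts fillers before text is handed to a renderer. This is a minimal, literal reading of the rules, not a description of how any particular rendering engine works. The function names `jamo_class` and `with_fillers` are hypothetical, and the class ranges are the ones quoted earlier in this FAQ.

```python
LF, VF = "\u115F", "\u1160"  # HANGUL CHOSEONG FILLER, HANGUL JUNGSEONG FILLER

def jamo_class(ch):
    """Return 'L', 'V', 'T', 'LV', 'LVT', or None for non-Hangul characters."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        # Every 28th precomposed syllable has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None

def with_fillers(text):
    """Insert Lf/Vf so that every isolated Jamo forms a complete syllable."""
    out = []
    prev = None  # class of the previous character
    for i, ch in enumerate(text):
        cls = jamo_class(ch)
        nxt = jamo_class(text[i + 1]) if i + 1 < len(text) else None
        if cls == "V" and prev != "L":
            out.append(LF)       # V not preceded by L: <Lf, V>
        elif cls == "T" and prev not in ("V", "LV"):
            out.append(LF + VF)  # T not preceded by <L, V> or <LV>: <Lf, Vf, T>
        out.append(ch)
        if cls == "L" and nxt != "V":
            out.append(VF)       # L not followed by V: <L, Vf>
        prev = cls
    return "".join(out)
```

Well-formed sequences such as <L, V, T> or <LV, T> pass through unchanged; only isolated Jamo pick up fillers.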
Q: When mapping to KS X 1001 (formerly known as KS C 5601), how should I handle conjoining Jamo?
The easiest approach is to first convert the text using NFC, then convert any remaining conjoining Jamo to compatibility Jamo. For example, U+1100 ᄀ HANGUL CHOSEONG KIYEOK converts to U+3131 ㄱ HANGUL LETTER KIYEOK. The filler Jamo, Lf and Vf, can simply be removed.
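A minimal sketch of that approach follows. The mapping table is deliberately a tiny illustrative subset (a real converter would need an entry for every conjoining Jamo that has a compatibility counterpart), and the names `CONJOINING_TO_COMPAT` and `prepare_for_ks_x_1001` are hypothetical.

```python
import unicodedata

LF, VF = "\u115F", "\u1160"  # filler Jamo, dropped after normalization

# Illustrative subset only, not a complete mapping table.
CONJOINING_TO_COMPAT = {
    "\u1100": "\u3131",  # CHOSEONG KIYEOK  -> HANGUL LETTER KIYEOK
    "\u1102": "\u3134",  # CHOSEONG NIEUN   -> HANGUL LETTER NIEUN
    "\u1161": "\u314F",  # JUNGSEONG A      -> HANGUL LETTER A
    "\u11A8": "\u3131",  # JONGSEONG KIYEOK -> HANGUL LETTER KIYEOK
}

def prepare_for_ks_x_1001(text):
    """NFC first, then map leftover conjoining Jamo and drop fillers."""
    composed = unicodedata.normalize("NFC", text)
    out = []
    for ch in composed:
        if ch in (LF, VF):
            continue  # fillers have no place in the legacy encoding
        out.append(CONJOINING_TO_COMPAT.get(ch, ch))
    return "".join(out)
```

Normalizing before removing the fillers matters: dropping Vf first could let an isolated L compose with an unrelated V that follows it. The returned string would still need to be passed to an actual KS X 1001 encoder, for instance a codec such as Python's `euc_kr`.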
Q: Why are the KS X 1001 (and KS C 5601) mapping tables in the Public directory on the Unicode site in an “OBSOLETE” directory?
Those mapping tables are placed in the OBSOLETE directory because they are of historical interest. They may not exactly reflect current mapping implementation practice in all cases. See the Conversions / Mappings FAQ for discussion and alternatives for East Asian legacy character set mapping.