Re: Korean syllable decomposition (was: CJK combining components)

From: Jungshik Shin (jshin@pantheon.yale.edu)
Date: Thu Oct 19 2000 - 00:22:57 EDT


> Jungshik Shin wrote:
> > On Tue, 17 Oct 2000 11digitboy@bolt.com wrote:
> > > So, do they have a table that says "This hangul syllable
> > > is made up of components X, Y, and Z"?) Maybe Unicode
> > > should have one.
> >
> > Well, Unicode will never have one for dynamic glyph composition of Hangul
> > syllables ;-) because there are so many possibilities (how many different
> > sets of glyphs to use for initial consonants, medial vowels and final
> > consonants. The higher quality you want to get, the more sets you need).

> note that in fact the composition and decomposition of
> hangul syllables to and from jamos is algorithmic in unicode
> and does not need a table. you will find all the details at

Well, you and I are talking about two completely different things :-). Do
you really think I'm not aware of what you're talking about, as a native
Korean speaker who got 15 years of education in Korea, where the invention
of Hangul used to be honored by a national holiday? (It's October 9th.
Well, I think it smells a bit of nationalistic zeal, and I'm not so fond
of some Koreans who believe, without thinking hard, that Hangul is the best
script on earth.) BTW, the way Hangul syllables are decomposed in Unicode
is not the only way, and it could be argued that decomposing some of the
complex jamos into even more elemental components (at the extreme, one
may need only a dozen jamos) is desirable, as there is at least one vowel
in modern Korean which cannot be represented using the Jamos in the U1100
block (which means it cannot be represented with precomposed syllables
either).
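
For concreteness, the algorithmic decomposition that the quoted reply
refers to is plain arithmetic on code points. Here is a minimal sketch in
Python. The constants are the ones given for the Hangul syllable algorithm
in the Unicode Standard; the function name decompose_syllable is made up
for this illustration and is not part of any library.

```python
# Constants from the Hangul syllable algorithm in the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per initial consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def decompose_syllable(ch):
    """Return the jamo sequence (L, V and optional T) for one syllable.

    Hypothetical helper for illustration. Characters that are not
    precomposed Hangul syllables are returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    l_index = s_index // N_COUNT              # initial consonant
    v_index = (s_index % N_COUNT) // T_COUNT  # medial vowel
    t_index = s_index % T_COUNT               # final consonant, 0 = none
    jamos = chr(L_BASE + l_index) + chr(V_BASE + v_index)
    if t_index != 0:
        jamos += chr(T_BASE + t_index)
    return jamos


if __name__ == "__main__":
    # Print the code points of the jamos for U+D55C.
    print([hex(ord(j)) for j in decompose_syllable("\uD55C")])
```

Note that this yields only a sequence of jamo code points. It says nothing
about which glyph to draw for each jamo.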

Anyway, the point at issue here is that just decomposing Hangul syllables
into Hangul Jamos is not enough *by any means* for dynamic composition of
glyphs for the syllables so decomposed, unless you're satisfied with the
non-square (ugly) glyphs used by old Korean typewriters and telex machines.
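
To make the point concrete: the glyph for an initial consonant has to
change with the shape of the vowel next to it and with whether a final
consonant follows. The Python sketch below picks a glyph *set* for the
initial consonant of a precomposed syllable. The three vowel shape classes
and the numbering into 6 sets are assumptions invented for this example.
They are NOT the tables used by Hanterm, Mozilla or any real font, which
use more sets for better quality.

```python
# Vowel indices follow the Unicode order of the 21 modern medial vowels
# (U+1161..U+1175). The shape classes and set numbers are illustrative
# assumptions, not the tables of any real font.
VERTICAL = {0, 1, 2, 3, 4, 5, 6, 7, 20}  # vowel sits to the right of the initial
HORIZONTAL = {8, 12, 13, 17, 18}         # vowel sits below the initial
# All other vowels (9, 10, 11, 14, 15, 16, 19) wrap below and to the right.


def vowel_shape_class(v_index):
    """Hypothetical classification: 0 vertical, 1 horizontal, 2 mixed."""
    if v_index in VERTICAL:
        return 0
    if v_index in HORIZONTAL:
        return 1
    return 2


def initial_glyph_set(v_index, has_final):
    """Pick one of 6 hypothetical glyph sets for the initial consonant."""
    return vowel_shape_class(v_index) * 2 + (1 if has_final else 0)


def initial_glyph_for_syllable(ch):
    """Return (initial consonant index, glyph set) for one syllable."""
    s_index = ord(ch) - 0xAC00
    if not 0 <= s_index < 11172:
        raise ValueError("not a precomposed Hangul syllable")
    l_index = s_index // 588          # 588 = 21 vowels * 28 finals
    v_index = (s_index % 588) // 28
    t_index = s_index % 28            # 0 means no final consonant
    return l_index, initial_glyph_set(v_index, t_index != 0)
```

With these made-up sets, the same initial consonant in U+AC00 and U+ACF0
ends up in two different glyph sets, which is exactly the information that
the jamo sequence by itself does not carry.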

This is even more true of Hanzi/Kanji/Hanja. Just decomposing them into
components is one thing; making *multiple* glyphs for each and every
component and figuring out which glyph to use for a particular component
in a particular Hanzi/Kanji/Hanja character is another. Please note that
a single component can take many different sizes and 'shapes' (of course,
topologically - actually a little bit more than that - all of the multiple
glyphs are equal) depending on which character it's used in. (Here's a
random example - I just opened the Unicode 3.0 book. U4595, U4598, U4599,
U459A, and U459C share the radical U2EC1, but the shape of U2EC1 is
different in each of those characters.)

It's not trivial to automate this process, although I suspect Asian
foundries use some automation in this regard (and then they have to do
a lot of manual refinement...)

It'd have been nice if you had read a little more carefully what I wrote
in the part of my message you didn't quote (or I could have been a bit
clearer). Please note that I wrote about 10 *sets* of initial
consonants and likewise *multiple* *sets* of glyphs for medial vowels
and final consonants used by Hanterm. If you still don't understand what
I mean, you may wish to read the source file nsUnicodeToX11Johab.cpp of
Mozilla, which you can easily find at www.mozilla.org (the cross-referenced,
searchable source code of Mozilla is available there) or, better, install
one of the "X11 Hangul Johab fonts" and view it with xfd in X11. It's at

http://samwise.kaist.ac.kr/hanterm/download/font/hanterm-font-3.1pre3.tar.gz

Alternatively, you can install a Korean TrueType font (e.g. gulim,
included in MS Global IME with the language pack for Korean) and view
the internals of the font. It uses dozens of *sets* of glyphs for initial
consonants alone.

Jungshik Shin


