Re: decomposable Hangul jamos (was: Compression through normalization)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Dec 03 2003 - 12:12:50 EST

Next message: Peter Constable: "RE: Free Fonts"

Previous message: Peter Constable: "RE: MS Windows and Unicode 4.0 ?"
In reply to: Doug Ewell: "Re: Compression through normalization"
Next in thread: Jungshik Shin: "Re: Compression through normalization"

Doug Ewell writes:
> > I just have another question for Korean: many jamos are in fact
> > composed from other jamos: this is clearly visible both in their name
> > and in their composed glyph. What would be the linguistic impact of
> > decomposing them (not canonically!)? Do Korean really learn these
> > jamos without breaking them into their components? I think here about
> > SSANG (double) consonnants, or the initial Y or final E of some
> > vowels...
>
> This would be a good question for Jungshik or another native Korean. I
> have read that Korean children learn the syllables as whole units,
> rather than as an arrangement of jamos as I would see them, leading some
> to think of Hangul as a featural syllabary instead of an alphabet.

The interesting part of this question is that Unicode allows Hangul
syllables of the form L+L+V and L+V, which can sometimes represent
exactly the same abstract Korean grapheme cluster.

For example, the <SSANGKIYEOK CHOSEONG> leading consonant (L) would
naturally decompose as <KIYEOK CHOSEONG, KIYEOK CHOSEONG> (L+L), which
Unicode would interpret as belonging to the same Korean syllable, and
thus render as a single (and probably identical) grapheme cluster.

However, Unicode does not treat this decomposition as canonically
equivalent, nor even as compatibility equivalent. So this may leave
room for additional folding operations for searches, which may be
needed if some legacy charset was used to encode a text without
the current precomposed (and currently not decomposable) double
consonants or double vowels.
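
To make this concrete, here is a minimal Python sketch using the
standard unicodedata module. It checks that no normalization form
relates U+1101 to a doubled U+1100, then applies a search folding that
splits the double leading consonants. The function name
fold_double_choseong is hypothetical, and restricting the folding to
the modern leading consonants U+1100..U+1112 is an assumption of this
sketch, not something Unicode defines.

```python
import unicodedata

SSANGKIYEOK = "\u1101"  # HANGUL CHOSEONG SSANGKIYEOK
KIYEOK = "\u1100"       # HANGUL CHOSEONG KIYEOK
PREFIX = "HANGUL CHOSEONG SSANG"


def fold_double_choseong(text):
    """Hypothetical search folding: split SSANG* leading consonants.

    The single jamo is found through the character name, since Unicode
    provides no decomposition mapping for these characters.
    """
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        # Assumption: only the modern leading consonants are folded.
        if "\u1100" <= ch <= "\u1112" and name.startswith(PREFIX):
            single = unicodedata.lookup("HANGUL CHOSEONG " + name[len(PREFIX):])
            out.append(single * 2)
        else:
            out.append(ch)
    return "".join(out)


# None of the four normalization forms changes the double jamo.
unchanged = all(
    unicodedata.normalize(form, SSANGKIYEOK) == SSANGKIYEOK
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print(unchanged, repr(unicodedata.decomposition(SSANGKIYEOK)))
print(fold_double_choseong(SSANGKIYEOK + "\u1161") == KIYEOK * 2 + "\u1161")
```

A search could apply the same folding to both the query and the text,
so that either spelling of the leading consonant matches the other.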

Mapping these simpler charsets raises a difficult design choice for
the converter. Such charsets assumed a more complex character layout
engine to render syllables, in the same way that Unicode assumes a
composition engine for LV or LVT syllables that are not directly
implemented as distinct glyphs in Hangul fonts:

Should the converter recognize double vowels and double consonants
in the legacy 8-bit charset as candidates for composition into a single
Unicode jamo instead of two?

Using two Unicode jamos would allow better interoperability with texts
generally encoded with KSC5601 or Unicode. But this would break things
if the compatibility mapping was not reversible.

But nothing seems to forbid mapping to separate Unicode jamos
(thus excluding any mapping to the "ligatured" double vowels or double
consonants encoded as undecomposable jamos in Unicode), to preserve
an exact bijective mapping to and from that legacy charset using more
basic leading consonant, trailing consonant, or vowel jamos.
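
As a rough illustration of such a bijective mapping, the sketch below
converts a made-up 8-bit charset to basic Unicode jamos and back. The
byte values in LEGACY_TO_UNICODE are entirely hypothetical and do not
correspond to any real legacy charset; only the Unicode code points
are real.

```python
# Entirely hypothetical 8-bit assignments, for illustration only.
LEGACY_TO_UNICODE = {
    0xA1: "\u1100",  # HANGUL CHOSEONG KIYEOK
    0xA2: "\u1109",  # HANGUL CHOSEONG SIOS
    0xC1: "\u1161",  # HANGUL JUNGSEONG A
}
# The table is one-to-one, so inverting it loses nothing.
UNICODE_TO_LEGACY = {ch: byte for byte, ch in LEGACY_TO_UNICODE.items()}


def decode(data):
    """Map each legacy byte to exactly one basic Unicode jamo."""
    return "".join(LEGACY_TO_UNICODE[b] for b in data)


def encode(text):
    """Map each basic Unicode jamo back to its legacy byte."""
    return bytes(UNICODE_TO_LEGACY[ch] for ch in text)


legacy = bytes([0xA1, 0xA1, 0xC1])  # doubled KIYEOK followed by A
text = decode(legacy)
# The doubled consonant stays as two U+1100, never as U+1101.
print([hex(ord(ch)) for ch in text])
print(encode(text) == legacy)
```

Because the converter never produces U+1101, the round trip is exact.
The cost is that the result will not compare equal to text that uses
the precomposed double jamo unless some folding is applied.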

I think that you could even imagine an encoding where the distinction
between leading and trailing consonants is not made, relying on the
(unmarked) phonology of Korean to recognize syllables, exactly the
same way as it is done in Latin (with hyphenation dictionaries), or
using a _marked_ syllable break (mapped for example as ZWNJ in Unicode).

Similar questions arise with Unicode text using "defective" Hangul
syllables (for example just V+T or T), sometimes made less defective
by marking the missing L or V jamos with explicit Lf or Vf fillers,
as <Lf,V,T> or <Lf,Vf,T>, which cannot be composed today.
The interesting case is <Lf,Vf,T>, which will normally be rendered
exactly as if it were a single <L> jamo, so an 8-bit charset may simply
choose not to encode the difference between leading and final
consonants if they are rendered the same, and if no filler is used
in the 8-bit mapping.
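
The difference between a complete syllable and a filler-marked one can
be checked with a short Python sketch (standard unicodedata module
only). The variable names are mine; the fillers are U+115F HANGUL
CHOSEONG FILLER and U+1160 HANGUL JUNGSEONG FILLER.

```python
import unicodedata

LF = "\u115f"        # HANGUL CHOSEONG FILLER (Lf)
VF = "\u1160"        # HANGUL JUNGSEONG FILLER (Vf)
L_KIYEOK = "\u1100"  # HANGUL CHOSEONG KIYEOK
V_A = "\u1161"       # HANGUL JUNGSEONG A
T_KIYEOK = "\u11a8"  # HANGUL JONGSEONG KIYEOK

# A complete <L,V,T> sequence composes into one precomposed syllable.
full = unicodedata.normalize("NFC", L_KIYEOK + V_A + T_KIYEOK)

# Sequences that rely on fillers are left as separate jamos by NFC.
lf_v_t = unicodedata.normalize("NFC", LF + V_A + T_KIYEOK)
lf_vf_t = unicodedata.normalize("NFC", LF + VF + T_KIYEOK)

print(len(full), len(lf_v_t), len(lf_vf_t))
```

Only the first sequence shrinks to a single code point; the two
filler-based sequences keep their three code points under NFC.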

In that case, the 8-bit mapping would really have the effect of
representing Hangul as a true alphabet, exactly like the Latin
alphabet, with simple vowels and consonants, and with ligatures
created on the fly to form the printed syllables, using the
horizontal and vertical composition rules inherent to that script
to represent the syllables only graphically.
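
One way to picture this alphabetic view is to decompose a precomposed
syllable and drop the positional part of each jamo name, so that the
leading and trailing KIYEOK become the same letter. This is only a
sketch: the function name alphabet_view is hypothetical, and using
the last word of the Unicode character name as the "letter" is my own
shortcut, not an established transliteration.

```python
import unicodedata

POSITIONS = ("CHOSEONG", "JUNGSEONG", "JONGSEONG")


def alphabet_view(text):
    """Return position-free letter names for the jamos in text."""
    letters = []
    for ch in unicodedata.normalize("NFD", text):
        parts = unicodedata.name(ch, "").split()
        if len(parts) == 3 and parts[0] == "HANGUL" and parts[1] in POSITIONS:
            letters.append(parts[2])  # keep only the letter name
        else:
            letters.append(ch)        # leave non-jamo characters alone
    return letters


# U+AC01 is the syllable <KIYEOK, A, KIYEOK>.
print(alphabet_view("\uac01"))
```

In this view the leading and final consonant of U+AC01 are
indistinguishable, which is exactly the information such an 8-bit
charset would leave to the rendering engine.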

In reality, the Hangul script seems to be really an alphabet that
marks explicitly in the printed form the separation of effective
syllables (as if we had to use a SHY between each syllable in
the Latin script to print Latin text correctly).

And neither the "johab" subset chosen by Unicode, nor even the
"choseong", "jungseong" and "jongseong" subsets, correctly represents
the inherent structure of the Hangul script.

That's why I am wondering whether Korean children really learn
the jamos the way they are shown (with ligatures) in Unicode, or
whether they simply learn to recognize the non-ligated forms in
the two-dimensional syllable layout. In that case, the script is much
simpler to learn, as it has far fewer letters than what can be seen
in Unicode. Isn't Unicode making an unnecessarily complex
representation of Hangul jamos?

This archive was generated by hypermail 2.1.5 : Wed Dec 03 2003 - 13:11:08 EST