Re: decomposable Hangul jamos (was: Compression through normalization)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Dec 03 2003 - 12:12:50 EST

    Doug Ewell writes:
    > > I just have another question for Korean: many jamos are in fact
    > > composed from other jamos: this is clearly visible both in their name
    > > and in their composed glyph. What would be the linguistic impact of
    > > decomposing them (not canonically!)? Do Koreans really learn these
    > > jamos without breaking them into their components? I think here about
    > > SSANG (double) consonants, or the initial Y or final E of some
    > > vowels...
    >
    > This would be a good question for Jungshik or another native Korean. I
    > have read that Korean children learn the syllables as whole units,
    > rather than as an arrangement of jamos as I would see them, leading some
    > to think of Hangul as a featural syllabary instead of an alphabet.

    The interesting part of this question is that Unicode allows Hangul
    syllables of the form L+L+V and L+V, which can sometimes represent
    exactly the same abstract Korean grapheme cluster.

    For example, the <SSANGKIYEOK CHOSEONG> leading consonant (L) could
    naturally be decomposed as <KIYEOK CHOSEONG, KIYEOK CHOSEONG> (L+L), which would be
    interpreted in Unicode as being in the same Korean syllable, and thus
    rendered as a single (and probably identical) grapheme cluster.
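
    As a minimal sketch of the two spellings, assuming Python's standard
    unicodedata module (the variable names are only illustrative), one can
    check that NFC composes L+V into one precomposed syllable but leaves the
    first jamo of L+L+V alone, so the two sequences stay distinct:

```python
import unicodedata

KIYEOK = "\u1100"       # HANGUL CHOSEONG KIYEOK (leading consonant, L)
SSANGKIYEOK = "\u1101"  # HANGUL CHOSEONG SSANGKIYEOK (leading consonant, L)
A = "\u1161"            # HANGUL JUNGSEONG A (vowel, V)


def nfc(text: str) -> str:
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def code_points(text: str) -> list:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


single = nfc(SSANGKIYEOK + A)      # L+V
double = nfc(KIYEOK + KIYEOK + A)  # L+L+V

print(code_points(single))  # ['U+AE4C']
print(code_points(double))  # ['U+1100', 'U+AC00']
```

    Whether a renderer draws both results identically depends on the font
    and layout engine, which this sketch does not test.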

    However, Unicode does not treat this decomposition as canonically
    equivalent, nor even as compatibility equivalent. So this may leave
    room for additional folding operations for searches, which may
    be needed if some legacy charset was used to encode a text without
    the current precomposed (and currently not decomposable) double
    consonants or double vowels.
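
    Such a search folding could be sketched as follows. This is only an
    illustration, not a standard Unicode folding: the table is derived from
    the character names (SSANGxxx folded to xxx+xxx), covers only the modern
    double consonants, and the function names are hypothetical.

```python
import unicodedata


def build_ssang_fold() -> dict:
    """Map each modern SSANG jamo to its base jamo doubled, using names only."""
    fold = {}
    # Modern leading consonants (U+1100..U+1112) and trailing consonants
    # (U+11A8..U+11C2); vowels and archaic jamos are ignored in this sketch.
    for first, last in ((0x1100, 0x1112), (0x11A8, 0x11C2)):
        for cp in range(first, last + 1):
            name = unicodedata.name(chr(cp))
            if "SSANG" in name:
                base = unicodedata.lookup(name.replace("SSANG", ""))
                fold[chr(cp)] = base + base
    return fold


SSANG_FOLD = build_ssang_fold()


def fold_for_search(text: str) -> str:
    """Decompose syllables to jamos (NFD), then split double consonants."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(SSANG_FOLD.get(ch, ch) for ch in decomposed)


# Normalization alone leaves SSANGKIYEOK untouched:
print(unicodedata.normalize("NFD", "\u1101") == "\u1101")   # True
print(unicodedata.normalize("NFKD", "\u1101") == "\u1101")  # True

# After folding, the L+V and L+L+V spellings compare equal:
print(fold_for_search("\uAE4C") == fold_for_search("\u1100\u1100\u1161"))  # True
```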

    Mapping these simpler charsets to Unicode could require a difficult
    design choice in the converter. Such charsets assumed the presence of
    a more complex character layout engine to render syllables, the same
    way that Unicode assumes a composition engine for LV or LVT syllables
    that are not directly implemented as distinct glyphs in Hangul fonts:

    Should the converter recognize double vowels and double consonants
    in the legacy 8-bit charset as candidates for composition into a single
    Unicode jamo instead of two?

    Using two Unicode jamos would allow better interoperability with texts
    generally encoded with KSC5601 or Unicode. But this would break things
    if the compatibility mapping was not reversible.

    But nothing seems to forbid the mapping to separate Unicode jamos
    (thus excluding mapping to the "ligatured" double vowels or double
    consonants encoded as undecomposable jamos in Unicode), to preserve
    an exact bijective mapping to/from that legacy charset using more basic
    leading consonant, trailing consonant or vowel jamos.
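
    The two converter strategies can be compared with a small sketch. The
    8-bit charset here is purely hypothetical (the byte values 0xA1, 0xA2
    and 0xC1 are invented for illustration, as are the function names); it
    only shows why the one-to-one mapping round-trips in both directions
    while the composing mapping does not.

```python
# Hypothetical legacy charset: one byte per basic letter (invented values).
LEGACY_TO_JAMO = {
    0xA1: "\u1100",  # HANGUL CHOSEONG KIYEOK
    0xA2: "\u1103",  # HANGUL CHOSEONG TIKEUT
    0xC1: "\u1161",  # HANGUL JUNGSEONG A
}
JAMO_TO_LEGACY = {jamo: byte for byte, jamo in LEGACY_TO_JAMO.items()}

# Optional composition of doubled leading consonants into a single jamo.
COMPOSE = {"\u1100\u1100": "\u1101", "\u1103\u1103": "\u1104"}
DECOMPOSE = {single: pair for pair, single in COMPOSE.items()}


def decode(data: bytes, compose: bool) -> str:
    """Legacy bytes to Unicode, optionally merging doubled consonants."""
    text = "".join(LEGACY_TO_JAMO[b] for b in data)
    if compose:
        for pair, single in COMPOSE.items():
            text = text.replace(pair, single)
    return text


def encode(text: str, compose: bool) -> bytes:
    """Unicode to legacy bytes; the composing variant also accepts SSANG jamos."""
    if compose:
        text = "".join(DECOMPOSE.get(ch, ch) for ch in text)
    return bytes(JAMO_TO_LEGACY[ch] for ch in text)


unicode_text = "\u1100\u1100\u1161"  # <KIYEOK, KIYEOK, A> as separate jamos

# One-to-one mapping: Unicode -> legacy -> Unicode gives the text back.
print(decode(encode(unicode_text, compose=False), compose=False) == unicode_text)  # True

# Composing mapping: the same round trip returns <SSANGKIYEOK, A> instead.
print(decode(encode(unicode_text, compose=True), compose=True) == unicode_text)    # False
```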

    I think that you could even imagine an encoding where the distinction
    between leading and trailing consonants is not made, relying on the
    (unmarked) phonology of Korean to recognize syllables, exactly the
    same way as it is done in Latin (with hyphenation dictionaries), or
    using a _marked_ syllable break (mapped for example as ZWNJ in Unicode).
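
    A rough sketch of such an unmarked encoding, under a deliberately naive
    assumption: a consonant immediately followed by a vowel is a leading
    consonant, and any other consonant is a trailing one. The letters are
    given by the name parts used in Unicode character names; the function
    names are hypothetical, and real Korean text would need more rules (or
    the marked syllable break) wherever this guess is wrong.

```python
import unicodedata


def is_vowel(letter: str) -> bool:
    """True if the letter name exists as a HANGUL JUNGSEONG (vowel) jamo."""
    try:
        unicodedata.lookup(f"HANGUL JUNGSEONG {letter}")
        return True
    except KeyError:
        return False


def syllabify(letters: list) -> str:
    """Assign L/V/T roles to position-neutral letters, then compose with NFC."""
    jamos = []
    for i, letter in enumerate(letters):
        if is_vowel(letter):
            role = "JUNGSEONG"   # vowel (V)
        elif i + 1 < len(letters) and is_vowel(letters[i + 1]):
            role = "CHOSEONG"    # consonant before a vowel: leading (L)
        else:
            role = "JONGSEONG"   # any other consonant: trailing (T)
        jamos.append(unicodedata.lookup(f"HANGUL {role} {letter}"))
    return unicodedata.normalize("NFC", "".join(jamos))


word = syllabify(["HIEUH", "A", "NIEUN", "KIYEOK", "EU", "RIEUL"])
print([f"U+{ord(ch):04X}" for ch in word])  # ['U+D55C', 'U+AE00']
```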

    Similar questions arise with Unicode text using "defective" Hangul
    syllables (for example just V+T or T), sometimes made less defective
    by marking the missing L or V jamos with explicit Lf or Vf fillers,
    as <Lf,V,T> or <Lf,Vf,T>, which cannot be composed today.
    The interesting case is <Lf,Vf,T>, which will normally be rendered
    exactly as if it were a single <L> jamo, so an 8-bit charset may simply
    choose not to encode the difference between leading and final
    consonants if they are rendered the same, and if no filler is used
    in the 8-bit mapping.
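
    The point about fillers can be checked with a short sketch, again
    assuming Python's unicodedata module: NFC composes a regular <L,V,T>
    sequence into one precomposed syllable, but leaves the sequences that
    start with the fillers U+115F (Lf) and U+1160 (Vf) unchanged.

```python
import unicodedata

LF = "\u115F"        # HANGUL CHOSEONG FILLER (Lf)
VF = "\u1160"        # HANGUL JUNGSEONG FILLER (Vf)
KIYEOK_L = "\u1100"  # HANGUL CHOSEONG KIYEOK (L)
A = "\u1161"         # HANGUL JUNGSEONG A (V)
NIEUN_T = "\u11AB"   # HANGUL JONGSEONG NIEUN (T)


def nfc(text: str) -> str:
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


regular = nfc(KIYEOK_L + A + NIEUN_T)  # <L,V,T>
with_lf = nfc(LF + A + NIEUN_T)        # <Lf,V,T>
with_both = nfc(LF + VF + NIEUN_T)     # <Lf,Vf,T>

print(len(regular))    # 1 (composed into a single LVT syllable)
print(len(with_lf))    # 3 (left as three separate characters)
print(len(with_both))  # 3 (left as three separate characters)
```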

    In that case, the 8-bit mapping will really have the effect of
    representing Hangul as a true alphabet, just like the
    Latin alphabet with simple vowels and consonants, with ligatures
    created on the fly to form the printed syllables, using the
    horizontal and vertical composition rules inherent to that script,
    so that syllables are represented only graphically.

    In reality, the Hangul script seems to be really an alphabet that
    marks explicitly in the printed form the separation of effective
    syllables (as if we had to use a SHY between each syllable in
    the Latin script to print Latin text correctly).

    And neither the "johab" subset chosen by Unicode, nor even the
    "choseong", "jungseong" and "jongseong" subsets, correctly represent
    the inherent structure of the Hangul script.

    That's why I am wondering if Korean children are really learning
    the jamos the way they are shown (with ligatures) in Unicode, or
    if they don't simply learn to recognize the non-ligated forms in
    the two-dimensional syllable layout. In that case, the script is much
    simpler to learn, as it has far fewer letters than what can be seen
    in Unicode. Isn't Unicode making an unnecessarily complex
    representation of Hangul jamos?


    This archive was generated by hypermail 2.1.5 : Wed Dec 03 2003 - 13:11:08 EST