RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Dec 03 2003 - 17:30:35 EST


    Jungshik Shin writes:
    > On Wed, 3 Dec 2003, Philippe Verdy wrote:
    >
    > > I just have another question for Korean: many jamos are in fact composed
    > > from other jamos: this is clearly visible both in their name
    > and in their
    > > composed glyph. What would be the linguistic impact of
    > decomposing them (not
    > > canonically!)? Do Korean really learn these jamos without
    > breaking them into
    > > their components?
    >
    > The Korean alphabet invented in 1443 and announced in 1446 included 17
    > consonants and 11 vowels. Modern Korean uses 14 consonants and 10 vowels
    > (3 consonants and 1 vowel have become obsolete).

    Very interesting, as it confirms my feeling that the Hangul script could
    have been encoded entirely as an alphabet in only two columns, including
    special symbols and punctuation.

    This matches an encoding model I saw a dozen years ago for encoding
    Chinese and Korean with very small code sets, using a separate, complex
    but implementable set of composition rules that would have allowed easy
    integration into existing 8-bit or even 7-bit technologies (for example
    in Teletex). I think this work was done for a candidate ETSI standard
    (though my memory may fail me here) to be used in TV set decoders.

    When I read these research papers, they demonstrated that the Han and
    Hangul scripts are much less complex at the abstract level than their
    written composed forms suggest, and that, depending on the composition
    capability of the renderer (or on the screen resolution), a linear
    decomposed representation was still possible and still readable,
    possibly using visible symbolic composition glyphs. For Han, I remember
    some alternate presentation forms based on linearized radicals that
    could have fitted low-resolution devices, giving results quite similar
    to approximating composed Latin with spacing rather than non-spacing
    diacritics, such as "a`" instead of "à", which is much better and more
    user-friendly than showing a null glyph for unsupported composed
    characters.
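    The spacing-diacritic fallback for Latin can be sketched in a few
    lines. This is only a minimal illustration, assuming a hand-picked
    table of combining marks (a real renderer would need a complete one):

```python
import unicodedata

# Map a few combining marks to spacing approximations.
# Illustrative subset only, not an exhaustive table.
SPACING_FALLBACK = {
    "\u0300": "`",   # COMBINING GRAVE ACCENT
    "\u0301": "'",   # COMBINING ACUTE ACCENT
    "\u0302": "^",   # COMBINING CIRCUMFLEX ACCENT
    "\u0308": '"',   # COMBINING DIAERESIS
}

def linearize(text):
    """Render composed Latin with spacing diacritics, e.g. 'à' -> 'a`'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(SPACING_FALLBACK.get(ch, ch) for ch in decomposed)

print(linearize("à la crème"))  # a` la cre`me
```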

    Even today, this analysis of the Hangul script at a very abstract level
    helps in creating convenient input methods: your count of basic letters
    shows that it becomes very easy to map them onto keyboards, with no
    complex input method to learn, as the input editor can process the
    basic input letters into standard decomposed jamos or into precomposed
    johab syllables.
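    The arithmetic such an input editor relies on is the standard Hangul
    syllable composition formula from the Unicode Standard (chapter 3); a
    minimal sketch:

```python
# Standard arithmetic composition of a Hangul LVT syllable
# (the algorithm given in The Unicode Standard, chapter 3).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_syllable(lead, vowel, trail=""):
    """Compose conjoining jamos into one precomposed Hangul syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN -> U+D55C
print(hex(ord(compose_syllable("\u1112", "\u1161", "\u11AB"))))  # 0xd55c
```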

    The other benefit is that it effectively allows efficient search and
    indexing algorithms on Hangul text, by allowing matches below the level
    of the jamos currently encoded in Unicode.
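    Such below-jamo matching can be sketched as follows, assuming a
    hand-made (and deliberately non-canonical) split table for compound
    jamos; the names and table here are illustrative, not a standard:

```python
import unicodedata

# Illustrative, deliberately NON-canonical splits of compound jamos into
# their visible components (a real table would cover every cluster jamo).
SUB_JAMO = {
    "\u1101": "\u1100\u1100",  # CHOSEONG SSANGKIYEOK -> KIYEOK + KIYEOK
    "\u1104": "\u1103\u1103",  # CHOSEONG SSANGTIKEUT -> TIKEUT + TIKEUT
    "\u116A": "\u1169\u1161",  # JUNGSEONG WA -> O + A
}

def deep_decompose(text):
    """NFD first (syllable -> jamos), then split compound jamos further."""
    jamos = unicodedata.normalize("NFD", text)
    return "".join(SUB_JAMO.get(j, j) for j in jamos)

# A query for KIYEOK+KIYEOK now matches inside U+AE4C (SSANGKIYEOK + A):
print("\u1100\u1100" in deep_decompose("\uAE4C"))  # True
```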

    > Korean 'ABC-song' enumerates them only (i.e. it doesn't include
    > cluster/complex letters.)

    That's good evidence that children can learn the Hangul script by
    recognizing this very small set of letters, separately from the 2D
    layout used to fit them into a single syllable glyph, which is
    typically committed by pressing the spacebar between syllables to
    render the composed grapheme cluster.

    > > I think here about SSANG (double) consonants, or the initial Y
    > > or final E of some vowels...
    > > Of course I won't be able to use such decomposition in Unicode,
    > > but would it be possible to use it in some private encoding
    > > created with a m:n charset mapping from/to Unicode?
    >
    > That kind of composition/decomposition is necessary for linguistic
    > analysis of Korean. Search engines (e.g. google), rendering engines
    > and incremental searches also need that.

    Unicode has promoted the use of decompositions for Latin, Greek and
    Cyrillic, but it's a shame this was not done for Hangul, and that
    multiple design errors were made which are now immutable due to the
    stability policy.

    Now, if an IDNA system is to include Hangul domain names, I do think
    these names will need to be reserved in bundles matching more strings
    than just the Unicode canonical equivalents, or even the compatibility
    equivalents. Additional decompositions will be needed.

    The same will also be necessary in the spelling checkers used in word
    processors. You point out that search engines need this too. This adds
    to the discussion about the best encoding for parsing Hangul text: it
    is likely that extended decompositions will allow matching equivalent
    text more precisely, and their recomposition into optimized Unicode
    jamos or johab syllables can be automated within editors. Other
    candidate compositions could also be looked up among the Hangul
    compatibility jamos and syllables, so that Korean text would compress
    much better than it does now with just NFC composition.
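    For example, the compatibility jamo block already carries NFKD
    mappings back to the conjoining jamos, which a matcher can exploit
    directly:

```python
import unicodedata

# HANGUL LETTER KIYEOK (compatibility block, U+3131) has a compatibility
# decomposition to HANGUL CHOSEONG KIYEOK (U+1100), so NFKD already folds
# the two spellings of the same letter together for matching purposes.
print(unicodedata.normalize("NFKD", "\u3131") == "\u1100")  # True
```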

    I already have several applications needing "custom" decompositions to
    parse text, which are not solved today by NFC/NFD or even NFKC/NFKD,
    and this may be a place where Unicode should provide support by
    defining a new set of extended decompositions (not to be used for the
    normalized forms, which are now stabilized for better or worse) for
    correct text parsing in the various languages using these scripts. It
    won't be up to ISO/IEC 10646 to define these decompositions, as its
    job is not to define properties but only to include and unify existing
    repertoires.

    If needed for linguistic processing, we may find that some characters
    should be decomposed into characters not yet encoded in the ISO/IEC
    10646 repertoire, but that is something that could be integrated in
    future revisions (so that Unicode can later refine the extended
    decompositions).






    This archive was generated by hypermail 2.1.5 : Wed Dec 03 2003 - 18:20:19 EST