Re: Compression through normalization

From: Doug Ewell
Date: Sat Dec 06 2003 - 02:20:51 EST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > Still on the same subject, how do the old KS X standards for Hangul
    > compare with each other? If they are upward compatible, and specify that
    > the conversion from an old text not using compound letters to the new
    > standard does not mandate their composition into compound jamos, as
    > they are considered equivalent there, then there's an issue if a text
    > is converted from the old standard set of jamos to Unicode or first
    > converted to the new set and then to Unicode.
    > In that case Unicode will not treat them as canonically equivalent,
    > even though they would have been considered equivalent in the Korean
    > standards. So we will find various data containing precomposed jamos
    > for the johab set, and other syllables not using them.

    Put another way, when converting strictly from a standard such as KS X
    1001 where the consonants are not differentiated as to choseong vs.
    jongseong, the jamos will be converted to the Unicode compatibility
    characters in the U+31xx block, not the "real" Hangul Jamos block at
    U+11xx. They will thus not be canonically equivalent to either the
    U+11xx jamos or the precomposed syllables.
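
    As a rough illustration of that point, the sketch below uses Python's
    standard unicodedata module to compare the three spellings of one
    syllable. The helper name canonically_equivalent is made up for this
    example; it simply compares NFD forms.

```python
import unicodedata

compat = "\u3131\u314F"      # HANGUL LETTER KIYEOK + HANGUL LETTER A (U+31xx)
conjoining = "\u1100\u1161"  # HANGUL CHOSEONG KIYEOK + HANGUL JUNGSEONG A (U+11xx)
syllable = "\uAC00"          # precomposed HANGUL SYLLABLE GA


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The conjoining jamos and the precomposed syllable are equivalent.
print(canonically_equivalent(conjoining, syllable))
# The compatibility jamos are equivalent to neither of them.
print(canonically_equivalent(compat, conjoining))
print(canonically_equivalent(compat, syllable))
```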

    This is not a Unicode problem, however. The compatibility jamos *could
    not* have been canonical equivalents to the standard jamos, at least
    partly because of the lack of differentiation between L and T
    consonants. For example, U+3131 HANGUL LETTER KIYEOK could not possibly
    be equivalent to both U+1100 HANGUL CHOSEONG KIYEOK *and* U+11A8 HANGUL
    JONGSEONG KIYEOK.
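
    The L/T ambiguity can be observed with Python's unicodedata module.
    This is only a small sketch; the variable names are illustrative. The
    compatibility letter carries one compatibility mapping, to the leading
    form, so compatibility normalization (NFKC) cannot produce a trailing
    consonant from it.

```python
import unicodedata

kiyeok = "\u3131"  # HANGUL LETTER KIYEOK (compatibility jamo)

# The only decomposition is a compatibility one, and it points at the
# leading (choseong) form; the trailing form U+11A8 is never a target.
print(unicodedata.decomposition(kiyeok))
print(unicodedata.name("\u1100"))
print(unicodedata.name("\u11A8"))

# Spell the syllable "gak" with compatibility jamos: KIYEOK, A, KIYEOK.
compat_gak = "\u3131\u314F\u3131"
nfkc = unicodedata.normalize("NFKC", compat_gak)

# The same syllable spelled with properly segregated conjoining jamos.
proper = unicodedata.normalize("NFC", "\u1100\u1161\u11A8")

# NFKC leaves the final consonant as a separate choseong instead of
# folding it into the syllable, so the two results differ.
print([f"U+{ord(c):04X}" for c in nfkc])
print([f"U+{ord(c):04X}" for c in proper])
```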

    Unicode can only be responsible for establishing equivalences within
    Unicode, not within other character encodings. Unicode isn't even
    responsible for the accuracy or reasonableness of mappings from other
    standards, although they try.

    > So, unification of these strings will require rearranging the jamos.
    > This is an issue for converters, and still an issue within Unicode as
    > single-letter jamos are not deprecated and in fact are necessary for
    > Modern Hangul (they are not "compatibility characters" and participate
    > in the composition of johab syllables for the determination of
    > canonical equivalence).
    > If your compressor or transcoder is not allowed to perform any
    > rearrangement of jamos for modern Hangul, it should be relaxed for
    > legacy data where jamos should have been preferably precomposed before
    > being converted to Unicode. Such data will continue to persist for a
    > long time, because it seems so easy for a Korean writer to insert or
    > delete a missing single-letter jamo when performing corrections (or
    > because of an initial missing keystroke in the input method used to
    > compose the text initially).

    I don't see any reason why a reasonably smart conversion program can't
    convert legacy-encoded "generic" consonants into Unicode's segregated
    choseong and jongseong, based on the same principles that make two-set
    keyboards workable. This shouldn't be a conformance problem, because
    the legacy encodings aren't Unicode.
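
    A minimal sketch of such a converter follows, assuming the legacy text
    has already been mapped to the U+31xx compatibility jamos. The function
    name convert_compat_jamo and the three-state scheme are assumptions made
    for this example, not anything defined by Unicode or KS X 1001. The rule
    mirrors a two-set keyboard: a consonant directly followed by a vowel
    starts a syllable, and a consonant that follows a complete
    leading-plus-vowel pair closes it.

```python
import unicodedata

# Compatibility consonants U+3131..U+314E, in block order.
COMPAT_CONSONANTS = [chr(cp) for cp in range(0x3131, 0x314F)]
# Consonant clusters that never occur as a leading consonant.
NOT_CHOSEONG = set("\u3133\u3135\u3136\u313A\u313B\u313C\u313D\u313E\u313F\u3140\u3144")
# Doubled consonants that never occur as a trailing consonant.
NOT_JONGSEONG = set("\u3138\u3143\u3149")

cho_src = [c for c in COMPAT_CONSONANTS if c not in NOT_CHOSEONG]
jong_src = [c for c in COMPAT_CONSONANTS if c not in NOT_JONGSEONG]

CHOSEONG = {c: chr(0x1100 + i) for i, c in enumerate(cho_src)}    # 19 entries
JONGSEONG = {c: chr(0x11A8 + i) for i, c in enumerate(jong_src)}  # 27 entries
VOWEL = {chr(0x314F + i): chr(0x1161 + i) for i in range(21)}     # 21 entries


def convert_compat_jamo(text: str) -> str:
    """Hypothetical converter: compatibility jamos -> composed syllables.

    state 0: outside a syllable, 1: leading consonant seen,
    2: leading consonant and vowel seen.
    """
    out = []
    state = 0
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CHOSEONG and nxt in VOWEL:
            # A consonant directly before a vowel starts a new syllable.
            out.append(CHOSEONG[ch])
            state = 1
        elif ch in JONGSEONG and state == 2:
            # A consonant after leading + vowel closes the syllable.
            out.append(JONGSEONG[ch])
            state = 0
        elif ch in VOWEL and state == 1:
            out.append(VOWEL[ch])
            state = 2
        else:
            # Isolated jamos and all other characters pass through unchanged.
            out.append(ch)
            state = 0
    # NFC composes the conjoining jamos into precomposed syllables.
    return unicodedata.normalize("NFC", "".join(out))


# HIEUH, A, NIEUN, KIYEOK, EU, RIEUL as compatibility jamos.
print(convert_compat_jamo("\u314E\u314F\u3134\u3131\u3161\u3139"))
```

    The sketch deliberately leaves isolated jamos untouched, and it does not
    merge two separately stored consonants or vowels into a cluster, which a
    real converter would probably want to handle.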

    -Doug Ewell
     Fullerton, California
