Re: Compression through normalization

From: Jungshik Shin (jshin@mailaps.org)
Date: Sat Dec 06 2003 - 03:39:38 EST


    On Fri, 5 Dec 2003, Doug Ewell wrote:

    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    >
    > > Still on the same subject, how do the old KS X standards for Han[g]ul
    > > compare with each other? If they are upward compatible, and specify that
    > > the conversion from an old text not using compound letters to the new
    ....
    > > In that case Unicode will not treat them as canonically equivalent,
    > > despite they would have been considered equivalent in the Korean
    > > standards. So we will find various data containing precomposed jamos
    > > for the johad set, and other syllables not using them.

    > This is not a Unicode problem,

     I fully agree with Doug that it's NOT a Unicode problem but a problem
    that has to be dealt with while converting legacy data to Unicode.

    > Put another way, when converting strictly from a standard such as KS X
    > 1001 where the consonants are not differentiated as to choseong vs.
    > jongseong, the jamos will be converted to the Unicode compatibility
    > characters in the U+31xx block, not the "real" Hangul Jamos block at
    > U+11xx. They will thus not be canonically equivalent to either the
    > U+11xx jamos or the precomposed syllables.
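
    The non-equivalence described above can be checked with a short sketch
    using Python's standard unicodedata module. The sample code points and
    the helper name canonically_equivalent are chosen here for illustration
    only; they are not from the original message.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

compat = "\u3131\u314f"      # compatibility jamo (U+31xx block): kiyeok + a
conjoining = "\u1100\u1161"  # conjoining jamo (U+11xx block): choseong kiyeok + jungseong a
syllable = "\uac00"          # the precomposed syllable built from those two jamo

# Conjoining jamo and the precomposed syllable are canonically equivalent.
print(canonically_equivalent(conjoining, syllable))  # True

# Compatibility jamo are equivalent to neither of them.
print(canonically_equivalent(compat, conjoining))    # False
print(canonically_equivalent(compat, syllable))      # False

# Only the compatibility forms (NFKC/NFKD) fold U+31xx into U+11xx.
print(unicodedata.normalize("NFKC", compat) == syllable)  # True
```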

    > I don't see any reason why a reasonably smart conversion program can't
    > convert legacy-encoded "generic" consonants into Unicode's segregated
    > choseong and jongseong, based on the same principles that make two-set
    > keyboards workable.
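
    A minimal sketch of the kind of conversion suggested above, assuming
    the simplest two-set keyboard rule: a consonant directly followed by a
    vowel is a leading consonant, and a consonant that follows a completed
    leading+vowel pair (and is not itself followed by a vowel) is a
    trailing consonant. The function name to_conjoining and the tables are
    hypothetical, and the sketch does not try to merge separately typed
    compound vowels or compound finals.

```python
import unicodedata

# Compatibility jamo (U+31xx) usable as leading consonants, in U+1100.. order.
CHO = [0x3131, 0x3132, 0x3134, 0x3137, 0x3138, 0x3139, 0x3141, 0x3142,
       0x3143, 0x3145, 0x3146, 0x3147, 0x3148, 0x3149, 0x314A, 0x314B,
       0x314C, 0x314D, 0x314E]
# Compatibility jamo usable as trailing consonants, in U+11A8.. order.
JONG = [0x3131, 0x3132, 0x3133, 0x3134, 0x3135, 0x3136, 0x3137, 0x3139,
        0x313A, 0x313B, 0x313C, 0x313D, 0x313E, 0x313F, 0x3140, 0x3141,
        0x3142, 0x3144, 0x3145, 0x3146, 0x3147, 0x3148, 0x314A, 0x314B,
        0x314C, 0x314D, 0x314E]
V_FIRST, V_LAST = 0x314F, 0x3163  # the 21 modern compatibility vowels

def is_vowel(ch: str) -> bool:
    return bool(ch) and V_FIRST <= ord(ch) <= V_LAST

def to_conjoining(text: str) -> str:
    """Assign choseong/jongseong roles to generic jamo, then compose with NFC."""
    out = []
    state = None  # "L" after a leading consonant, "LV" after leading + vowel
    for i, ch in enumerate(text):
        cp = ord(ch)
        nxt = text[i + 1:i + 2]  # next character, or "" at the end
        if cp in CHO and is_vowel(nxt):
            out.append(chr(0x1100 + CHO.index(cp)))   # leading consonant
            state = "L"
        elif is_vowel(ch) and state == "L":
            out.append(chr(0x1161 + cp - V_FIRST))    # vowel
            state = "LV"
        elif cp in JONG and state == "LV":
            out.append(chr(0x11A8 + JONG.index(cp)))  # trailing consonant
            state = None
        else:
            out.append(ch)  # isolated jamo or non-Hangul: keep verbatim
            state = None
    return unicodedata.normalize("NFC", "".join(out))

# hieuh, a, nieun, kiyeok, eu, rieul as compatibility jamo
print(to_conjoining("\u314e\u314f\u3134\u3131\u3161\u3139") == "\ud55c\uae00")  # True
```

    Jamo that do not fall into a leading+vowel(+trailing) pattern are left
    as compatibility characters rather than guessed at.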

      Actually, KS X 1001 has a provision (not widely implemented) that
    allows KS X 1001-based character encodings such as EUC-KR and
    ISO-2022-KR to represent, as 8-byte sequences, the 8,822 syllables
    not listed in KS X 1001 as precomposed forms, incomplete syllables
    that begin with 'filler', and isolated leading consonants, vowels,
    and trailing consonants. See the CJK section of the Unicode FAQ at
    http://www.unicode.org/faq/han_cjk.htm

      Q: When mapping to KS X 1001-based MBCS character encodings, how should
      I map the 8,822 Unicode Hangul syllables not covered by KS X 1001?

    IMHO, sequences that don't fit this 8-byte sequence pattern should
    simply be converted to Hangul Compatibility Jamo 'verbatim'.
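
    For illustration, here is a rough sketch of building such an 8-byte
    sequence for a Unicode Hangul syllable. It assumes the layout is the
    Hangul filler (0xA4D4 in EUC-KR) followed by the leading consonant,
    vowel, and trailing consonant as 2-byte compatibility jamo, with the
    filler standing in for a missing trailing consonant. That layout, the
    function name makeup_sequence, and the tables are assumptions of this
    sketch and should be checked against the standard.

```python
# Compatibility jamo code points indexed by leading-consonant number (0..18).
CHO = [0x3131, 0x3132, 0x3134, 0x3137, 0x3138, 0x3139, 0x3141, 0x3142,
       0x3143, 0x3145, 0x3146, 0x3147, 0x3148, 0x3149, 0x314A, 0x314B,
       0x314C, 0x314D, 0x314E]
FILLER = 0x3164  # HANGUL FILLER
# Indexed by trailing-consonant number (0..27); 0 means "no trailing consonant".
JONG = [FILLER,
        0x3131, 0x3132, 0x3133, 0x3134, 0x3135, 0x3136, 0x3137, 0x3139,
        0x313A, 0x313B, 0x313C, 0x313D, 0x313E, 0x313F, 0x3140, 0x3141,
        0x3142, 0x3144, 0x3145, 0x3146, 0x3147, 0x3148, 0x314A, 0x314B,
        0x314C, 0x314D, 0x314E]

def euckr_jamo(cp: int) -> bytes:
    """EUC-KR bytes for a compatibility jamo in U+3131..U+3164 (KS X 1001 row 4)."""
    return bytes([0xA4, 0xA1 + (cp - 0x3131)])

def makeup_sequence(syllable: str) -> bytes:
    """Build filler + leading + vowel + trailing for one precomposed syllable."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 11172:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, 21 * 28)
    vowel, trail = divmod(rest, 28)
    parts = [FILLER, CHO[lead], 0x314F + vowel, JONG[trail]]
    return b"".join(euckr_jamo(cp) for cp in parts)

# U+B620 decomposes into ssangtikeut + o + mieum.
print(makeup_sequence("\ub620").hex())  # a4d4a4a8a4c7a4b1
```

    A real converter would first try the direct 2-byte KS X 1001 mapping
    and fall back to this form only for the syllables that lack one.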

      Jungshik



    This archive was generated by hypermail 2.1.5 : Sat Dec 06 2003 - 04:35:16 EST