Re: Compression through normalization

From: Jungshik Shin (jshin@mailaps.org)
Date: Sat Dec 06 2003 - 03:39:38 EST

Next message: Philippe Verdy: "RE: Compression through normalization"

Previous message: Don Osborn: "Re: Missing African Latin letters"
In reply to: Doug Ewell: "Re: Compression through normalization"
Next in thread: Doug Ewell: "Re: Compression through normalization"

On Fri, 5 Dec 2003, Doug Ewell wrote:

> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>
> > Still in the same subject, how do the old KSX standards for Han[g]ul
> > compare with each other? If they are upward compatible, and specify that
> > the conversion from an old text not using compound letters to the new
....
> > In that case Unicode will not treat them as canonically equivalent,
> > despite they would have been considered equivalent in the Korean
> > standards. So we will find various data containing precomposed jamos
> > for the johad set, and other syllables not using them.

> This is not a Unicode problem,

I fully agree with Doug that it's NOT a Unicode problem but a problem
that has to be dealt with while converting legacy data to Unicode.

> Put another way, when converting strictly from a standard such as KS X
> 1001 where the consonants are not differentiated as to choseong vs.
> jongseong, the jamos will be converted to the Unicode compatibility
> characters in the U+31xx block, not the "real" Hangul Jamos block at
> U+11xx. They will thus not be canonically equivalent to either the
> U+11xx jamos or the precomposed syllables.
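A quick way to see the distinction Doug describes, as a minimal sketch using Python's standard unicodedata module (the sample characters are my own choice for illustration): canonical normalization leaves a U+31xx compatibility jamo alone, and only compatibility normalization maps it to a U+11xx jamo.

```python
import unicodedata

compat_kiyeok = "\u3131"     # HANGUL LETTER KIYEOK (compatibility jamo, U+31xx block)
choseong_kiyeok = "\u1100"   # HANGUL CHOSEONG KIYEOK (conjoining jamo, U+11xx block)
syllable_ga = "\uac00"       # HANGUL SYLLABLE GA (precomposed)

# Canonical forms (NFD/NFC) do not touch the compatibility jamo.
nfd_compat = unicodedata.normalize("NFD", compat_kiyeok)
nfc_compat = unicodedata.normalize("NFC", compat_kiyeok)

# Only the compatibility forms (NFKD/NFKC) map it into the U+11xx block.
nfkd_compat = unicodedata.normalize("NFKD", compat_kiyeok)

# A precomposed syllable, by contrast, is canonically equivalent to U+11xx jamos.
nfd_syllable = unicodedata.normalize("NFD", syllable_ga)
```

Here nfd_compat and nfc_compat stay U+3131, nfkd_compat becomes U+1100, and nfd_syllable becomes U+1100 U+1161.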

> I don't see any reason why a reasonably smart conversion program can't
> convert legacy-encoded "generic" consonants into Unicode's segregated
> choseong and jongseong, based on the same principles that make two-set
> keyboards workable.
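A rough sketch of what such a converter might look like, under simplifying assumptions of my own: the input is already a string of U+31xx compatibility jamos, a consonant directly followed by a vowel is taken as choseong, a consonant directly after a vowel is taken as jongseong, and anything else is passed through unchanged. The name compat_to_syllables is hypothetical, and a real two-set automaton would also have to build consonant clusters and compound vowels from several keystrokes, which this does not attempt.

```python
import unicodedata

# Compatibility consonants U+3131..U+314E, in code point order.
CONSONANTS = [chr(cp) for cp in range(0x3131, 0x314F)]
# Cluster letters that cannot be a leading consonant (no choseong in a modern syllable).
CLUSTERS = set("\u3133\u3135\u3136\u313a\u313b\u313c\u313d\u313e\u313f\u3140\u3144")
# Doubled letters that cannot be a trailing consonant.
NOT_FINAL = set("\u3138\u3143\u3149")

LEAD_LETTERS = [c for c in CONSONANTS if c not in CLUSTERS]     # 19 letters
TRAIL_LETTERS = [c for c in CONSONANTS if c not in NOT_FINAL]   # 27 letters

LEAD = {c: chr(0x1100 + i) for i, c in enumerate(LEAD_LETTERS)}
TRAIL = {c: chr(0x11A8 + i) for i, c in enumerate(TRAIL_LETTERS)}
VOWEL = {chr(0x314F + i): chr(0x1161 + i) for i in range(21)}


def compat_to_syllables(text):
    """Assign choseong/jongseong roles to compatibility jamos, then compose."""
    out = []
    state = "start"  # "lead" after a choseong, "vowel" after a jungseong
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in LEAD and nxt in VOWEL:
            out.append(LEAD[ch])      # consonant before a vowel: choseong
            state = "lead"
        elif ch in VOWEL and state == "lead":
            out.append(VOWEL[ch])     # vowel after a choseong: jungseong
            state = "vowel"
        elif ch in TRAIL and state == "vowel":
            out.append(TRAIL[ch])     # consonant after a vowel: jongseong
            state = "start"
        else:
            out.append(ch)            # isolated jamo or other text: leave as is
            state = "start"
    # NFC composes each choseong + jungseong (+ jongseong) run into a syllable.
    return unicodedata.normalize("NFC", "".join(out))


hangul = compat_to_syllables("\u314e\u314f\u3134\u3131\u3161\u3139")
```

With the six compatibility jamos above (HIEUH, A, NIEUN, KIYEOK, EU, RIEUL) the result is the two syllables U+D55C U+AE00; an isolated consonant comes back unchanged.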

Actually, there's a provision (not widely implemented) in KS X 1001
that lets KS X 1001-based character encodings such as EUC-KR and
ISO-2022-KR represent, as 8-byte sequences, the 8,822 syllables not
listed in KS X 1001 as precomposed forms, incomplete syllables that
begin with 'filler', and isolated leading consonants, vowels, and
trailing consonants. See the CJK section of the Unicode FAQ at
http://www.unicode.org/faq/han_cjk.htm

Q: When mapping to KS X 1001-based MBCS character encodings, how should
I map the 8,822 Unicode Hangul syllables not covered by KS X 1001?
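To make the 8-byte form concrete, here is a minimal sketch, assuming the sequence is the Hangul filler (U+3164, 0xA4D4 in EUC-KR) followed by three compatibility jamos for the leading consonant, vowel, and trailing consonant, with the filler again standing in for an absent trailing consonant. The function name makeup_sequence is hypothetical, and the sketch does not check whether a syllable already has a precomposed code in KS X 1001.

```python
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT   # 11,172 modern syllables
FILLER = "\u3164"                  # HANGUL FILLER

# Compatibility consonants U+3131..U+314E, in code point order.
CONSONANTS = [chr(cp) for cp in range(0x3131, 0x314F)]
CLUSTERS = set("\u3133\u3135\u3136\u313a\u313b\u313c\u313d\u313e\u313f\u3140\u3144")
NOT_FINAL = set("\u3138\u3143\u3149")

CHOSEONG = [c for c in CONSONANTS if c not in CLUSTERS]               # 19 entries
JUNGSEONG = [chr(cp) for cp in range(0x314F, 0x3164)]                 # 21 entries
JONGSEONG = [FILLER] + [c for c in CONSONANTS if c not in NOT_FINAL]  # 28 entries


def makeup_sequence(syllable):
    """Return filler + three compatibility jamos for one syllable, in EUC-KR."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    # Standard arithmetic decomposition of a Hangul syllable index.
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, trail = divmod(rest, T_COUNT)
    jamos = FILLER + CHOSEONG[lead] + JUNGSEONG[vowel] + JONGSEONG[trail]
    # Every character here is in KS X 1001 row 4, so each encodes to 2 bytes.
    return jamos.encode("euc_kr")


seq = makeup_sequence("\ub620")  # a syllable outside the 2,350 in KS X 1001
```

For U+B620 this yields the 8 bytes A4 D4 A4 A8 A4 C7 A4 B1.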

IMHO, sequences that don't fit this 8-byte sequence pattern should
just be converted to Hangul compatibility Jamos 'verbatim'.

Jungshik


This archive was generated by hypermail 2.1.5 : Sat Dec 06 2003 - 04:35:16 EST