Re: Korean kugyol

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Wed Nov 29 2000 - 20:05:08 EST


On Tue, 28 Nov 2000, Tom Emerson wrote:

> Thomas Chan writes:
> > I'd like to ask what the rationale is for including Korean kugyol as a
> > subset of CJK Ideographs in Unicode, while Chinese bopomofo and Japanese
> > katakana are treated as distinct from their CJK Ideograph origins and
> > look-alikes.
>
[snipped]
>
> > I can think of a few possible reasons why Chinese bopomofo and Japanese
> > katakana have been treated as distinct from CJK Ideographs, such as 1)
> > distinguished in source legacy CJK character sets; 2) not included in
> > Chinese and Japanese character dictionaries; 3) technically capable of
> > being used in the absence of CJK Ideographs as a complete script; 4) used
> > solely for phonetic value; 5) in widespread contemporary use, so
> > regular people care for the distinction.
>
> (1) hits the nail on the head, as far as I am concerned. None of the
> current Korean character set standards that I'm aware of include
> the kugyol: in particular the code points you mention in your
> message are not found in KS X 1001.

I believe I have just found the source of the kugyol in the CJK
Ideographs block--the G1 source, described in the beginning of the
UNIHAN.TXT file as "GB12345-90 with 58 Hong Kong and 92 Korean 'Idu'
characters".

All of the records in the UNIHAN.TXT file which have the term "kwukyel" in
the kDefinition field also have a kIRG_GSource value for G1 with an EUC
first byte of 7Dh (row 93d in "kuten/quwei" notation). There are also
records with an EUC first byte of 7Ch (row 92d). Both of these are beyond
GB 12345, which ends with 79h (kuten/quwei row 89d), so they must be the
tacked-on "58 Hong Kong and 92 Korean 'Idu' characters".
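
The row numbers above follow from the usual 94x94 layout, where a lead byte
in 21h..7Eh maps to row (byte - 20h). Here is a minimal Python sketch of that
arithmetic. It assumes the byte values quoted above are the 7-bit forms (a
true EUC byte would have the high bit set, which the sketch masks off), and
the function name kuten_row is made up for illustration.

```python
def kuten_row(first_byte: int) -> int:
    """Return the 1-based kuten/quwei row for a lead byte of a 94x94 set."""
    b = first_byte & 0x7F  # accept either the 7-bit form or the high-bit EUC form
    if not 0x21 <= b <= 0x7E:
        raise ValueError(f"not a 94x94 lead byte: {first_byte:#x}")
    return b - 0x20


# The three lead bytes discussed above.
for byte in (0x79, 0x7C, 0x7D):
    print(f"{byte:02X}h -> row {kuten_row(byte)}")
```

With those inputs it should report rows 89, 92, and 93, matching the figures
given in the text.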

Looking closely, one can see that there are 150 characters in the range
7C21h .. 7D7Eh. However, the boundary between the "Hong Kong" and
the "Korean 'Idu'" ones seems to be between 7C56h and 7C57h, which does
not neatly divide up the 150 into 58 and 92; rather, it seems to be a
55/95 split. (counting error?)
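
One way to double-check a split like this is to count the G1 codes on each
side of the suspected boundary. The sketch below assumes the codes have
already been collected as integers (for instance from the kIRG_GSource
values). The helper name split_counts and the sample list are illustrative
only and are not the real G1 assignments.

```python
def split_counts(codes, last_of_first_group):
    """Count codes at or below the boundary, and codes above it."""
    first = sum(1 for c in codes if c <= last_of_first_group)
    return first, len(codes) - first


# Made-up sample, NOT the actual G1 data: three codes on each side of 7C56h.
sample = [0x7C21, 0x7C22, 0x7C56, 0x7C57, 0x7D21, 0x7D6E]
print(split_counts(sample, 0x7C56))  # (3, 3)
```

Run against the full list of 150 codes with the boundary at 7C56h, the same
function would show whether the split is 58/92 or something else.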

The set marked as "Hong Kong" is clearly a partial collection of
additional characters used in writing Cantonese, and thus not technically
limited to "Hong Kong" per se. The set marked as "Korean 'Idu'", while
including the archaic kugyol, also includes some Korean "national
characters" (akin to "kokuji" in Japan), so there is misuse of the term
"Idu" here. And at least 7D6Eh (U+56CD) 'double happiness' is not even
specific to Korean (language), as it is used in Chinese (though not listed
in any Chinese dictionary that I know of), so the best that can be said of
the latter set is that the submissions are from Korea (?).

None of the CJK Ideographs with a kDefinition field of "kwukyel" occur in
KS X 1001 (formerly KS C 5601), and only a few are in the K2 source,
"PKS C 5700-1 1994", which is not one of the sources for the original CJK
Ideographs block (see table 10-1 on p. 259 of TUS 3.0), anyway.
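
For anyone who wants to repeat this check, here is a rough Python sketch of
how the records could be filtered. It assumes each line of the file has the
layout "U+XXXX<TAB>field<TAB>value", which may not match every release of
the file, and the function names are hypothetical. The sample lines use
private-use placeholder code points and invented values, not real kugyol
entries.

```python
from collections import defaultdict


def load_records(lines):
    """Group 'U+XXXX<TAB>field<TAB>value' lines by code point."""
    records = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip anything not in the assumed layout
        code_point, field, value = parts
        records[code_point][field] = value
    return records


def kwukyel_sources(records):
    """Map each code point whose definition mentions 'kwukyel' to its kIRG_* fields."""
    return {
        cp: {f: v for f, v in fields.items() if f.startswith("kIRG_")}
        for cp, fields in records.items()
        if "kwukyel" in fields.get("kDefinition", "")
    }


# Placeholder records only (assumption): not taken from the real file.
sample = [
    "U+E000\tkDefinition\t(kwukyel) placeholder entry",
    "U+E000\tkIRG_GSource\t1-7D21",
    "U+E001\tkDefinition\tplaceholder, not a kugyol entry",
    "U+E001\tkIRG_GSource\t1-7C21",
]
for cp, sources in kwukyel_sources(load_records(sample)).items():
    print(cp, sources)
```

Listing the source fields per code point makes it easy to see at a glance
which of the matching records carry only a G source and which also carry a
Korean one.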

So it looks like it was mainland China's call to first treat kugyol as a
subset of CJK Ideographs, and we are stuck with it.

> (2) isn't true: these are in the Hanyu Dacidian and the Kangxi
> dictionaries. I have not been able to find the code points you mention
> in the Korean Hanja dictionaries that I have.

I haven't looked in the _Hanyu Da Cidian_, but I'm pretty sure Bopomofo
are not in the _Kangxi Zidian_ as they were created in the early 20th
century, way after the publication of the _Kangxi Zidian_ in 1716, except
in cases where a Bopomofo letter has a form identical to that of a character;
the principle behind creating Bopomofo being to re-use
geometrically-simple characters.

If one can't find those kugyol in Korean character dictionaries, then I
would take that as possible (but not conclusive) evidence that they are
not considered characters.

 
> (3), (4), and (5) are irrelevant.
>
> So, by the source separation rule, the presence of these characters in
> the source standards (various Chinese and Japanese dictionaries) as
> well as in existing character sets such as Big 5+ where these
> characters are part of the generic ideograph rows, speaks to having
> them in the Unified Ideograph Blocks.

I'm not sure what you mean here. Source separation ensures separate
codepoints in Unicode, but it doesn't say anything about what was grouped
together in the legacy character set, nor what will be grouped together in
Unicode. For example, CNS 11643-1986, which was used in compiling the original
CJK Ideographs block, had some candidates in kuten/quwei row 2d,
starting with column 88d, which ended up in Unicode's CJK Ideographs
block, even though in CNS 11643-1986 they were nowhere near the "hanzi"
starting with row 7d. (They were "hanzi" used in measurements.) The
so-called "Hangzhou numerals" were not grouped with the "hanzi" in CNS
11643, but one to nine ended up in Unicode's CJK Symbols and Punctuation
block, while ten, twenty, and thirty sat in limbo (see the CNS11643.TXT
and BIG5.TXT files for their comments) unmapped--maybe they could have
become CJK compatibility characters for the CJK Ideographs block. (It
looks like they ended up in the CJK Symbols and Punctuation block in
Unicode 3.0 after all.) Even if the G1 source mixed kugyol in with
characters, there's no reason why Unicode absolutely had to put them in
the CJK Ideographs block (but they did).

As for Big5+, it couldn't have played a role, since it didn't exist until
1997.

Thomas Chan
tc31@cornell.edu


