Re: CJK stroke order data: kRSUnicode v. kRSKangXi

From: Richard COOK <rscook_at_unicode.org>
Date: Mon, 10 Mar 2014 08:39:19 -0700

Mr. Nohejl,

Regarding the property data you mention below: kRSUnicode permits multiple (space-delimited) variant radical/stroke values, and I think we will see important variants added in the future. Where a specific value attested in a specific Kangxi edition is missing from kRSUnicode, it would indeed be useful to add it, and perhaps to give it priority (move it to the front of the list). Likewise, if a common variant value is missing (even one not associated with Kangxi), it might be added for convenience. And if there are any outright errors, those should of course be identified and corrected (though clear errors are harder to find these days).
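
As a rough illustration of the space-delimited format, here is a minimal Python sketch that parses a radical/stroke field and checks whether its first (priority) value uses the same radical as a kRSKangXi value. The names `RadicalStroke`, `parse_rs` and `radicals_agree` are hypothetical, the regular expression is only an assumption about the value syntax (radical number, optional apostrophes for a simplified radical, a period, then the residual stroke count), and the two-value string "73.8 13.10" is an invented example, not actual Unihan data.

```python
import re
from typing import List, NamedTuple


class RadicalStroke(NamedTuple):
    radical: int           # Kangxi radical number, 1..214
    simplified: bool       # True if the radical carries an apostrophe marker
    residual_strokes: int  # strokes remaining after removing the radical


# Assumed syntax of one value: e.g. "73.8" or "120'.3"
_RS = re.compile(r"([1-9][0-9]{0,2})('*)\.(-?[0-9]{1,2})")


def parse_rs(field: str) -> List[RadicalStroke]:
    """Split a space-delimited radical/stroke field into its values, in order."""
    values = []
    for token in field.split():
        m = _RS.fullmatch(token)
        if m is None:
            raise ValueError(f"not a radical/stroke value: {token!r}")
        values.append(RadicalStroke(int(m.group(1)), bool(m.group(2)), int(m.group(3))))
    return values


def radicals_agree(krs_unicode: str, krs_kangxi: str) -> bool:
    """True if the first kRSUnicode radical equals the first kRSKangXi radical."""
    return parse_rs(krs_unicode)[0].radical == parse_rs(krs_kangxi)[0].radical


# Values quoted in the message below for U+6700: the radicals differ.
print(radicals_agree("13.10", "73.8"))       # False
# Hypothetical field with the Kangxi value moved to the front of the list.
print(radicals_agree("73.8 13.10", "73.8"))  # True
```

Only the first value is compared here because the first position is the one given priority; a fuller check might also report whether the Kangxi value appears anywhere in the list.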

Note that because kRSUnicode covers *all* Unihan CJK, including characters not present in the original Kangxi, some of the radical/stroke values are so-called "virtual" assignments (these should be left out of consideration when proofing original KX data).

Several years ago we (at Wenlin.com) produced consolidated Kangxi data for our Zidian (Wenlin 4.X), taking these four properties (among other data) as input:

<http://www.unicode.org/reports/tr38/#kIRGKangXi>
<http://www.unicode.org/reports/tr38/#kKangXi>
<http://www.unicode.org/reports/tr38/#kRSKangXi>
<http://www.unicode.org/reports/tr38/#kIRG_GSource>

The last of these may not have any obvious connection with Kangxi, until one reads the kIRG_GSource property description and sees this "sub-property" description:

"GKX Kangxi Dictionary ideographs (康熙字典) 9th edition (1958) including the addendum (康熙字典)補遺"

PRC researchers have done much work proofing G-Source Kangxi data, to address many aspects of the complex original text.

The Kangxi work we did at Wenlin has several dimensions, and some of it has not yet rippled back into the UCD.

We have in fact already identified many important omissions from kRSUnicode, which we plan to propose for a future data release.

Since kRSUnicode is a Normative property, a formal proposal to modify that data is required, for review in WG2. I have added notes on the items you mention below, for consideration in that process, and in the meantime, if you identify any other issues, please bring them to our attention.

-Richard

PS: About the subject line of your message: despite the "CJK stroke order" wording, we are not talking about CJK stroke order here at all, but about Kangxi and UCS radical assignment, and residual stroke *count* data. Such data can indeed be used to "order" (collate) CJK data, but "stroke order" is a separate issue, involving the particular sequence of CJK Strokes (see The Unicode Standard, Appendix F) in the writing of a given character (stroke-order data can also be used for collation and indexing). Wenlin's CDL database (which inspired the CJK Strokes block, and also produced Appendix F) contains a comprehensive analysis of CJK Stroke order *and* Radical/Stroke data for all UCS CJK, primarily focused on PRC norms, but also including a great many variants (variant forms, variant stroke counts, and variant radical assignments).
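
To make the distinction concrete, here is a minimal Python sketch of collating by radical and residual stroke *count* (not stroke order). It uses only the four kRSUnicode values quoted in the message below; the names `KRS_UNICODE`, `rs_sort_key` and `collate`, and the choice of code point as a final tie-breaker, are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# kRSUnicode values as quoted in the message below (code point -> field value).
KRS_UNICODE: Dict[int, str] = {
    0x9F9D: "213.5",
    0x6700: "13.10",
    0x66FB: "73.7",
    0x4E80: "5.10",
}


def rs_sort_key(code_point: int) -> Tuple[int, int, int]:
    """Sort key: (radical, residual stroke count, code point as tie-breaker)."""
    first = KRS_UNICODE[code_point].split()[0]   # first value has priority
    radical, strokes = first.split(".")
    return (int(radical.rstrip("'")), int(strokes), code_point)


def collate(code_points: Iterable[int]) -> List[int]:
    """Order characters by radical, then residual strokes, then code point."""
    return sorted(code_points, key=rs_sort_key)


print([f"U+{cp:04X}" for cp in collate(KRS_UNICODE)])
# ['U+4E80', 'U+6700', 'U+66FB', 'U+9F9D']
```

Substituting the kRSKangXi values for the same characters would give a different ordering, which is one practical reason the discrepancies matter.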

On Feb 28, 2014, at 10:56 AM, Adam Nohejl wrote:

>
> (1) A very common character for "most, maximum".
> 最[U+6700] kRSKangXi 73.8
> 最[U+6700] kRSUnicode 13.10
>
> (2) A funny character for autumn containing the turtle component.
> 龝[U+9F9D] kRSKangXi 115.16
> 龝[U+9F9D] kRSKanWa 115.16
> 龝[U+9F9D] kRSUnicode 213.5
>
> There are also characters that actually are not included in the Kang Xi dictionary**, but the Unihan data contain both a purported Kang Xi radical and in addition to that a _different_ Unicode radical.
>
> (3) The simplified turtle character (commonly assigned to the traditional radical #213):
> 亀[U+4E80] kRSKangXi 213.0
> 亀[U+4E80] kRSUnicode 5.10
>
> (4) Character with the radical #72/73 at the top, i.e. IMHO an arbitrary decision, but unexpectedly the fields differ:
> 曻[U+66FB] kRSKangXi 72.7
> 曻[U+66FB] kRSUnicode 73.7

> Hello,
>
> I am comparing radical data for CJK characters from different sources, including the Unihan database. According to the Unihan documentation* the kRSUnicode radical should correspond to the kRSKangXi radical, which in turn should be based on the Kang Xi dictionary.
>
> Is there any explanation for the following discrepancies? Did I miss any other rules or reasoning behind the content of these two fields?
>
> Examples of the discrepancies:
>
> (1) A very common character for "most, maximum".
> U+6700 kRSKangXi 73.8
> U+6700 kRSUnicode 13.10
>
> (2) A funny character for autumn containing the turtle component.
> U+9F9D kRSKangXi 115.16
> U+9F9D kRSKanWa 115.16
> U+9F9D kRSUnicode 213.5
>
> There are also characters that actually are not included in the Kang Xi dictionary**, but the Unihan data contain both a purported Kang Xi radical and in addition to that a _different_ Unicode radical.
>
> (3) The simplified turtle character (commonly assigned to the traditional radical #213):
> U+4E80 kRSKangXi 213.0
> U+4E80 kRSUnicode 5.10
>
> (4) Character with the radical #72/73 at the top, i.e. IMHO an arbitrary decision, but unexpectedly the fields differ:
> U+66FB kRSKangXi 72.7
> U+66FB kRSUnicode 73.7
>
> - - -
>
> [*] <http://www.unicode.org/reports/tr38/tr38-8.html>: "Property: kRSUnicode // Description: (...) The first value is intended to reflect the same radical as the kRSKangXi field and the stroke count of the glyph used to print the character within the Unicode Standard."
>
> [**] The two characters are missing from the '89 edition of Kang Xi (which should be the same as used for Unihan) according to a search on this site: <http://ctext.org/dictionary.pl>
>
>
> --
> Adam Nohejl
>
>
> _______________________________________________
> Unicode mailing list
> Unicode_at_unicode.org
> http://unicode.org/mailman/listinfo/unicode

Received on Mon Mar 10 2014 - 10:40:39 CDT
