Re: Discrepancies between kTotalStrokes and kRSUnicode in the Unihan database

From: Andrew West <andrewcwest_at_gmail.com>
Date: Tue, 9 Sep 2014 16:57:16 +0100

Hi John,

You raise some interesting points, and I hope that one of the people
who maintain the Unihan database can address your issues better than I
can.

I think that the reason why the main CJK block shows the greatest
number of mismatches between kTotalStrokes and kRSUnicode is related
to the way that CJK characters were ordered in the initial Unicode 1.0
repertoire, which seems to have been based on the glyph shape used in
the particular source standard. To take U+5040 偀 as an example, this
is mainly a Cantonese character, and I guess that the source from
which the Unicode character was derived used a traditional form of the
character with a broken grass radical, giving a residual stroke count
of 9, and it was thus ordered in the code charts as the first
character with nine strokes under radical 9, hence kRSUnicode = 9.9
(you can see in the Unicode 2.0 code charts
<http://www.unicode.org/versions/Unicode2.0.0/CodeCharts2.pdf> that
U+5040 indeed has a 4-stroke grass radical). In a later version of
Unicode a new font was used in which U+5040 was represented in the
(single column) code charts with a 3-stroke grass radical glyph (see
for example the Unicode 4.0 code charts
<http://www.unicode.org/versions/Unicode4.0.0/CodeCharts.pdf>).
kTotalStrokes was presumably based on the glyph forms given in one of
these later versions of the code charts, and so for U+5040
kTotalStrokes = 10.
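
To make the arithmetic concrete, here is a rough Python sketch of the two
ways of counting U+5040. The component stroke counts are the ones discussed
in this thread (2 for the person radical, 3 or 4 for the grass component,
and 5 for the part below it); the names are only for illustration.

```python
# Stroke arithmetic for U+5040 under the two glyph forms discussed above.
PERSON_RADICAL = 2   # radical 9 as it appears on the left of the character
BELOW_GRASS = 5      # residual of U+82F1 per its kRSUnicode of 140.5


def counts(grass_strokes):
    """Return (residual, total) for U+5040 given the grass component's strokes."""
    residual = grass_strokes + BELOW_GRASS
    return residual, PERSON_RADICAL + residual


old_form = counts(4)  # broken (4-stroke) grass component
new_form = counts(3)  # 3-stroke grass component
print(old_form, new_form)
```

The residual of 9 in kRSUnicode matches the first form, while the
kTotalStrokes of 10 matches the second, which is the kind of mismatch
John's test flags.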

The problem of stroke counts is now compounded by the use of
multi-column code charts for CJK, with each character illustrated with
multiple regional glyph forms. In many cases different glyph forms
with differing stroke counts are shown in different columns for the
same character, so the kTotalStrokes and kRSUnicode fields may not
reflect the stroke count for all regional variants of the same
character. Furthermore, when regional variants of the same character
do have varying stroke counts it is not obvious which character form
should be used to calculate the values of kTotalStrokes and
kRSUnicode, which makes these two fields very problematic in my
opinion.

That kRSUnicode allows for multiple values, but only provides more
than one value in a tiny handful of cases (mostly where the character
can be classified under more than one radical), makes the situation
even worse in my opinion. For processes that want to sort CJK
characters, it is very useful to have a single nominal radical-stroke
key for every encoded CJK character, but once you have multiple values
for kRSUnicode (and no indication which value is preferable under
which circumstances) then you are given a choice as to which value to
use but no way of knowing which the best choice is.
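
As an illustration of the kind of sort key I mean, here is a minimal Python
sketch that simply takes the first kRSUnicode value and ignores any others.
It assumes multiple values are separated by spaces; the function name and
the tie-breaking by code point are my own choices, not anything defined by
Unihan.

```python
def rs_sort_key(code_point, krs_unicode):
    """Radical-stroke sort key built from the FIRST kRSUnicode value only."""
    first = krs_unicode.split()[0]          # assumes space-separated values
    radical, residual = first.split(".")
    simplified = radical.endswith("'")      # apostrophe marks a simplified radical
    return (int(radical.rstrip("'")), int(residual), simplified, code_point)


# Sample data: the kRSUnicode values quoted in this thread
entries = {0x82F1: "140.5", 0x5040: "9.9", 0x4E9B: "7.5"}
ordered = sorted(entries, key=lambda cp: rs_sort_key(cp, entries[cp]))
print([f"U+{cp:04X}" for cp in ordered])
```

Picking the first value is an arbitrary rule, which is really the point:
the data itself gives no guidance on which value to prefer.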

My solution would be to have a single kRSUnicode value giving a
nominal radical-stroke value for each character, harmonized with
kTotalStrokes, with stroke count for the two fields calculated
consistently according to some defined criteria; and if there is more
than one possible radical for a particular character then just use
the radical under which it appears in the Unicode code charts. In
addition I would create individual kTotalStrokes and kRSUnicode
fields for each source (G, H, J, K, T, U, V, etc.), which would give
the preferred radical and stroke count for each regional glyph form
given in the code charts.

Andrew

On 8 September 2014 20:03, John Armstrong <john.armstrong.rn3_at_gmail.com> wrote:
> [Apologies if this issue has already been resolved. I searched the
> Unicode.org site for discussions but I only found a document dating from 2003
> which touches on the issue: andrewcwest_at_alumni.princeton.edu RE: Unicode
> 4.0.1 Beta Review 1. kRSUnicode Field
> (http://www.unicode.org/L2/L2003/03311-errata4.txt)]
>
>
>
> A CJK Han character is conventionally viewed as consisting of a radical plus
> a residual part or “phonetic”. (For a character which is a radical the
> residual part is nothing. The term “phonetic”, indicating that the residual
> part of the character points to the pronunciation of the character, properly
> only applies to 90-95% of characters, but it applies in the examples below.)
>
>
>
> The two parts of a character each consist of a specific arrangement of strokes,
> and together account for all the strokes in the character. In particular,
> the number of strokes in the radical portion plus the number of strokes in
> the residual portion equals the total number of strokes in the character.
> The stroke count of a radical combined with a residual part is not always
> the same as the stroke count of the radical appearing on its own, but may be
> slightly or significantly less due to a minor or major abbreviation. (A
> radical may have several forms which are used in different positions of the
> whole character, say left or right side vs. top or bottom. These variants
> may have the same or different stroke counts.)
>
>
>
> Because of abbreviated variants the total stroke count for a character
> cannot always be obtained by adding the stroke count of the radical in its
> standalone form to the stroke count of the residual portion. However, the
> stroke count of the radical as it appears in the character can always be
> obtained by subtracting the stroke count of the residual portion from the
> total stroke count of the character. The Unihan database provides the
> exact data needed to make this calculation:
>
>
>
> kTotalStrokes: stroke count for full character
>
>
>
> kRSUnicode: radical number and residual stroke count, in the format
> <rad_num>[‘].<res_strokes>, where the optional ‘ (apostrophe) after the
> radical number indicates a widely used abbreviation for the radical with a
> significantly different appearance and a significantly (3 or more) lower
> stroke count. (But not all such forms are so marked – examples are forms
> with radical numbers 140, 162, 163, 170. It may be that the marker is
> limited to abbreviations used in Simplified as opposed to Traditional
> Chinese characters.)
>
>
>
> The formula is simply:
>
>
>
> radStrokes(K) = kTotalStrokes(K) - kRSUnicode(K).resStrokes
>
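A minimal Python sketch of this formula, using only the first kRSUnicode
value. The function names and the regular expression are illustrative
assumptions, not the program mentioned at the end of this message.

```python
import re

# <rad_num>['].<res_strokes>; only the first value of a multi-valued field is used
RS_PATTERN = re.compile(r"^(\d+)('?)\.(-?\d+)$")


def parse_rs(krs_unicode):
    """Split the first kRSUnicode value into (radical, simplified, residual)."""
    match = RS_PATTERN.match(krs_unicode.split()[0])
    if match is None:
        raise ValueError(f"unexpected kRSUnicode value: {krs_unicode!r}")
    return int(match.group(1)), match.group(2) == "'", int(match.group(3))


def implied_radical_strokes(k_total_strokes, krs_unicode):
    """radStrokes = kTotalStrokes - kRSUnicode.resStrokes"""
    _, _, residual = parse_rs(krs_unicode)
    return k_total_strokes - residual


# U+82F1: kTotalStrokes 8 and kRSUnicode 140.5, as quoted in Example 2 below
print(implied_radical_strokes(8, "140.5"))
```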
>
>
> This formula generally gives correct results, but not always. In fact,
> according to a reasonably accurate heuristic test I ran, it produces incorrect
> (or at least “suspicious”) results in 2236 of the total 74911, or 3%, of
> characters in the database that have both kTotalStrokes and kRSUnicode data.
> Moreover the rate is significantly higher for the characters in the BMP than
> in the SIP – in fact it is really negligible in the latter. Most
> importantly it is 8.2% in the block containing all the most widely used
> characters, the base CJK Unified Ideographs block. The numbers for all the
> blocks are as follows:
>
>
>
> RANGE      TOTAL*  SUSPICIOUS   PCT
>
> BMP
>   BASE     20941         1727   8.2
>   CMP        302           29   9.6
>   CMPS         4            0   0.0
>   EXTA      6582          469   7.1
>
> SIP
>   EXTB     42711            6   0.0
>   EXTC      4149            5   0.1
>   EXTD       222            0   0.0
>
> TOTAL      74911         2236   3.0
>
> *with both kTotalStrokes and kRSUnicode
>
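The per-plane rates are not listed in the table, but they follow from its
rows. A short Python sketch that recomputes them (the dictionary layout and
function names are just for illustration):

```python
# (TOTAL, SUSPICIOUS) per block, copied from the table above
rows = {
    "BASE": (20941, 1727), "CMP": (302, 29), "CMPS": (4, 0), "EXTA": (6582, 469),
    "EXTB": (42711, 6), "EXTC": (4149, 5), "EXTD": (222, 0),
}


def pct(total, suspicious):
    """Percentage of suspicious characters, rounded to one decimal place."""
    return round(100 * suspicious / total, 1)


def subtotal(names):
    """Return (total, suspicious, pct) summed over the named blocks."""
    total = sum(rows[n][0] for n in names)
    suspicious = sum(rows[n][1] for n in names)
    return total, suspicious, pct(total, suspicious)


bmp = subtotal(["BASE", "CMP", "CMPS", "EXTA"])
sip = subtotal(["EXTB", "EXTC", "EXTD"])
overall = subtotal(rows)
print(bmp, sip, overall)
```

On these figures the BMP blocks together come to about 8.0% and the SIP
blocks to well under 0.1%.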
>
>
>
>
> Some of the suspicious cases are actually valid, but I believe that the vast
> majority are truly incorrect, and that the rate of incorrect radical stroke
> counts implied by kTotalStrokes and kRSUnicode is at least 6-7% for the base
> CJK Unified Ideographs block.
>
>
>
>
>
>
>
> Here are a couple of examples where the stroke counts are fairly small and
> the radicals and the residual parts (“phonetics”) occur widely. The first
> illustrates the situation where the radical stroke count implied by
> kTotalStrokes and kRSUnicode is greater than the correct value, and the
> second that where the implied radical stroke count is less than the correct
> value. (The second situation is much more common than the first, accounting
> for at least 80% of the “suspicious” items.)
>
>
>
>
>
> Example 1: character U+4E9B ‘a few’
>
>
>
>
>
> kTotalStrokes = 8
>
> kRSUnicode = 7.5
>
> radical number = 7 ‘two’
>
> residual strokes = 5
>
> implied radical stroke count = 3 (8 – 5)
>
> correct radical stroke count = 2
>
> diff = 1 (implied count one too high)
>
>
>
>
>
> The residual portion of the character occurs as an independent character
> U+6B64 ‘this, these’. Its kTotalStrokes is 6 and its kRSUnicode = 77.2.
> The radical #77 ‘stop’ has 4 strokes in its standalone form, so the residual
> stroke count of 2 is consistent with a total count of 6. In the main
> character U+4E9B, therefore, the residual part has effectively lost a stroke
> in composition, being reduced from 6 to 5.
>
>
>
> (This actually seems to be the norm with this phonetic. Other examples are
> U+4F4C, U+5470, U+5472, U+59D5, U+67F4, U+75B5, U+7689, U+7725 and I’m sure
> more.)
>
>
>
>
>
> Example 2: character U+5040 ‘distinguished person; English person’
>
>
>
> kTotalStrokes = 10
>
> kRSUnicode = 9.9
>
> radical number = 9 ‘person’
>
> residual strokes = 9
>
> implied radical stroke count = 1 (10 - 9)
>
> correct radical stroke count = 2
>
> diff = -1 (implied count one too low)
>
>
>
> Again the residual portion occurs as an independent character U+82F1
> ‘distinguished; English’. Its kTotalStrokes is 8 and its kRSUnicode is
> 140.5. Radical #140 ‘grass’ has 6 strokes in its standalone form but as the
> radical component of a larger character is always abbreviated to a form with
> 3 strokes. That is the case here. Thus the residual count of 5 in the
> kRSUnicode of U+82F1 is consistent with the kTotalStrokes of 8 for the
> character. This count of 8 agrees with the residual count for the full
> character U+5040 implied by its 10 kTotalStrokes, but is one less than the 9
> residual strokes specified in the kRSUnicode.
>
>
>
> In both examples the discrepancy between kTotalStrokes and kRSUnicode arises
> out of different residual stroke counts and has nothing to do with the
> radical, be it its identity, the variant used, or the stroke count. While
> there are some exceptions, this is clearly the normal situation. It also
> makes sense. Most disagreements on stroke counts have to do with the
> residual as opposed to the radical portion of the characters. (Questions of
> radical counts usually involve cases where the radical has more than one
> form in a given context, for example rad #140 ‘grass’, which has 6 strokes
> in its full form but variants in the top position context with 3 and 4
> strokes. Less commonly, they involve cases where the radical is fused with
> the residual portion or even lost altogether as part of a historical
> simplification.)
>
>
>
> As mentioned above, discrepancies of the type illustrated by the second
> example (implied radical strokes lower than correct) are much more common
> than discrepancies of the type illustrated by the first example (implied
> radical strokes higher than correct). To the extent the discrepancies involve
> the residual stroke counts and have nothing to do with the radical, the
> situation can be reframed in terms of residual stroke counts as:
>
>
>
> Dominant pattern: the residual stroke count specified in kRSUnicode is
> greater than that implied by kTotalStrokes (9 vs. 8 strokes in Ex. 2)
>
>
>
> Minor pattern: the residual stroke count specified in kRSUnicode is less
> than that implied by kTotalStrokes (5 vs. 6 strokes in Ex. 1)
>
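The two patterns can be told apart mechanically once the correct radical
stroke count is known. A rough Python sketch, where the function name and
the returned labels are illustrative assumptions:

```python
def classify(k_total_strokes, rs_residual, actual_radical_strokes):
    """Compare the kRSUnicode residual with the one implied by kTotalStrokes."""
    implied_residual = k_total_strokes - actual_radical_strokes
    if rs_residual > implied_residual:
        return "dominant", rs_residual - implied_residual
    if rs_residual < implied_residual:
        return "minor", implied_residual - rs_residual
    return "consistent", 0


print(classify(10, 9, 2))  # Example 2, U+5040: kRSUnicode residual 9, implied 8
print(classify(8, 5, 2))   # Example 1, U+4E9B: kRSUnicode residual 5, implied 6
```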
>
>
> The results of the heuristic test indicate that the great majority of cases
> of both patterns involve differences in residual stroke counts of one or
> occasionally two strokes. I believe this is in line with the variations in
> stroke counting that are observed in actual practice (dictionaries etc.).
> Still, the question needs to be asked, do the discrepancies (which occur in
> about 8% of characters in the base CJK Unified Ideographs block) simply represent
> different, but more or less equally valid, ways of counting strokes, or are
> they errors that need to be corrected or at least addressed in some way?
>
>
>
> In my view the answer depends on a more specific question: are kTotalStrokes
> and kRSUnicode intended to be consistent? That is, regardless of what exact
> count is chosen for a given character, should both terms reflect the same
> count?
>
>
>
> Here is how the two fields are described in the document Proposed Update to
> Unicode Standard Annex #38 Unicode 6.0.0 draft 1
> (http://www.unicode.org/reports/tr38/tr38-8.html):
>
>
>
> kTotalStrokes:
>
>
>
> “The total number of strokes in the character (including the radical). _This
> value is for the character as drawn in the Unicode charts_.”
>
>
>
> kRSUnicode:
>
>
>
> “A standard radical/stroke count for this character in the form
> “radical.additional strokes”. The radical is indicated by a number in the
> range (1..214) inclusive. An apostrophe (') after the radical indicates a
> simplified version of the given radical. The “additional strokes” value is
> the residual stroke-count, the count of all strokes remaining after
> eliminating all strokes associated with the radical.
>
>
>
> This field is also used for additional radical-stroke indices where either a
> character may be reasonably classified under more than one radical, or
> alternate stroke count algorithms may provide different stroke counts.
>
>
>
> _The first value is intended to reflect the same radical as the kRSKangXi
> field and the stroke count of the glyph used to print the character within
> the Unicode Standard_.”
>
>
>
> When I talk about kRSUnicode I always mean the first value in the list.
> Similarly my heuristic test always uses the first value. I mention this
> because of the way the last paragraph of the description refers specifically
> to this value.
>
>
>
> Both descriptions tie the specific values of the two fields to the specific
> glyphs used to draw/print the character in the Unicode charts (kTotalStrokes
> “character as drawn in the Unicode charts”, kRSUnicode “the glyph used to
> print the character within the Unicode Standard”). Given this, the answer
> to the question of whether the two fields should be consistent certainly
> seems to be yes. And this means that the cases where they are not, i.e.
> where there are discrepancies, are errors.
>
>
>
> If it’s conceded that the discrepancies do reflect errors, then I think it
> also needs to be conceded that they need to be addressed in some way. The
> most straightforward thing would be to go through all the cases and change
> either kTotalStrokes or kRSUnicode to (a) be consistent and (b) offer values
> appropriate to the specific glyph used in the standard.
>
>
>
> Given that kRSUnicode is used for ordering characters in the block (the
> radical number being used to determine what radical it is listed after and
> the residual count being used to determine where after the radical it
> appears – except for ties, which are ordered arbitrarily), while to the best
> of my knowledge kTotalStrokes is not used for anything within the standard,
> the most practical thing would be to keep the existing kRSUnicode value
> wherever it is not obviously incorrect and adjust the kTotalStrokes to be
> consistent with it.
>
>
>
> But this involves changing a lot of data, including data for the most
> widely used characters (those in the base CJK Unified Ideographs block), and
> may break systems that use the existing values.
>
>
>
> An alternative which I would suggest is to create a new field which could be
> called kRSUnicode2 or something similar and would have not two but three
> subfields (not counting the apostrophe):
>
>
>
> <rad_num>[‘].<rad_strokes>.<res_strokes>
>
>
>
> where the first and third subfields are the same (same meaning, same values,
> barring clear errors) as in kRSUnicode and the added second subfield is the
> number of strokes in the radical as it appears in the character.
>
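A minimal Python sketch of how such a three-subfield value might be parsed.
kRSUnicode2 is only the proposal made here, so the class, the function, and
the sample value are all hypothetical.

```python
from typing import NamedTuple


class RS2(NamedTuple):
    radical: int
    simplified: bool
    radical_strokes: int
    residual_strokes: int

    @property
    def total_strokes(self):
        """Total strokes, derived by adding the two stroke subfields."""
        return self.radical_strokes + self.residual_strokes


def parse_rs2(value):
    """Parse '<rad_num>['].<rad_strokes>.<res_strokes>' (the proposed format)."""
    radical, rad_strokes, res_strokes = value.split(".")
    return RS2(int(radical.rstrip("'")), radical.endswith("'"),
               int(rad_strokes), int(res_strokes))


rs2 = parse_rs2("140.3.5")   # hypothetical value for U+82F1
print(rs2.total_strokes)
```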
>
>
> This new field would contain all the stroke count information that’s needed
> for a character, including not only the residual strokes but also the
> radical strokes and, via calculation (adding the two values), the total
> strokes. The last can be compared with kTotalStrokes, but does not depend
> on it, and may be different.
>
>
>
> (Note that the presence of apostrophe would become largely predictable from
> a comparison of the radical stroke count in the second subfield with the
> count for the radical as a standalone character. In fact it would only be
> necessary to retain it if its purpose was not simply to indicate
> significantly abbreviated radicals in general but specifically to indicate
> forms that are used in Chinese Simplified but not the corresponding
> Traditional ones.)
>
>
>
> I see the following advantages to this approach:
>
>
>
> (1) No constraints are placed on existing kTotalStrokes or kRSUnicode
> values – they can be left as is or changed at any point without implications
> for the new kRSUnicode2 values
>
>
>
> (2) No systems that use the existing kTotalStrokes or kRSUnicode fields
> will break or be affected in any way (though they could be changed to use
> the self-standing kRSUnicode2 field with possibly more satisfactory results)
>
>
>
> (3) All stroke information for a character is contained in a single field,
> kRSUnicode2, and can’t be inconsistent (though it can be wrong)
>
>
>
> (4) Stroke counting differences between fields can be directly found and
> quantified (in particular, by comparing the partial stroke information in
> kTotalStrokes and/or kRSUnicode to the full information in kRSUnicode2)
>
>
>
> (5) An initial version of the full set of the new kRSUnicode2 field values
> could be generated algorithmically from kTotalStrokes and kRSUnicode and
> then revised by human inspection focusing on the proportionally small amount
> (8% in the base block, 3% overall) of “suspicious” cases detected by a
> heuristic procedure (which I’m sure could be made more accurate than the one
> I used, for example by bringing in more existing information sources)
>
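Point (5) could be sketched roughly as follows in Python. The set of
expected radical stroke counts is a hypothetical lookup that the caller
would have to supply; this is not the heuristic actually used for the
figures above.

```python
def initial_rs2(k_total_strokes, krs_unicode, expected_radical_strokes):
    """Return (kRSUnicode2 candidate, needs_review) for one character.

    expected_radical_strokes: set of plausible stroke counts for the radical
    as it appears in composition (hypothetical lookup, supplied by the caller).
    """
    radical, residual = krs_unicode.split()[0].split(".")
    implied = k_total_strokes - int(residual)
    needs_review = implied not in expected_radical_strokes
    return f"{radical}.{implied}.{residual}", needs_review


print(initial_rs2(8, "140.5", {3, 4}))   # U+82F1
print(initial_rs2(10, "9.9", {2}))       # U+5040
```

Only the entries flagged for review (the second one here, where the implied
radical count is 1) would need human inspection.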
>
>
> The main disadvantages I see are:
>
>
>
> (1) Confusion arising from the overlap between the old and new fields
>
>
>
> (2) The work involved (though anything other than dismissing or postponing
> the issue is going to involve work)
>
>
>
> If there is interest I will be glad to share the results of my heuristic
> test and the program (python) I used to produce them.
>
>
>
> John Armstrong
>
> Cambridge MA
>
>
> _______________________________________________
> Unicode mailing list
> Unicode_at_unicode.org
> http://unicode.org/mailman/listinfo/unicode
>

Received on Tue Sep 09 2014 - 10:58:54 CDT