Re: Questions on Chinese collation, stroke

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Thu, 7 Jun 2012 17:54:01 -0700

On Thu, Jun 7, 2012 at 4:28 PM, Matt Ma <matt.ma.umail_at_gmail.com> wrote:

> Hi,
>
> I have two questions regarding the collation sequence defined in
> zh.xml, CLDR 21.0
>
> 1. Why is U+8303 (范) counted as 9 strokes instead of 8 for <collation
> type="stroke">? As a reference, U+59DA (姚) is counted as 9 strokes but
> sorted before U+8303 (范).
>

CLDR now gets the stroke collation data from the kTotalStrokes property. The
values for that are in the file Unihan/Unihan_DictionaryLikeData.txt in the
Unicode Character Database.

There you find the line:

U+8303 kTotalStrokes 9

If that is in error, or if there is any other error in the kTotalStrokes
data, then please report the correct value according to
http://www.unicode.org/review/pri230/ so that it can be fixed.

As a related matter, CLDR now gets the pinyin collation data from
the kMandarin property. The values for that are in the
file Unihan/Unihan_Readings.txt in the Unicode Character Database. So if
any of those are in error, they should also be reported as per
http://www.unicode.org/review/pri230/ .
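
For anyone who wants to check these values before filing a report, here is a
minimal sketch of a lookup over the Unihan data files. It assumes the usual
Unihan line layout (code point, property name, value, separated by tabs, with
comment lines starting with "#"). The function name read_unihan_property and
the two sample lines are illustrative only and are not part of CLDR or the UCD
tooling.

```python
import io


def read_unihan_property(lines, prop):
    """Return {code point (int): value (str)} for one Unihan property.

    `lines` is any iterable of text lines in the assumed Unihan layout:
    U+XXXX <tab> property name <tab> value
    """
    values = {}
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # not a data line in the assumed layout
        code, name, value = fields
        if name == prop and code.startswith("U+"):
            values[int(code[2:], 16)] = value
    return values


# Illustrative sample lines only; real data comes from the Unihan files.
SAMPLE = (
    "# sample data\n"
    "U+4E00\tkTotalStrokes\t1\n"
    "U+4E8C\tkTotalStrokes\t2\n"
    "U+4E00\tkMandarin\ty\u012b\n"
)

strokes = read_unihan_property(io.StringIO(SAMPLE), "kTotalStrokes")
for cp, count in sorted(strokes.items()):
    print(f"U+{cp:04X} {chr(cp)} kTotalStrokes {count}")
```

To run it against the real data, pass an open file instead of the sample,
for example open("Unihan_DictionaryLikeData.txt", encoding="utf-8") for
kTotalStrokes or open("Unihan_Readings.txt", encoding="utf-8") for kMandarin,
and look up 0x8303 in the returned dictionary.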

The beta data is in ftp://www.unicode.org/Public/6.2.0/ucd/. Currently it is in
ftp://www.unicode.org/Public/6.2.0/ucd/Unihan-6.2.0d1.zip
but as the beta proceeds, the d1 might change to d2, d3, and so on.

>
> 2. Does the collation type, stroke, apply to both Simplified and
> Traditional Chinese, as I do not see anything defined in zh_Hant.xml
> under "stroke"?
>

Let me look at that.

>
> Thanks,
> Matt
>
>
>
Received on Thu Jun 07 2012 - 19:58:15 CDT
