Re: [cldr-dev] Re: Questions on Chinese collation, stroke

From: Stephan Stiller <sstiller_at_stanford.edu>
Date: Tue, 26 Jun 2012 00:36:57 -0400

Hi Matt,

I hope I haven't already sent these out, but I think (the most?)
recent prescriptive stroke orders (and radical assignments) for HK and
TW are here
     http://www.edbchinese.hk/lexlist_en/
     http://stroke-order.learningweb.moe.edu.tw/home.do
respectively. Obviously I haven't checked them for consistency.

I agree about the CN source, which I am familiar with.

I'll be curious to hear from anyone about present or future committee
work on this (i.e., the ordering; the radical and stroke assignments are
a whole other matter that would, ideally, require committee work as
well). Given the importance of this data and its possible impact (if the
Unicode Consortium provides it to the world), I'd really want it to
be completely right. The easiest way to achieve this would be to
generate it algorithmically wherever feasible, with manual intervention
only where ties need to be broken. At any rate, even if the process ends
up being less strict, it would need to be properly documented, so that
people can tell which base dataset/principles were meant to be completed.
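
To make the idea concrete, here is a minimal sketch (not an existing CLDR
or Unicode tool) of what "generate algorithmically, intervene only for
ties and corrections" could look like. It assumes input lines in the
"U+XXXX kTotalStrokes N" shape quoted further down in this thread, takes
the first value as the Simplified count and the last as the Traditional
one (the interpretation discussed below, not something verified here),
and breaks ties by code point, which is an arbitrary but documented
choice. The function names, the sample lines, and the override table are
all hypothetical.

```python
# Illustrative sample in the Unihan "code point, property, value" layout.
SAMPLE_LINES = [
    "# sample data for illustration only",
    "U+4E00\tkCangjie\tM",        # other properties are skipped
    "U+4E00\tkTotalStrokes\t1",
    "U+59DA\tkTotalStrokes\t9",
    "U+8303\tkTotalStrokes\t8",   # the line quoted in this thread
]


def parse_total_strokes(lines):
    """Return {code_point: [stroke counts]} for kTotalStrokes lines."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 2)
        if len(fields) != 3 or fields[1] != "kTotalStrokes":
            continue
        code_point = int(fields[0][2:], 16)   # drop the "U+" prefix
        counts[code_point] = [int(v) for v in fields[2].split()]
    return counts


def stroke_order(counts, overrides=None, traditional=False):
    """Sort code points by stroke count, then by code point.

    overrides is a hypothetical, manually maintained {code_point: count}
    table; every entry in it should be documented with its source.
    """
    overrides = overrides or {}

    def key(code_point):
        values = counts[code_point]
        strokes = values[-1] if traditional else values[0]
        strokes = overrides.get(code_point, strokes)
        return (strokes, code_point)          # code point breaks ties

    return sorted(counts, key=key)


counts = parse_total_strokes(SAMPLE_LINES)
default_order = stroke_order(counts)
# A manual correction treating U+8303 as 9 strokes (Traditional form).
corrected_order = stroke_order(counts, overrides={0x8303: 9})
```

With the override applied, U+59DA and U+8303 tie at 9 strokes and the
code point decides, so U+59DA sorts first. A real tie-breaker would
presumably use radical or stroke-type information instead; the point is
only that the rule is explicit and reproducible.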

Stephan

On 6/25/2012 3:02 PM, Matt Ma wrote:
> Hi Stephan,
>
> I agree that those orders require a great deal of work.
>
> For the stroke order, the specification (现代汉语通用字笔顺规范) explicitly shows
> the stroke order for 7,000 commonly used Simplified Chinese characters
> in P.R. China. It also has a set of rules aiming to reduce the ambiguity
> in how strokes are counted and ordered. Perhaps the characters listed in
> the spec can be used as a starting point.
>
> Thanks,
> Matt
>
> On Fri, Jun 22, 2012 at 7:43 PM, Stephan Stiller <sstiller_at_stanford.edu> wrote:
>> Dear Matt,
>>
>> I think those tasks would take quite a bit of work, because (1) the three
>> orders you are mentioning are all mathematically underspecified and (2)
>> they're partial orders even when considering only what you'd normally
>> consider the respective target domains (certain subsets of CJKV).
>>
>> I'm sure many or most people reading this know this, but the question is
>> which committee would get rid of the underspecification (also, according to
>> what principles?), fine-tune the respective target domains, and such.
>> (Perhaps the IICore people have done parts of the footwork already?)
>>
>> Stephan
>>
>>
>> On 6/22/2012 5:05 PM, Matt Ma wrote:
>>> Entered ticket #4949 for Simplified Chinese, stroke order.
>>>
>>> Thanks,
>>> Matt
>>>
>>> On Fri, Jun 22, 2012 at 12:55 PM, Mark Davis ☕ <mark_at_macchiato.com> wrote:
>>>> There are no current plans to do that. If you want to present a case for
>>>> adding additional collation sequences to CLDR, please start the process
>>>> by filing a bug at http://unicode.org/cldr/trac/newticket
>>>>
>>>> ________________________________
>>>> Mark
>>>>
>>>> — Il meglio è l’inimico del bene —
>>>>
>>>>
>>>>
>>>> On Fri, Jun 22, 2012 at 11:05 AM, Matt Ma <matt.ma.umail_at_gmail.com>
>>>> wrote:
>>>>> Thanks all for the clarification. Are there any plans to provide the
>>>>> following collations in CLDR?
>>>>>
>>>>> 1. Simplified Chinese, stroke order, based on 现代汉语通用字笔顺规范 (PRC-China
>>>>> modern Chinese commonly used characters standard stroke orders,
>>>>> mentioned in http://en.wikipedia.org/wiki/Stroke_order).
>>>>>
>>>>> 2. Simplified Chinese, radical order
>>>>>
>>>>> 3. Traditional Chinese, radical order
>>>>>
>>>>> Thanks,
>>>>> Matt
>>>>>
>>>>> On Sat, Jun 9, 2012 at 1:02 AM, Katsuhiko Momoi <katmomoi_at_gmail.com>
>>>>> wrote:
>>>>>> Unihan-6.2.0d1/Unihan_DictionaryLikeData.txt is lacking the Traditional
>>>>>> Chinese stroke count. Currently it only lists:
>>>>>>
>>>>>> U+8303 kTotalStrokes 8
>>>>>>
>>>>>> I filed a ticket for a review:
>>>>>>
>>>>>> http://unicode.org/cldr/trac/ticket/4898
>>>>>>
>>>>>> (I understand that we are supposed to list the Traditional stroke count
>>>>>> after the Simplified one, delimited by a {sp}.)
>>>>>>
>>>>>> As a general observation, I glanced through a number of kTotalStrokes
>>>>>> entries for strokes 8 and 9. I did not find a single entry that listed
>>>>>> 2 stroke counts. This seems odd as there should be other stroke count
>>>>>> differences between Simplified and Traditional Chinese. I suspect that
>>>>>> this is an area needing more than one correction -- it would be better
>>>>>> to do a systematic review.
>>>>>>
>>>>>> - Kat
>>>>>>
>>>>>> On Fri, Jun 8, 2012 at 3:44 PM, Mark Davis ☕ <mark_at_macchiato.com>
>>>>>> wrote:
>>>>>>> It can supply the data for both, if they differ. That's done with two
>>>>>>> fields.
>>>>>>>
>>>>>>> However, in this case there is only one value; if that's incorrect for
>>>>>>> this character someone should file feedback.
>>>>>>>
>>>>>>> ________________________________
>>>>>>> Mark
>>>>>>>
>>>>>>> — Il meglio è l’inimico del bene —
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 8, 2012 at 2:41 PM, Claire Ho (賀靜蘭) <claireho_at_google.com>
>>>>>>> wrote:
>>>>>>>> I checked tr38: according to the description of kTotalStrokes, it
>>>>>>>> provides stroke count data for both Simplified Chinese and
>>>>>>>> Traditional Chinese. So I have no concerns.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Claire.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 8, 2012 at 2:33 PM, Claire Ho (賀靜蘭) <claireho_at_google.com>
>>>>>>>> wrote:
>>>>>>>>> Hi Mark
>>>>>>>>>
>>>>>>>>>> There you find the line:
>>>>>>>>>> U+8303 kTotalStrokes 8
>>>>>>>>> In Traditional Chinese, U+8303 has 9 strokes, as Matt mentioned in
>>>>>>>>> his email.
>>>>>>>>>
>>>>>>>>> The radical "++" is counted as 4 strokes. I think there are several
>>>>>>>>> radicals that have the same issue, with different stroke counts
>>>>>>>>> in Simplified Chinese and Traditional Chinese.
>>>>>>>>>
>>>>>>>>> Claire.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jun 7, 2012 at 5:54 PM, Mark Davis ☕ <mark_at_macchiato.com>
>>>>>>>>> wrote:
>>>>>>>>>> On Thu, Jun 7, 2012 at 4:28 PM, Matt Ma <matt.ma.umail_at_gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have two questions regarding the collation sequence defined in
>>>>>>>>>>> zh.xml, CLDR 21.0
>>>>>>>>>>>
>>>>>>>>>>> 1. Why is U+8303 (范) counted as 9 strokes instead of 8 for
>>>>>>>>>>> <collation type="stroke">? As a reference, U+59DA (姚) is counted
>>>>>>>>>>> as 9 strokes but sorted before U+8303 (范).
>>>>>>>>>>
>>>>>>>>>> CLDR now gets the stroke collation data from the kTotalStrokes
>>>>>>>>>> property. The values for that are in the file
>>>>>>>>>> Unihan/Unihan_DictionaryLikeData.txt in the Unicode Character
>>>>>>>>>> Database.
>>>>>>>>>>
>>>>>>>>>> There you find the line:
>>>>>>>>>>
>>>>>>>>>> U+8303 kTotalStrokes 8
>>>>>>>>>>
>>>>>>>>>> If that is in error, or if there is any other error in the
>>>>>>>>>> kTotalStrokes data, then please report the correct value according
>>>>>>>>>> to http://www.unicode.org/review/pri230/ so that it can be fixed.
>>>>>>>>>>
>>>>>>>>>> As a related matter, CLDR now gets the pinyin collation data from
>>>>>>>>>> the kMandarin property. The values for that are in the file
>>>>>>>>>> Unihan/Unihan_Readings.txt in the Unicode Character Database. So if
>>>>>>>>>> any of those are in error, they should also be reported as per
>>>>>>>>>> http://www.unicode.org/review/pri230/ .
>>>>>>>>>>
>>>>>>>>>> The beta data is in ftp://www.unicode.org/Public/6.2.0/ucd/,
>>>>>>>>>> currently in ftp://www.unicode.org/Public/6.2.0/ucd/Unihan-6.2.0d1.zip,
>>>>>>>>>> but as the beta proceeds, the d1 might change to d2, d3...
>>>>>>>>>>
>>>>>>>>>>> 2. Does the collation type, stroke, apply to both Simplified and
>>>>>>>>>>> Traditional Chinese, as I do not see anything defined in
>>>>>>>>>>> zh_Hant.xml under "stroke"?
>>>>>>>>>>
>>>>>>>>>> Let me look at that.
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Matt
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Katsuhiko Momoi <katmomoi_at_gmail.com>
>>>>>>
>>>>>>
>>
Received on Mon Jun 25 2012 - 23:47:16 CDT