CJK Ideograph Encoding Velocity (was: Re: Unicode Emoji 11.0 characters now ready for adoption!)

From: Ken Whistler via Unicode <unicode_at_unicode.org>
Date: Mon, 5 Mar 2018 09:40:46 -0800

John,

I think this may be giving the list a somewhat misleading picture of the
actual statistics for encoding of CJK unified ideographs. The "500
characters a year" or "1000 characters a year" limits are administrative
limits set by the IRG for national bodies (and others) submitting
repertoire to the "working set" that the IRG then segments into chunks
for processing to prepare new increments for actual encoding.

In point of fact, if we take 1991 as the base year, the *average* rate
of encoding new CJK unified ideographs now stands at 3379 per annum
(87,860 as of Unicode 10.0). By "encoding" here, I mean the final, finished
publication of the encoded characters -- not the larger number of
potentially unifiable submissions that eventually go into a publication
increment. There is a gradual downward drift in that number over time,
because of the impact on the stats of the "big bang" encoding of 42,711
ideographs for Extension B back in 2001, but recently, the numbers have
been quite consistent with an average incremental rate of about 3000 new
ideographs per year:

5762 added for Extension E in 2015

7473 added for Extension F in 2017

~ 4934 to be added for Extension G, probably to be published in 2020

If you run the average calculation including Extension G, assuming 2020,
you end up with a cumulative per annum rate of 3200, not much different
from the calculation done as of today.
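
For anyone who wants to reproduce the arithmetic, here is a minimal
sketch in Python. It assumes the elapsed time is counted simply as
(end year - 1991), with Unicode 10.0 taken as 2017 and Extension G as
2020; the function and variable names are purely illustrative.

```python
# Figures quoted in this message; names are illustrative only.
BASE_YEAR = 1991
TOTAL_UNICODE_10 = 87_860      # CJK unified ideographs as of Unicode 10.0
UNICODE_10_YEAR = 2017         # assumption: Unicode 10.0 counted as 2017
EXTENSION_G_ESTIMATE = 4_934   # approximate size of Extension G
EXTENSION_G_YEAR = 2020        # assumption: probable publication year


def per_annum_rate(total, end_year, base_year=BASE_YEAR):
    """Cumulative average of ideographs encoded per year since base_year."""
    return total / (end_year - base_year)


rate_now = per_annum_rate(TOTAL_UNICODE_10, UNICODE_10_YEAR)
rate_with_g = per_annum_rate(
    TOTAL_UNICODE_10 + EXTENSION_G_ESTIMATE, EXTENSION_G_YEAR
)

print(round(rate_now))     # 3379
print(round(rate_with_g))  # 3200
```

Under those assumptions the two results round to the 3379 and 3200
figures given above.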

And as for the implication that China, in particular, is somehow limited
by these numbers, one should note that the vast majority of Extension G
is associated with Chinese sources. Although a substantial chunk is
formally labeled with a "UK" source this time around, almost all of
those characters represent a roll-in of systematic simplifications, of
various sorts, associated with PRC usage. (People who want to check can
take a look at L2/17-366R in the UTC document registry.)

--Ken

On 3/5/2018 7:13 AM, via Unicode wrote:
> Dear All,
>
> to simplify discussion I have split the points.
>
>>> On 2018/03/01 12:31, via Unicode wrote:
>>>
>>>> Third, I cannot confirm or deny the "500 characters a year" limit, but
>>>> I'm quite sure that if China (or Hong Kong, Taiwan,...) had a real need
>>>> to encode more characters, everybody would find a way to handle these.
>
>
> Chinese characters for Unicode first go to IRG (or ISO/IEC
> JTC1/SC2/WG2/IRG) website. The limit of 500 a year for China is an
> average based on IRG #48 document regarding working set 2017
> http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg48/IRGN2220_IRG48Recommends.pdf
> which explicitly states "each submission shall not exceed 1,000
> characters". The People's Republic of China, which hopefully we can
> all agree has a population of over 1,000,000,000, is one member of the
> IRG and was therefore limited to submitting at most 1,000 characters.
> The earliest possible date for the next working set is two or three
> years later, that is 2019 or 2020, so that's an average limit of
> either 500 or 333 characters a year.
>
> Regards
> John
Received on Mon Mar 05 2018 - 11:41:16 CST