The Unicode Consortium Discussion Forum


All times are UTC - 6 hours [ DST ]




 Post subject: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Sat Oct 20, 2012 9:01 am 
Offline

Joined: Sat Aug 06, 2011 9:02 am
Posts: 43
It seems that these two statements on page 399 of Chapter 12 are contradictory:

First

"Simplified and Traditional Chinese. There are currently two main varieties of written
Chinese: “simplified Chinese” (jiǎntǐzì), used in most parts of the People’s Republic of
China (PRC) and Singapore, and “traditional Chinese” (fántǐzì), used predominantly in
the Hong Kong and Macao SARs, Taiwan, and overseas Chinese communities. The process
of interconverting between the two is a complex one. This complexity arises largely because
a single simplified form may correspond to multiple traditional forms, such as U+53F0 台,
which is a traditional character in its own right and the simplified form for U+6AAF 檯,
U+81FA 臺, and U+98B1 颱."

Second

"There are two PRC national standards, GB 2312-80 and GB 12345-90, which are intended
to represent simplified and traditional Chinese, respectively. The character repertoires of
the two are the same, but the simplified forms occur in GB 2312-80 and the traditional
ones in GB 12345-90."

Could you explain ?


 
 Post subject: Re: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Sat Oct 20, 2012 3:16 pm 
Offline

Joined: Wed Feb 10, 2010 5:00 pm
Posts: 17
Location: San Jose, CA, USA, Earth
GB/T 12345-90 is best described as the traditional analog of GB 2312-80. In other words, for any hanzi in GB 2312-80 that is considered to be simplified, GB/T 12345-90 has, at the same code point, the traditional form. There are approximately 2,000 such cases, which is about one-third of GB 2312-80. In other words, approximately two-thirds of the hanzi in GB 2312-80 and GB/T 12345-90 are the same.

Let us use Row-Cell 16-10 (0x302A) from both legacy character sets as an example:

% grep kIRG_GSource Unihan_IRGSources.txt | grep "G0-302A"
U+853C kIRG_GSource G0-302A

% grep kIRG_GSource Unihan_IRGSources.txt | grep "G1-302A"
U+85F9 kIRG_GSource G1-302A

The G0 source is GB 2312-80, and the G1 source is GB/T 12345-90. Note that both share the same character code (hexadecimal 302A), but their sources are different. They have different Unicode code points.
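
The same lookup can be sketched in Python. This is only an illustration: the two sample records are copied from the grep output above, the tab-separated layout is an assumption about how Unihan_IRGSources.txt is formatted, and the function name `g_source_index` is made up for this example.

```python
# Sample records copied from the grep output above.
# Assumption: the real file separates its three fields with tabs.
SAMPLE = (
    "U+853C\tkIRG_GSource\tG0-302A\n"
    "U+85F9\tkIRG_GSource\tG1-302A\n"
)

def g_source_index(text: str) -> dict:
    """Map each kIRG_GSource value (e.g. 'G0-302A') to its Unicode code point."""
    index = {}
    for line in text.splitlines():
        parts = line.split("\t")
        # Skip comments, blank lines, and records for other fields.
        if len(parts) != 3 or parts[1] != "kIRG_GSource":
            continue
        index[parts[2]] = parts[0]
    return index

index = g_source_index(SAMPLE)
# Same legacy code (302A), two different sources, two different code points.
for source in ("G0-302A", "G1-302A"):
    print(source, "->", index[source])
```

Keying the index on the full source value keeps the G0 and G1 entries apart even though they share the code 302A.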

The Unihan Database further reflects their status as simplified/traditional pairs:

% grep "^U+853C" Unihan_Variants.txt
U+853C kTraditionalVariant U+85F9

% grep "^U+85F9" Unihan_Variants.txt
U+85F9 kSimplifiedVariant U+853C

Does this help?


 
 Post subject: Re: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Sat Oct 20, 2012 7:07 pm 
Offline

Joined: Sat Aug 06, 2011 9:02 am
Posts: 43
Lunde

I call your attention again to the phrase in red, in this paragraph on page 399 of Chapter 12: East Asian Scripts

"Simplified and Traditional Chinese. There are currently two main varieties of written
Chinese: “simplified Chinese” (jiǎntǐzì), used in most parts of the People’s Republic of
China (PRC) and Singapore, and “traditional Chinese” (fántǐzì), used predominantly in
the Hong Kong and Macao SARs, Taiwan, and overseas Chinese communities. The process
of interconverting between the two is a complex one. This complexity arises largely because
a single simplified form may correspond to multiple traditional forms, such as U+53F0 台,
which is a traditional character in its own right and the simplified form for U+6AAF 檯,
U+81FA 臺, and U+98B1 颱."

As far as I can understand, this paragraph is saying, for instance, that the character U+53F0 台 is a simplified form of the character U+6AAF 檯.

Let's then obtain the legacy character codes for these two Unicode characters, U+53F0 and U+6AAF, the same way you did for the characters U+853C and U+85F9 in your example:

U+53F0 kIRG_GSource G0-4C28
U+6AAF kIRG_GSource G1-786D

Since the two codes differ (4C28 versus 786D), this tells me that U+53F0 is not a simplified form of U+6AAF, which, in a different way, confirms the inconsistency that I pointed out in my first post.


 
 Post subject: Re: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Sat Oct 20, 2012 8:06 pm 
Offline

Joined: Wed Feb 10, 2010 5:00 pm
Posts: 17
Location: San Jose, CA, USA, Earth
Let's see...

Here is the simplified form:

U+53F0 kIRG_GSource G0-4C28

And, here are its three traditional forms:

U+6AAF kIRG_GSource G1-786D
U+81FA kIRG_GSource G1-4C28
U+98B1 kIRG_GSource G1-786C

Here are the simplified/traditional relationships:

U+53F0 kTraditionalVariant U+53F0 U+6AAF U+81FA U+98B1
U+6AAF kSimplifiedVariant U+53F0
U+81FA kSimplifiedVariant U+53F0
U+98B1 kSimplifiedVariant U+53F0

I don't see the problem.


 
 Post subject: Re: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Sat Oct 20, 2012 8:19 pm 
Offline

Joined: Wed Feb 10, 2010 5:00 pm
Posts: 17
Location: San Jose, CA, USA, Earth
Part of the confusion is that U+53F0 itself is considered a traditional form. In other words, all four traditional characters—U+53F0, U+6AAF, U+81FA, and U+98B1—folded into a single simplified form, U+53F0, which happens to be the same as one of the four traditional forms. This is precisely why U+53F0 is listed among the kTraditionalVariant values for itself:

U+53F0 kTraditionalVariant U+53F0 U+6AAF U+81FA U+98B1

When the simplified/traditional relationship is pure one-to-one, which is the vast majority of cases, it is as I explained in my first reply to your thread.
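
The difference between the one-to-one case and the U+53F0 case can be sketched in Python. The records below are the ones quoted in this thread; the tab-separated layout is an assumption about Unihan_Variants.txt, and the helper names `traditional_variants` and `describe` are hypothetical.

```python
# Records quoted in this thread (assumed to be tab-separated in the real file).
SAMPLE = (
    "U+853C\tkTraditionalVariant\tU+85F9\n"
    "U+85F9\tkSimplifiedVariant\tU+853C\n"
    "U+53F0\tkTraditionalVariant\tU+53F0 U+6AAF U+81FA U+98B1\n"
    "U+6AAF\tkSimplifiedVariant\tU+53F0\n"
    "U+81FA\tkSimplifiedVariant\tU+53F0\n"
    "U+98B1\tkSimplifiedVariant\tU+53F0\n"
)

def traditional_variants(text: str) -> dict:
    """Collect the kTraditionalVariant values for each simplified code point."""
    variants = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[1] == "kTraditionalVariant":
            variants[parts[0]] = parts[2].split()
    return variants

def describe(code_point: str, targets: list) -> str:
    """Label a mapping as one-to-one or one-to-many, noting self-mappings."""
    kind = "one-to-one" if len(targets) == 1 else "one-to-many"
    if code_point in targets:
        kind += ", also traditional in its own right"
    return kind

for cp, targets in traditional_variants(SAMPLE).items():
    # chr() turns the hexadecimal code point into the character itself.
    print(cp, chr(int(cp[2:], 16)), describe(cp, targets))
```

A converter built on such a table can map U+853C back to U+85F9 mechanically, but for U+53F0 it has to choose among four candidates, which is the complexity the quoted paragraph refers to.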


 
 Post subject: Re: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Sun Oct 21, 2012 7:16 pm 
Offline

Joined: Sat Aug 06, 2011 9:02 am
Posts: 43
Sorry for taking some time to answer your last two posts, but I was trying to get acquainted with the Unihan Database, which was completely new to me until now.

I can understand your last two responses. They make sense and I thank you for your input. But I'm still struggling with some points that I would very much appreciate your clarifying:

    1. Take for example the character U+53F0. As can be seen here, the properties kGB0 and kGB1 for this character are, respectively, 4408 and 8875, which correspond to the hexadecimal values 0x4C28 and 0x786B. Now, kIRG_GSource for the character U+53F0 shows, as expected, the value G0-4C28. Why doesn't it show G1-786B as well?
    2. I tried for a long time to find on the internet an official site for the PRC standards GB 2312-80 and GB 12345-90. I did find these two unofficial sites for the GB 2312-80, http://www.chinese-tools.com/resources/gb2312-80-table.html and ftp://ftp.oreilly.com/examples/cjkvinfo/AppE/gb2312.pdf, in English, but I couldn't find anything for the standard GB 12345-90. I'd like to know, for example, the glyph for the character code 8875 (kuten form) on this standard, which by the way, is not listed on GB 2312-80. How should I proceed ?
    3. I repeat below the second statement on my first question on this thread :

    "There are two PRC national standards, GB 2312-80 and GB 12345-90, which are intended
    to represent simplified and traditional Chinese, respectively. The character repertoires of
    the two are the same, but the simplified forms occur in GB 2312-80 and the traditional
    ones in GB 12345-90."

    Given that the characters U+6AAF, U+81FA and U+98B1 are in GB 12345-90, but not in GB 2312-80, is it correct to say that the character repertoires are the same for these two character sets ?
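
The kuten-to-hexadecimal arithmetic used in point 1 can be sketched as follows: each of the two decimal digit pairs (row, then cell) is offset by 0x20 to give one byte of the code. The function names are made up for this illustration.

```python
def kuten_to_hex(kuten: str) -> str:
    """Convert a four-digit kuten value (row, then cell) to the hex GB code."""
    row, cell = int(kuten[:2]), int(kuten[2:])
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError(f"row and cell must each be in 1..94: {kuten!r}")
    # Each byte is the row or cell number plus 0x20.
    return f"{row + 0x20:02X}{cell + 0x20:02X}"

def hex_to_kuten(code: str) -> str:
    """Convert a hex GB code back to its four-digit kuten value."""
    value = int(code, 16)
    row, cell = (value >> 8) - 0x20, (value & 0xFF) - 0x20
    return f"{row:02d}{cell:02d}"

print(kuten_to_hex("4408"))  # kGB0 value quoted for U+53F0
print(kuten_to_hex("8875"))  # kGB1 value quoted for U+53F0
print(hex_to_kuten("786D"))  # G1 source code quoted for U+6AAF
```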


 
 Post subject: Re: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Sun Oct 21, 2012 9:53 pm 
Offline

Joined: Wed Feb 10, 2010 5:00 pm
Posts: 17
Location: San Jose, CA, USA, Earth
Because some simplified forms are also considered traditional forms, and because multiple traditional forms were folded into the same simplified form, the relationship is not one-to-one.

Thus, for GB/T 12345-90, the repertoire is not the same as GB 2312-80. Instead, it is more accurate to state that the repertoires are largely the same and parallel, differing only in that GB/T 12345-90 provides the traditional form at the corresponding GB 2312-80 code point.

GB/T 12345-90 includes 103 additional hanzi in rows 88 and 89 (GB 2312-80 includes hanzi in rows 16 through 87). These 103 additional hanzi are somewhat difficult to explain. Some of them are cases in which the simplified form is itself also a traditional form, meaning that two traditional forms folded into one form, and the form that was at the GB 2312-80 code point was moved to row 88 or 89, and replaced by the other traditional form. I believe that these account for 41 of these 103 hanzi.
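
As a rough illustration of the row layout just described, the following Python sketch sorts the GB/T 12345-90 codes quoted in this thread into the parallel rows (16 through 87) and the additional rows (88 and 89). The code for U+53F0 is derived from the kGB1 value 8875 quoted earlier; the helper names are hypothetical.

```python
# GB/T 12345-90 codes quoted in this thread, keyed by Unicode code point.
# U+53F0's entry comes from its kGB1 kuten value 8875, i.e. 0x786B.
G1_CODES = {
    "U+81FA": "4C28",
    "U+53F0": "786B",
    "U+98B1": "786C",
    "U+6AAF": "786D",
}

def row_of(code: str) -> int:
    """Return the row number encoded in the first byte of a GB code."""
    return (int(code, 16) >> 8) - 0x20

def placement(code: str) -> str:
    """Say which block of rows a code falls into."""
    row = row_of(code)
    if 16 <= row <= 87:
        return "rows 16-87, parallel to GB 2312-80"
    if row in (88, 89):
        return "rows 88-89, additional hanzi"
    return "outside the hanzi rows discussed here"

for cp, code in G1_CODES.items():
    print(cp, code, f"row {row_of(code)}", placement(code))
```

On this data, U+81FA sits at 4C28, the code that GB 2312-80 assigns to U+53F0, while U+53F0 itself falls in row 88. That looks like an instance of the move described above, although nothing in this thread confirms it explicitly.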

Pages 897 through 916 of Appendix Q of "CJKV Information Processing" First Edition have the information you seek. You'll find it here: http://examples.oreilly.com/97805965144 ... e-appG.pdf

Scan through the PDF until you find "897" as the page number, which should be the 43rd page of the PDF.

Given that both standards have been supplanted by GB 18030, it is a better use of time, energy, and resources to focus on that standard.


 
 Post subject: Re: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Mon Oct 22, 2012 7:47 am 
Offline

Joined: Sat Aug 06, 2011 9:02 am
Posts: 43
Although you didn't respond to my point 1 above, I must say this was a heck of a good answer. Thanks a lot. I really appreciated that.


 
 Post subject: Re: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Mon Oct 22, 2012 8:56 am 
Offline

Joined: Wed Feb 10, 2010 5:00 pm
Posts: 17
Location: San Jose, CA, USA, Earth
The answer to Question #1 above, I suspect, is that only a single value is allowed for the kIRG_GSource field. The kIRG_GSource field is not unique in this regard; the kIRG_JSource has similar occurrences, because there is a significant overlap between the JIS X 0212 and JIS X 0213 standards.


 
 Post subject: Re: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Mon Oct 22, 2012 9:20 am 
Offline

Joined: Sat Aug 06, 2011 9:02 am
Posts: 43
It's all cleared up now. Many thanks!


 
 Post subject: Re: Character repertoire of GB 2312-80 and GB 12345-90
PostPosted: Thu Oct 25, 2012 12:21 pm 
Offline

Joined: Wed Feb 10, 2010 2:51 pm
Posts: 16
Location: Salt Lake City
Yes, there can only be one value for all of the kIRG_*Source fields. We should probably say as much in UAX #38.

