L2/08-053
Proposed Draft Unicode Technical Report #NN

U-source Ideographs

Author John Jenkins 井作恆 (jenkins@apple.com)
Date 2007-11-01
This Version http://www.unicode.org/reports/trNN/trNN-1.html
Previous Version n/a
Latest Version http://www.unicode.org/reports/trNN/
Tracking Number 1

Summary

This document describes the U-source ideographs in the Unicode standard.

Status

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].


1 Introduction

This document describes U-source ideographs as used by the Ideographic Rapporteur Group (IRG) in its Han unification work. The IRG consists of experts representing all the East Asian nations using ideographs in their writing and is the international standards body that does the actual unification work for Han.

The U-source consists of the CJK ideographs which have been submitted to the UTC as potential candidates for encoding. Not all of these are, in fact, suitable candidates for encoding, and their inclusion in this document should not be taken as approval for their encoding on the part of the UTC.

The actual U-source data are found in two additional files:

2 Text File Data

The text file consists of UTF-8 text with Unix line-endings (LF). Each line consists of seven fields separated by semicolons.
  1. The ideograph's U-source identifier. This consists of the letters "UTC" followed by five decimal digits, starting with 00001. Identifier numbers are neither skipped nor reused. Once an ideograph is assigned a U-source identifier, that identifier is fixed.
  2. A single character indicating the ideograph's current status. These are described below.
  3. A Unicode code point. This is either the code point assigned to the ideograph (if it is encoded) or the code point of the character of which it is a variant (if the status is "V"). If the ideograph is neither encoded nor a variant, this field is empty.
  4. A radical-stroke index for the ideograph, as described in [UAX38].
  5. A KangXi dictionary index for the ideograph, as described in [UAX38].
  6. An ideographic description sequence (IDS) for the ideograph, if one can be generated.
  7. A string indicating the ideograph's source and an optional index within the source.
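
The seven-field layout above can be sketched as a small Python parser. This is only an illustration, not part of the data files: the names `USourceRecord` and `parse_line` are hypothetical, and the code assumes that field 3, when present, is written in the usual `U+XXXX` notation (the text above does not state the exact notation).

```python
import re
from typing import NamedTuple, Optional

# Field 1: the letters "UTC" followed by five decimal digits.
ID_PATTERN = re.compile(r"UTC[0-9]{5}")


class USourceRecord(NamedTuple):
    """One line of the text file (hypothetical container type)."""
    identifier: str
    status: str
    code_point: Optional[int]  # None when field 3 is empty
    radical_stroke: str
    kangxi_index: str
    ids: str
    source: str


def parse_line(line: str) -> USourceRecord:
    """Split one LF-terminated line into its seven fields."""
    # maxsplit=6 is a defensive assumption: any stray semicolon
    # stays inside the final (source) field.
    fields = line.rstrip("\n").split(";", 6)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    identifier, status, code_point, rs, kangxi, ids, source = fields
    if ID_PATTERN.fullmatch(identifier) is None:
        raise ValueError(f"malformed U-source identifier: {identifier!r}")

    value = None
    if code_point:
        # Assumption: "U+XXXX" notation; a bare hex string is also accepted.
        text = code_point.upper()
        if text.startswith("U+"):
            text = text[2:]
        value = int(text, 16)
    return USourceRecord(identifier, status, value, rs, kangxi, ids, source)
```

For example, `parse_line("UTC00344;V;U+69AB;;;;")` yields a record whose `code_point` is `0x69AB`; the empty trailing fields in that sample line are placeholders, not real data.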

2.1 The Status Field

The status field reflects the ideograph's current status. The value of this field can change over time. The possible values are C, D, U, V, W, and X.

A status of C means that the ideograph is found in Extension C, which is currently under ballot in WG2.

A status of D means that the ideograph has been submitted to the IRG as part of the UTC's Extension D proposal.

A status of U means that the ideograph is already encoded in Unicode. Characters with a status of U were either added to the U-source database in error, or are characters encoded in Unicode before the IRG began its work.

A status of V means that the ideograph is a variant of a character encoded in Unicode. These variants are not limited to Z-variants. Other variants include glyphs with components rearranged (e.g., UTC00344, which rearranges the components of U+69AB but is pronounced the same and means the same), simplified versions of encoded characters (e.g., UTC00842), and ideographs which mean the same and are pronounced the same as encoded ideographs and have a sufficiently similar shape as to be easily mistaken for one another (e.g., UTC00399). This is a deliberately less strict, if somewhat more subjective, standard than is used for unification work.

A status of W means that the ideograph is not suitable for encoding. An example here is UTC00118, which is used as a decoration in the novels Xenocide and Children of the Mind by Orson Scott Card. While the character does have an apparent intended meaning (something like "monster-killer"), it isn't suitable for encoding because of its ad hoc, nonce nature and lack of generalized use outside of the context of two specific English-language novels. The bulk of the characters with a status of W are Wenlin-specific Z-variants which should be represented (if at all) via a variation sequence defined by Wenlin, not by the UTC.

Finally, a status of X means that the ideograph is a candidate for inclusion in an encoding proposal post-dating Extension D.
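
The six status values can be modelled as an enumeration so that unknown letters are rejected early. The following is a minimal sketch; the member names and the `describe` helper are invented for illustration, and the descriptions merely paraphrase the definitions above.

```python
from enum import Enum


class Status(Enum):
    """Status letters from section 2.1 (member names are hypothetical)."""
    EXTENSION_C = "C"
    EXTENSION_D = "D"
    ENCODED = "U"
    VARIANT = "V"
    NOT_SUITABLE = "W"
    FUTURE_CANDIDATE = "X"


# Short paraphrases of the definitions given in this section.
DESCRIPTIONS = {
    Status.EXTENSION_C: "found in Extension C",
    Status.EXTENSION_D: "submitted to the IRG in the UTC's Extension D proposal",
    Status.ENCODED: "already encoded in Unicode",
    Status.VARIANT: "variant of a character encoded in Unicode",
    Status.NOT_SUITABLE: "not suitable for encoding",
    Status.FUTURE_CANDIDATE: "candidate for a proposal post-dating Extension D",
}


def describe(letter: str) -> str:
    """Return a description; Status() raises ValueError for unknown letters."""
    return DESCRIPTIONS[Status(letter)]
```

Because the status of an ideograph can change over time, code built on such an enumeration should treat the value as a snapshot of one version of the data rather than a permanent property.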

2.2 The Source Field

The source field contains source information, generally a source tag that is usually followed by a source-specific index string. Source tags and indices are separated by a space, and multiple source indices are separated by commas. Multiple sources are separated by asterisks.

The source tag may be a URI, in which case the index string is the date (year-month-day) when the URI was accessed. The source tag may also be a U-source index for cases where an ideograph was added to the U-source twice. The source tags beginning with a lower-case k correspond to fields within the Unihan database. Please consult [UAX38] for information on these sources and the format and meaning of the index strings.
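
The separators described in this section (asterisks between sources, a space between tag and indices, commas between indices) suggest a simple splitting routine. The sketch below is an assumption-laden illustration: `parse_source_field` is a hypothetical name, and the code assumes that tags, including URIs, contain no spaces or asterisks and that index strings contain no commas.

```python
from typing import List, Tuple


def parse_source_field(field: str) -> List[Tuple[str, List[str]]]:
    """Split a source field into (tag, [indices]) pairs."""
    sources = []
    for entry in field.split("*"):       # multiple sources: asterisks
        entry = entry.strip()
        if not entry:
            continue
        # The tag ends at the first space; the rest is the index string.
        tag, _, index_part = entry.partition(" ")
        indices = [i.strip() for i in index_part.split(",") if i.strip()]
        sources.append((tag, indices))   # indices is [] when there is none
    return sources
```

With an invented field value such as `"XHC 0123.04,0456.07*ABC2"`, this returns two pairs: the tag `XHC` with two index strings, and the tag `ABC2` with an empty index list. A URI tag followed by its access date is handled the same way, with the date as the single index.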

The remaining sources are listed below. The left column contains the source tag. The right column contains bibliographic information for the source plus a description of source index tags, if any.

ABC2            DeFrancis, John. ABC Chinese-English Dictionary. Honolulu: University of Hawaiʼi Press, 1999.
                No source index
Adobe-Japan1-6  The Adobe-Japan1-6 glyph collection
                The glyph index within the set
Cheng           Cheng Tso-Hsin, ed. A complete checklist of species and subspecies of the Chinese birds. Beijing: Science Press, 2000.
                No source index
CN              Vũ Văn Kính, ed. Đại Tự Điển Chữ Nôm. Ho Chi Minh City: Nhà xuất bản văn nghệ, 1998? [I need someone Vietnamese to translate the rest of the title page for me]
                A string of the form [01][0-9]{3}\.[0-9]{2} indicating the page and position on the page.
DYC             [Richard needs to supply this]
GB18030-2000    GB18030-2000
                No source index
LDS             "Required Character List Supplied by The Church of Jesus Christ of Latter-day Saints"
                The character index within the document
Shangwu         Huang Giangshang, ed. Shangwu Xin Cidian. Hong Kong: The Commercial Press, 1991. ISBN 962-07-0133-X
                A string of the form [0-9]{3}\.[0-9]{2} indicating the page and position on the page.
TUS             The Unicode Consortium. The Unicode Standard, Version 1.0, Volume 2. Reading, Mass.: Addison-Wesley Publishing Company, 1992. ISBN 0-201-60845-6
                The character's code point in the form U\+FA[0-9A-F]{2}
UDR             A defect report filed against the Unicode Standard or other direct communication with the Unicode editorial committee
                No source index
WL              Wenlin v. 3.1.8 http://www.wenlin.com
                The PUA code point assigned to the ideograph, in the form E[0-9A-F]{3}
XHC             中国社会科学院语言研究所词典编辑室, ed. Xiandai Hanyu Cidian. Beijing: The Commercial Press, 2003.
                A string of the form [01][0-9]{3}\.[0-9]{2} indicating the page and position on the page.
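
The index formats given in the table above are already regular expressions, so they can be used directly to check index strings. The following sketch copies those patterns verbatim; the names `INDEX_PATTERNS` and `index_is_valid` are hypothetical, and tags whose index format is not given as a pattern (for example LDS or Adobe-Japan1-6) are deliberately left unchecked.

```python
import re
from typing import Optional

# Patterns copied from the source table; only tags with an explicit
# pattern are listed.
INDEX_PATTERNS = {
    "CN": re.compile(r"[01][0-9]{3}\.[0-9]{2}"),   # page.position
    "Shangwu": re.compile(r"[0-9]{3}\.[0-9]{2}"),  # page.position
    "TUS": re.compile(r"U\+FA[0-9A-F]{2}"),        # code point
    "WL": re.compile(r"E[0-9A-F]{3}"),             # PUA code point
    "XHC": re.compile(r"[01][0-9]{3}\.[0-9]{2}"),  # page.position
}


def index_is_valid(tag: str, index: str) -> Optional[bool]:
    """Check an index string against the pattern for its source tag.

    Returns None when no pattern is known for the tag, so callers can
    tell "unchecked" apart from "invalid".
    """
    pattern = INDEX_PATTERNS.get(tag)
    if pattern is None:
        return None
    # fullmatch: the whole index string must match, not just a prefix.
    return pattern.fullmatch(index) is not None
```

Using `fullmatch` rather than `match` is a design choice here: it rejects strings that merely begin with a valid index, such as a page reference with trailing characters.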

References

[Feedback] http://www.unicode.org/reporting.html
For reporting errors and requesting information online.
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[UAX38] Unicode Han Database (Unihan)
[Unicode] The Unicode Consortium. The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0)
[Versions] Versions of the Unicode Standard
http://www.unicode.org/versions/
For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Modifications

This section indicates the changes introduced by each revision.

Revision 1