Proposed Draft Unicode Technical Report #43

A User’s Guide to the UniTangut Database

Author	Richard Cook
Date	2006-12-12
This Version	http://www.unicode.org/reports/tr43/tr43-0.html
Previous Version	None
Latest Version	http://www.unicode.org/reports/tr43/
Tracking Number	0

Summary

This document describes the organization and content of the UniTangut Database.

Status

This document is a Proposed Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

[Note to reviewers: This document describes the UniTangut.txt data set, proposed for addition to the UCD. The proposal and associated data have not yet been submitted, reviewed, revised, or accepted, and consequently this TR is a pre-proposed draft. This document was constructed as a merger of Unihan.html 5.0 and tr38-3.html, first by globally changing all references to UniHan to UniTangut, and then rewriting, rearranging, and adding content as needed. In general, please consider this document as a draft of a template which might also be more generally applied to ongoing revisions to and consolidation of Unihan documentation, and to documentation of mapping data for other large character sets (e.g. Egyptian Hieroglyphs). ]

1 Introduction to the UniTangut Database
2 Mechanics
- 2.1 Database Design
- 2.2 Web Access
- 2.3 UniTangut.txt
- 2.4 UniTangut.xml
3 Property Types: Status and Category
- 3.1 Properties by UTC Status
- 3.2 Properties by Usage Category
4 Properties in Detail
- 4.1 Property Metadata Types
- 4.2 Property Metadata, Alphabetically by Property Tag
References
Modifications

1 Introduction

The UniTangut Database is the repository for the Unicode Consortium’s collective knowledge regarding the Tangut character block of the Unicode Standard. The UniTangut Database contains mapping data linking the encoded characters to the primary print-sources and legacy encodings. The UniTangut Database is modelled after the Unihan Database (documented in TR38), and employs the same structural conventions, allowing the same data management tools to be used on both data sets.

This document is a guide to the UniTangut Database, describing the mechanics of the database, the nature of its contents, and the status of its various fields. The UniTangut Database is a work in progress: existing data is being refined, and new data is being added on a regular basis.

The UniTangut Database exists in three forms, two of which are available to the public:

search interface on the Unicode Web site (public)
text file UniTangut.txt and documentation, distributed as part of the Unicode Standard (public)
master database (internal use only)

The structures and relations among these forms of the UniTangut Database are described in 2; general Property Types (Status and Category) are described in 3; and Property Metadata Types are outlined in 4.

2 Mechanics

2.1 Database Design

The working copy of the UniTangut Database is maintained by the Unicode Consortium. The two public versions are reflections of this data at the time of a version release.

As with Unihan, the master (working) copy of the UniTangut data is stored in an SQL database with two main tables (joined on their tag fields):

the Main Table stores character property data, with one record per code point, and multiple columns per record, one column per tag;
the Tags Table stores metadata documenting the sources of and structure of the data in the first table.

For public release, the above two tables are exported to a pair of tab-delimited UTF-8 files:

The exported Main Table serves as input to the program generating the public UniTangut.txt file, and also serves as the basis for statistical information included in the Property Metadata;
The exported Tags Table provides the remainder of the Property Metadata.
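
The export step can be pictured as an unpivot: one wide row per code point becomes one three-field line per non-empty tag. The Python sketch below is illustrative only. The function name `export_lines` and the in-memory dictionary layout are assumptions, not a description of the Consortium's actual export program, and the empty `tHXM` value is invented for the demonstration.

```python
def export_lines(main_table):
    """Turn {code_point: {tag: value}} into UniTangut.txt-style data lines.

    Hypothetical helper: the real export program is not described in this
    document. Empty values are skipped, and lines come out ordered by code
    point and then by tag.
    """
    lines = []
    for code_point in sorted(main_table, key=lambda cp: int(cp[2:], 16)):
        row = main_table[code_point]
        for tag in sorted(row):
            value = row[tag]
            if value:  # omit empty columns
                lines.append(f"{code_point}\t{tag}\t{value}")
    return lines


# One wide row; the empty tHXM column is made up for illustration.
wide = {"U+17000": {"tTY": "0001", "tB5": "FA40", "tHXM": ""}}
for line in export_lines(wide):
    print(line)
```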

Most UniTangut database fields in the master SQL database are made available in the public releases. Fields not part of a public release are of several types:

fields not yet approved by the UTC for public release;
fields in progress (not yet under UTC consideration for public release);
fields needed only for internal data-management purposes;
fields of very limited general utility;
defunct fields (slated for deletion);
fields which can be determined algorithmically from other data in the database.

2.2 Web Access

When the UniTangut Database has been publicly released, the release version of the data serves as input to a second SQL database, used for the online browser-based query system. It is important to note that this searchable version of the database is identical in content to the version release. End users of the online browser-based query system do not query the working copy of the UniTangut Database.

The searchable web interface to Unicode’s Tangut data is available through the Main UniTangut Data Portal.

2.3 UniTangut.txt

The public UniTangut.txt property list file is encoded in UTF-8. The file consists of one or more header comment lines (/^#/) followed by lines of data; a trailing comment line (/^#/) ends the file and gives the file's overall line count (including all comment lines). The file has Unix line breaks (U+000A).

The UniTangut.txt text file consists of nearly two million bytes of data in roughly 100,000 lines, covering all 5,805 encoded Tangut characters.

Each line (record) of the file UniTangut.txt consists of three tab-separated fields.

Field 1 contains the Unicode Code Point in the prefixed hexadecimal form U+XXXXX (/^U\+[0-9A-F]{5}$/; that is, there are five hex digits following the U+ prefix).
Field 2 contains a Property Tag, i.e. a source name abbreviation indicating the type or source of information in the third field. These tags are documented in the present document.
Field 3 contains the Property Value (in UTF-8). These values take forms which are tag-specific, and also documented in the present document.
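
A reader for this three-field layout might look like the following Python sketch. It is a minimal illustration under stated assumptions: `parse_line` is a hypothetical name, comment lines are recognized only by a leading `#`, and the code point pattern is the one given for Field 1.

```python
import re

# Field 1 pattern from the description above: U+ followed by five hex digits.
CODE_POINT = re.compile(r"U\+[0-9A-F]{5}")


def parse_line(line):
    """Return (code_point, tag, value), or None for a comment line."""
    line = line.rstrip("\n")
    if line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    code_point, tag, value = fields
    if not CODE_POINT.fullmatch(code_point):
        raise ValueError(f"malformed code point: {code_point!r}")
    if not value:
        raise ValueError("empty property value")
    return code_point, tag, value


print(parse_line("U+17000\ttTYYZ\t53B18"))
```

Rejecting malformed lines outright, rather than skipping them, is a design choice of this sketch; a more tolerant reader could collect the errors instead.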

The following table shows an example record from UniTangut.txt, for the example Code Point U+17000 (Field 1); each line provides a Property Value (Field 3) for a unique Property Tag (Field 2):

Field 1	Field 2	Field 3
Code Point	Property Tag	Property Value
`U+17000`	`tB5`	`FA40`
`U+17000`	`tNevsky`	`I-178`
`U+17000`	`tNishida`	`1-051`
`U+17000`	`tPUA`	`U+E000`
`U+17000`	`tSN`	`1`
`U+17000`	`tSofronov`	`1075`
`U+17000`	`tTY`	`0001`
`U+17000`	`tTYBH`	`5-6`
`U+17000`	`tTYBS`	`1`
`U+17000`	`tTYP`	`9`
`U+17000`	`tTYYY`	`1.43`
`U+17000`	`tTYYZ`	`53B18`
`U+17000`	`tUNI`	`17000`
`U+17000`	`tWHYJ`	`53.171`
`U+17000`	`tWenHai`	`1460`
`U+17000`	`tXiaHan`	`0100`
`U+17000`	`tYTYL`	`2126.10`
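
Because every line repeats the code point, a consumer will usually regroup the lines into one record per character. The Python sketch below does this for three of the rows shown in the table; the names `SAMPLE` and `group_by_code_point` are illustrative assumptions.

```python
from collections import defaultdict

# Three of the rows from the table above, as raw tab-separated lines.
SAMPLE = (
    "U+17000\ttB5\tFA40\n"
    "U+17000\ttNevsky\tI-178\n"
    "U+17000\ttTYYZ\t53B18\n"
)


def group_by_code_point(text):
    """Collect data lines into {code_point: {tag: value}}."""
    records = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        code_point, tag, value = line.split("\t")
        records[code_point][tag] = value
    return dict(records)


record = group_by_code_point(SAMPLE)["U+17000"]
print(record["tTYYZ"])
```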

The UniTangut Database consists of a number of fields containing data for each Tangut character in the Unicode Standard. The field names consist entirely of ASCII letters and digits with no spaces or other punctuation except for underscore. On the model of Unihan.txt (which for historical and perhaps other reasons uses a lower-case k [= Kanji?] field name prefix), all UniTangut.txt field names start with a lower-case t.

This general naming convention (/^[tk][A-Z0-9_]+$/i) is admittedly unnecessary in an XML UCD, but nevertheless provides some redundancy in the tab-delimited text files, and also emphasizes the fact that UniTangut.txt is intended to be managed with trivial (or no) modification to existing Unihan.txt processing tools. This may prove useful in simplifying migration of these parts of the UCD to XML.
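
A tag-name check along these lines could be sketched in Python. The pattern below is an assumption that follows the prose description of field names (ASCII letters, digits, and underscore after the prefix), so that tags containing a digit, such as tB5, are accepted; `is_valid_tag` is a hypothetical helper name.

```python
import re

# Assumed pattern: prefix t (UniTangut) or k (Unihan), then one or more
# ASCII letters, digits, or underscores.
TAG_NAME = re.compile(r"[tk][A-Za-z0-9_]+")


def is_valid_tag(tag):
    """True if the whole string is a well-formed field name."""
    return TAG_NAME.fullmatch(tag) is not None


for tag in ("tB5", "tWenHai", "kMandarin", "xB5", "tTY-BH"):
    print(tag, is_valid_tag(tag))
```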

For most mapping sources, if multiple values are possible in Field 3, the values are separated by spaces (but see Section 4: Syntax). No Tangut character may have more than one instance of a given field associated with it, and no empty fields are included in the UniTangut.txt file. Each code point in the block may occur at the head (Field 1) of one or more records (lines), depending on the number of fields (tags) with records for that code point.

There is no formal limit on the lengths of any of the field values. Any Unicode characters may be used in the field values except for unescaped control characters (especially tab, newline, and carriage return). Most fields have a more restricted Syntax.

The data lines are sorted with code point as primary sort key, and field-name as secondary sort key. If the property value itself is structured, its values may also be sorted according to a sorting method detailed in the property description.
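
The ordering and one-instance-per-field rules can be checked together in a single pass. The Python sketch below is illustrative: `check_order_and_uniqueness` is a hypothetical name, and it assumes that tags compare in plain character-code order (so tWHYJ sorts before tWenHai), which agrees with the example records shown earlier but is not stated explicitly in this document.

```python
def check_order_and_uniqueness(records):
    """Check a list of (code_point, tag, value) tuples.

    Assumes tags compare by plain character-code order. Returns a list of
    human-readable problems; an empty list means none were found.
    """
    problems = []
    previous = None
    for code_point, tag, _value in records:
        key = (int(code_point[2:], 16), tag)
        if previous is not None:
            if key == previous:
                # In a sorted file, repeated fields are always adjacent.
                problems.append(f"duplicate field: {code_point} {tag}")
            elif key < previous:
                problems.append(f"out of order: {code_point} {tag}")
        previous = key
    return problems


rows = [
    ("U+17000", "tWHYJ", "53.171"),
    ("U+17000", "tWenHai", "1460"),
    ("U+17000", "tWenHai", "1460"),  # deliberate duplicate for the demo
]
print(check_order_and_uniqueness(rows))
```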

The header comment lines contain very limited metadata regarding the file itself, including the file name, version, date of production, and a pointer to the main documentation (→ you are here ←).

Ranges of Tangut code points valid for Field 1 of UniTangut.txt are listed in the following table:

Code point range	Block name	Release
U+17000 .. U+186AC	TANGUT	unassigned
U+186AD .. U+186FF	TANGUT EXTENSION A	unproposed
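
A range check for Field 1 follows directly from this table. In the Python sketch below, the `RANGES` dictionary and the `block_of` function are illustrative names; the boundaries are copied from the table and remain provisional in this draft.

```python
# Ranges copied from the table above; both are provisional in this draft.
RANGES = {
    "TANGUT": (0x17000, 0x186AC),
    "TANGUT EXTENSION A": (0x186AD, 0x186FF),
}


def block_of(code_point):
    """Return the block name for a 'U+XXXXX' string, or None if out of range."""
    value = int(code_point[2:], 16)
    for name, (first, last) in RANGES.items():
        if first <= value <= last:
            return name
    return None


first, last = RANGES["TANGUT"]
print(last - first + 1)  # number of code points in the main range
print(block_of("U+17000"))
```

The main range spans 0x186AC - 0x17000 + 1 = 5,805 code points, which agrees with the character count given for UniTangut.txt.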

Note that Tangut characters in the following blocks do not yet have code point ranges, and so have no mapping data in UniTangut.txt (though they may at some future time):

Code point range	Block name	Release
N.A.	TANGUT RADICALS	unproposed
N.A.	TANGUT RADICALS SUPPLEMENT	unproposed
N.A.	TANGUT STROKE TYPES	unproposed
N.A.	TANGUT COMPATIBILITY CHARACTERS	unproposed

2.4 UniTangut.xml

Future incarnations of the public UniTangut Database release may include a UniTangut.xml XML representation of the data and metadata.

3 Property Types: Status and Category

Each UniTangut Database field (a.k.a. tag, property) is classified according to the formal Status of this property within the Unicode Standard, as determined by the UTC.

Each field is also classified by usage Category, according to the purpose it serves.

We provide here a general discussion of these two basic classifications (UniTangut tags, by Status and by Category), followed by an overview of Property Metadata, and detailed descriptions of the individual Properties, alphabetically arranged.

Note that all data in the UniTangut Database has been donated to the Unicode Consortium, and that proofing, augmentation and publication of the data is an ongoing process, subject to available resources. If data satisfying a certain need is not currently present in the database, end-users are encouraged to contribute well-documented data for possible inclusion.

3.1 UniTangut Properties by Status

Each property has a formal Status, as determined by the UTC.

In the list of UniTangut properties (fields, tags) given below, each property is assigned a formal status. A few UniTangut properties may eventually correspond to Unicode Normative or Informative properties; at present all are Provisional. For information on the meanings of the Normative, Informative and Provisional Status flags, see definitions D33, D35, and D36 in Chapter 3 Properties of Unicode 5.0 [Unicode]. For information on properties and on the general structure of the Unicode Character Database, see UCD.html.

Status	Properties with this status
Normative	none
Informative	none
Provisional	all

3.2 UniTangut Properties by Category

Each property is also assigned to one or more functional Categories, according to the presumed utility of the field data. We distinguish the following usage categories for fields.

Category	Category Description	Properties in this category
Dictionary Indices	References to primary lexical treatments of this script entity.	tTYYZ, tTY, tWenHai, tXiaHan, tSofronov, tKychanov, etc.
Dictionary-like Data	Data derived from primary lexical sources, including phonologic, gloss, frequency, etc.	tTYYY, tTYP, etc.
WG Mappings	Mappings established by a WG2 working group.	pending (a “TRG” might expand this character set, add mappings, and resolve variant issues for non-extinct users)
Numeric Values	The numeric value(s) of a character with this property.	pending
Other Mappings	Mappings to legacy encodings.	tB5, tPUA
Radical-Stroke Counts	Radical (lexical classifier) and Residual Stroke-count assignments	tTYBS, tTYBH
Variant relations	Mappings between encoded characters, establishing usage identity according to some usage authority.	tHXM

4 UniTangut Properties in Detail

4.1 Property Metadata Types

For each field the alphabetical listing (4.2) gives the following information:

Property Metadata Type	Description
Tag	The immutable abbreviation serving as a unique key identifying this property. Each tag matches regex `/^t[A-Z0-9_]+$/i`. The tag is used in the `UniTangut.txt` file (and in the SQL database whence it derives) to mark each instance of this field. The concatenation of Unicode code point + tag uniquely identifies each record in the database.
Status	Formal UTC status, Normative, Informative, or Provisional, depending on whether it is a Normative part of the standard, an Informative part of the standard, or neither; see 3.1.
Category	Usage classification; see 3.2.
Added	Unicode version in which this property first appeared.
Modified	Unicode version in which this property was last modified; modification may be indicated as the result of change in any mutable Property Metadata category (i.e., Tag cannot change, but any of Status, Records, Description, and Values might). Not all types of modification need be noted here, but important ones relating to Status, Records, and Values will be.
Syntax	Constraints on property values, as described by a regex; fields which allow multiple values will also specify the delimiter in the regex; Syntax is a Perl 5.8 regular expression describing the formal structure of an individual Value. For example, the Syntax for the `tTYYZ` field is `/^\d{1,2}[AB]\d{1,2}$/`, which means that this tag permits only single values, and that the value begins with one or two digits, followed by an A or B, followed by one or two digits. The syntax can be used to validate the contents of a field. The regex can be written to varying degrees of stringency, given ample time and testing.
Records	Total number of Unicode characters having a record for this property.
Description	Description of the property, including a unique source identifier (bibliographic etc.), and notes of various types relevant to interpretation of the property data; the Description contains not only a description of what the field contains, but also source information, known limitations, methodology used in deriving the data, and so on.
Values	The actual property data per tag associated with each character in the UniTangut Database. These values are not included in 4.2 below, but must be obtained from the UniTangut Database itself.
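
The Syntax entry lends itself to mechanical validation. The following is a rough sketch in Python rather than Perl, so `fullmatch` stands in for the `^...$` anchors and `re.ASCII` restricts `\d` to ASCII digits. Only the tTYYZ pattern is taken from the table above; the `SYNTAX` dictionary and `value_is_valid` are illustrative names.

```python
import re

# Syntax patterns keyed by tag. Only the tTYYZ pattern comes from this
# document; a real table would be filled in from the Property Metadata.
SYNTAX = {
    "tTYYZ": re.compile(r"\d{1,2}[AB]\d{1,2}", re.ASCII),
}


def value_is_valid(tag, value):
    """True/False if the tag has a known Syntax; None if no pattern is known."""
    pattern = SYNTAX.get(tag)
    if pattern is None:
        return None
    return pattern.fullmatch(value) is not None


print(value_is_valid("tTYYZ", "53B18"))  # value from the example record
print(value_is_valid("tTYYZ", "53C18"))  # C is not an allowed letter
```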

As is the property data, so too the property metadata is a work in progress. Property Metadata types may be added, and existing types may be refined in the future. For example, bibliographic information may be extracted from the Description, and isolated in a new Bibliography property metadata type.

4.2 Property Metadata, Alphabetically by Property Tag

We now list the Properties of the UniTangut Database alphabetically, giving the above types of Property Metadata (4.1) for each.

Property Tag	$tTag
Status	$status	Category	$category
Added	$added	Modified	$modified
Syntax	$syntax	Records	$records
Description	$description

References

[Feedback]	http://www.unicode.org/reporting.html For reporting errors and requesting information online.
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[Unicode]	The Unicode Standard, Version 5.0
[Versions]	Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Modifications

This section indicates the changes introduced by each revision.

Revision 0

Zeroth version

Copyright © 2006 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
