Proposed Draft Unicode Technical Standard #37

Registration of Ideographic Variation Sequences

Version	1.0 Proposed Draft
Authors	Hideki Hiura, Eric Muller (emuller@adobe.com)
Date	2005-07-16
This Version	http://www.unicode.org/reports/tr37/tr37-1.html
Previous Version	n/a
Latest Version	http://www.unicode.org/reports/tr37/
Revision	1

Summary

This document describes the operation of the Ideographic Variation Database.

Status

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Contents

1 Introduction

2 Description

3 Format of the Ideographic Variation Database

4 Registration Procedure
- 4.1 Registration of a Collection
- 4.2 Registration of Sequences in a Collection
- 4.3 Registrar
- 4.4 Registration Fees

Appendix A. Registration Form for Collections

Appendix B. Acknowledgments

References

Modifications

1 Introduction

Characters in the Unicode Standard can be represented by a wide variety of glyphs. Occasionally the need arises in text processing to restrict or change the set of glyphs that are to be used to display a character. In special circumstances, this restriction needs to be expressed in plain text rather than by font selection or some other rich text mechanism. The Unicode Standard accommodates those circumstances with variation selectors: the code point of a graphic character can be followed by the code point of a variation selector to identify a restriction on the graphic character. The combination of a graphic character and a variation selector is known as a variation sequence (see Section 15.6, Variation Selectors, of [Unicode]).

In the case of Han ideographs, it is impossible to build a single collection of variation sequences that can satisfy all the needs of the users. The requirements of scholars, governments and publishers are too different to be accommodated by a single collection. Instead they can be met by having multiple independent collections. The Ideographic Variation Database ensures that a given variation sequence is used in at most one collection, to make interchange of text using such variation sequences reliable.

2 Description

The Ideographic Variation Database records registered variation sequences made of a base character with the Ideographic property and one of the variation selectors in the range U+E0100 to U+E01EF. Those variation sequences are suitable for use in text interchange.
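As an illustration only, the shape of such a sequence can be checked with a few lines of Python. This is a minimal sketch: the names is_ivs_selector and split_candidate are hypothetical, and the check covers only the selector range given above. It does not verify that the base character has the Ideographic property, nor that the sequence is registered.

```python
# First and last variation selectors available for Ideographic Variation
# Sequences, as given in the text (U+E0100 to U+E01EF).
IVS_SELECTOR_FIRST = 0xE0100
IVS_SELECTOR_LAST = 0xE01EF

# Size of the range: 0xE01EF - 0xE0100 + 1 selectors per base character.
SELECTOR_COUNT = IVS_SELECTOR_LAST - IVS_SELECTOR_FIRST + 1


def is_ivs_selector(code_point: int) -> bool:
    """Return True if code_point lies in the range U+E0100..U+E01EF."""
    return IVS_SELECTOR_FIRST <= code_point <= IVS_SELECTOR_LAST


def split_candidate(text: str):
    """Return (base, selector) if text has the shape of an IVS, else None.

    Only the shape is checked: exactly two code points, the second one in
    the selector range. The Ideographic property of the base and the
    registration status are NOT checked here (hypothetical helper).
    """
    if len(text) != 2:
        return None
    base, selector = ord(text[0]), ord(text[1])
    if not is_ivs_selector(selector):
        return None
    return base, selector
```

The range holds 240 selectors, so at most 240 sequences can be registered for a given base character with the selectors currently available.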

[Ed Note: will need to modify StandardizedVariants.html to include the database by reference.]

To guarantee the stability of data encoded using registered variation sequences, those sequences are never removed or reassigned.

Ideographic Variation Sequences are subject to the usual rules for variation sequences: unregistered sequences should not be used, and registered sequences should be used according to their intent. Furthermore, variation selectors are default ignorable. This implies that registrants should carefully consider whether a variation sequence, when viewed as a possible set of glyphs, is indeed a subset of the glyphs which are acceptable for the base character alone.

The registration process only ensures that once a registrant has successfully registered a variation sequence, the specific combination of base character and variation selector used to represent that sequence in text will not be used to register another sequence. However, this does not prevent two distinct variation sequences from representing the same visual form or variant of a base character.

There is no guarantee that two sequences involving the same variation selector on different base characters have any relationship, nor will a variation selector be designated, independently of any base, for any purpose. Should there be a requirement to register more than 240 sequences involving the same base character, the Unicode Consortium will seek the encoding of additional variation selectors, and make those available for registration of Ideographic Variation Sequences.

Registration of a collection does not imply suitability for any purpose. The usefulness of a given variation sequence, and the usefulness of a collection as a whole, depend to a large extent on their use. Registrants are encouraged to describe the intent of their collections, and users are also encouraged to evaluate whether a collection is useful for their purpose.

[Ed note, from Rick: But (a point of interest to some) neither the registry nor the registrar guarantees the perpetual availability of the registered entities. Thus, it is possible at some future time for the definitions pointed to by the registry to become unavailable for query or use.

(To me as an end-user, this would be a fatal flaw in the usefulness of the registry. I could have documents containing these sequences, and yet at some future time, not be guaranteed ever to be able to find out what they mean. This contrasts with the standard itself, in which one can always find out the contents by having a copy of the standard. Somehow, users of this registry should be cautioned in this.)]

3 Format of the Ideographic Variation Database

The Ideographic Variation Database consists of two data files. The first, IVD_Collections.txt, records the registered collections. The second, IVD_Sequences.txt, records the registered sequences.

In each file, lines starting with a '#' character and empty lines are comment lines. The other lines are organized into fields, separated by a semicolon; initial and trailing white space in those fields is not significant. Both files are encoded in UTF-8, using U+000A as the line separator.
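The line conventions above can be sketched as follows. This is not a reference implementation: the function name parse_ivd_lines and the sample content are hypothetical, and treating a line that contains only white space as an empty line is an assumption.

```python
def parse_ivd_lines(text: str):
    """Yield the trimmed fields of each non-comment line of an IVD data file."""
    for line in text.split("\n"):  # U+000A is the line separator
        # Lines starting with '#' and empty lines are comments.
        # Assumption: a line holding only white space also counts as empty.
        if line.startswith("#") or line.strip() == "":
            continue
        # Fields are separated by semicolons; initial and trailing
        # white space in each field is not significant.
        yield [field.strip() for field in line.split(";")]


# Hypothetical file content, not actual registry data.
sample = (
    b"# a comment line\n"
    b"\n"
    b"Example_Collection ; http://example.org/collection\n"
)

# Both files are encoded in UTF-8, so decode before splitting into lines.
records = list(parse_ivd_lines(sample.decode("utf-8")))
print(records)
```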

In IVD_Collections.txt, each line corresponds to an Ideographic Variation Collection, and there are two fields per line:

field 1: the identifier of a collection
field 2: the URL of a site describing the collection

In IVD_Sequences.txt, each line corresponds to an Ideographic Variation Sequence, and there are three fields per line:

field 1: the code points of the base character and the variation selector, separated by a space
field 2: the identifier of the collection under which the sequence is registered
field 3: the identifier of the sequence, provided by the registrant
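A line of IVD_Sequences.txt with the three fields listed above could be read as sketched below. The names SequenceRecord and parse_sequence_line are hypothetical, the sample line is invented, and the code assumes that code points are written as hexadecimal digits without a prefix, which this document does not state.

```python
from typing import NamedTuple


class SequenceRecord(NamedTuple):
    """One entry of IVD_Sequences.txt (hypothetical type for this sketch)."""
    base: int          # code point of the base character
    selector: int      # code point of the variation selector
    collection: str    # identifier of the collection
    sequence_id: str   # identifier of the sequence, provided by the registrant


def parse_sequence_line(line: str) -> SequenceRecord:
    """Parse one data line into a SequenceRecord, or raise ValueError."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, found {len(fields)}")
    code_points = fields[0].split()
    if len(code_points) != 2:
        raise ValueError("field 1 must hold a base and a variation selector")
    # Assumption: code points are hexadecimal, e.g. "4E00 E0100".
    base, selector = (int(cp, 16) for cp in code_points)
    return SequenceRecord(base, selector, fields[1], fields[2])


# Invented example line, not actual registry data.
record = parse_sequence_line("4E00 E0100; Example_Collection; ID_001")
print(record)
```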

The identifiers for collections and sequences are character strings whose first character is one of 'A'..'Z', 'a'..'z', and whose remaining characters are each one of 'A'..'Z', 'a'..'z', '0'..'9', '_'.
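The identifier syntax can be expressed as a regular expression. The sketch below assumes that a single letter is already a valid identifier, that is, that the continuation part may be empty. The name is_valid_identifier is hypothetical.

```python
import re

# A letter, followed by any number of letters, digits or underscores.
# Assumption: the continuation may be empty, so "A" alone is accepted.
IDENTIFIER_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the whole string matches the identifier syntax."""
    return IDENTIFIER_RE.fullmatch(identifier) is not None
```

Under this reading, "Example_1" is accepted, while "1abc" (leading digit), "a-b" (hyphen) and the empty string are rejected.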

4 Registration Procedure

4.1 Registration of a Collection

The registrant must first create a web page describing the intent of the collection, its principles, and any other data that may be useful for users of the collection.

Once the appropriate page is online, the intent to register the collection must be announced publicly in a manner designated by the registrar. This announcement must include the URL of the page(s) describing the proposed collection. This starts a review period of at least 90 days, during which comments and questions about the collection can be submitted to the registrant. The registrant should respond to these comments and questions.

At the end of the review period, the registrant can submit the registration form in Appendix A, together with a written and signed statement that:

the registrant will make reasonable efforts to maintain the stability of that URL and the site it points to
the variation sequences that will be registered in that collection can be used freely, without any limitation, fee or other requirement
all the comments and questions received during the review period have been addressed

Upon receipt of a complete application and the applicable fee if any, the registrar will assign a collection identifier (respecting as much as possible the suggested identifier), and add the collection to the Ideographic Variation Database.

Owners of collections can change the designated representative at any time by notifying the registrar. They can also change the URL of the web site they maintain by notifying the registrar. Ownership of a collection can be transferred to another party by notifying the registrar.

The registration of sequences in that collection can be started concurrently with the registration of the collection itself.

4.2 Registration of Sequences in a Collection

The registrant must first include the proposed sequences in the web page describing the collection (or some page linked from it). Ideally, this description should include representative glyphs for the proposed sequences.

Once the page is online, the intent to register the sequences must be announced on a mailing list designated by the registrar. This announcement must include the URL of the page(s) describing the proposed sequences. This starts a review period of 3 months, during which comments and questions about the sequences can be submitted to the mailing list. The registrant should respond to these comments and questions.

At the end of the review period, the registrant can submit the application for those sequences. This takes the form of a file in the format of IVD_Sequences.txt, except that the first field must contain only the code point of the base character.

Upon receipt of a complete application and the applicable fee if any, the registrar will assign a variation selector for each variation sequence, and add the sequences to the Ideographic Variation Database.
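This document does not specify how the registrar chooses variation selectors. Purely as an illustration, the sketch below assigns to each applied sequence the lowest selector in U+E0100..U+E01EF not yet used with the same base character, which guarantees that no combination of base character and selector is registered twice. The function name assign_selectors, the allocation policy and the sample data are all assumptions.

```python
IVS_FIRST, IVS_LAST = 0xE0100, 0xE01EF


def assign_selectors(registered, application):
    """Assign a free variation selector to each applied sequence.

    registered:  iterable of (base, selector) pairs already in the database
    application: list of (base, collection_id, sequence_id) tuples, as in an
                 application file whose first field holds only the base
    Returns a list of (base, selector, collection_id, sequence_id) tuples.
    Hypothetical policy: lowest unused selector for each base character.
    """
    used = set(registered)
    result = []
    for base, collection_id, sequence_id in application:
        selector = next(
            (vs for vs in range(IVS_FIRST, IVS_LAST + 1)
             if (base, vs) not in used),
            None,
        )
        if selector is None:
            # All 240 selectors are taken for this base character.
            raise ValueError(f"no selector left for base U+{base:04X}")
        used.add((base, selector))
        result.append((base, selector, collection_id, sequence_id))
    return result


# Invented data: one existing registration and three applied sequences.
existing = {(0x4E00, 0xE0100)}
applied = [
    (0x4E00, "Example_Collection", "A1"),
    (0x4E00, "Example_Collection", "A2"),
    (0x4E01, "Example_Collection", "B1"),
]
assigned = assign_selectors(existing, applied)
```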

4.3 Registrar

[Ed note: registration authority vs. registrar]

The registrar will be appointed by the Unicode Consortium.

4.4 Registration Fees

The registrar may impose a non-refundable processing fee for the registration of collections and sequences. If a registration application is incomplete, the registrar will inform the registrant and accept one corrected application at no fee. Further corrections to the application may require an additional fee.

The registrar may waive the processing fee for Full members of the Consortium in good standing. This waiver may be limited to some number of collections and sequences for each calendar year. Full members of the Consortium may sponsor the registration of other parties.

The Unicode Consortium may register collections or sequences at any time without any processing fee.

Appendix A. Registration Form for Collections

Application for registration of an Ideographic Variation Sequence Collection

Name and address of the registrant:

Name and email address of the representative:

URL of the web site describing the collection:

Suggested identifier for the collection:

Pattern for the sequence identifiers:

Appendix B. Acknowledgments

Thanks to Mark Davis, Deborah Goldsmith, Tatsuo Kobayashi, Rick McGowan, Michel Suignard and Ken Whistler for their help developing the registry and their feedback on this document.

References

[Feedback]	http://www.unicode.org/reporting.html For reporting errors and requesting information online.
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[Unicode]	The Unicode Standard, Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), as amended by Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1) and by Unicode 4.1.0 (http://www.unicode.org/versions/Unicode4.1.0).
[Versions]	Versions of the Unicode Standard http://www.unicode.org/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Modifications

This section indicates the changes introduced by each revision.

Revision 1

First version

Copyright © 2005 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
