L2/03-413

Ideograph Variation Selector and Variation Collection Identifier

Summary:	Proposal to change attribute of Variation Selector code range U+E0170 to U+E01EF and to add Ideograph variation collection identifier in plane 14
Version:	0.7
Last Updated:	2003-10-31
Editor:	Hideki Hiura
Authors:	Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
Key Contributors:	Rick McGowan, Kenneth Whistler, Richard Cook, Tom Bishop, Michel Suignard, Takayuki K Sato
Latest Version:	0.7 2003-10-31 http://www.openi18n.org/spec/ivs/
Older Versions:	0.6 2003-09-20 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
	0.5 2003-09-17 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
	0.4 2003-09-01 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
	0.3 2003-08-28 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida
	0.2 1998-07-29 L2/98-??? Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida
	0.1 1997-12-01 L2/97-260 Hideki Hiura, Tatsuo Kobayashi
Related Contributions:	2003-08-23 L2/03-293 Eric Muller, Ken Lunde
Feedback:	[email protected] (This address is only available during the open feedback period.)
Discussion List:	[email protected] (Only subscribers can post. To subscribe, send an empty email to [email protected] and follow the instructions emailed back.)
Namespace:	http://www.openi18n.org/spec/ivs
Status:	This document is a proposed update to a part of the Unicode Standard, Version 4.0, with a proposed addition. This is an unstable document and may not be used as reference material or cited as a normative reference by other specifications.

1. Introduction

Although there is a clear distinction between the concept of a character and the concept of its glyphic representation, the same text stream in natural language processing often has to be handled in different operational domains, depending on the field of operation. Two representative domains are as follows:

Character/contents-based processing domain:

Search and Retrieval, sort, natural language processing, machine translation.

Appearance-based processing domain:

A person's name, a place name, a geographical dictionary, a biographical dictionary, historical/geographical description.

Each field calls for its own methodology. However, the appearance-based processing domain is often misunderstood as a simple font typeface issue within the character-based processing domain. The requirements of the appearance-based processing domain cannot be satisfactorily fulfilled by ordinary font typeface settings in higher level protocols and, even worse, there will still be strong demand for appearance-based processing in plain text. The Variation Selectors of the Unicode Standard are a great step toward solving the issues of the appearance-based processing domain. However, the requirements of this domain for Han ideographs call for further enhancement of the Unicode Standard, because diverse use of variations is common practice.

1.1 Problems and Requirements in use of Variation Selector for Ideographs

To fulfill all the requirements of the appearance-based processing domain for ideographs collectively with the current Variation Selectors, the following problems and requirements need to be resolved:

If we have to go with a single set of variations, as currently defined in the Unicode Standard, Version 4.0, there needs to be a single comprehensive and definitive definition of the variation set and the base characters. It would have to be a well-maintained standard that serves everyone who needs to resolve problems in the appearance-based processing domain, including national bodies that use ideographs and include them in their national standards, as well as all the different research and study purposes.
On the other hand, a huge comprehensive variation collection does not fit the needs of those who require only a smaller, target-specific, but efficient set of variations, because of system cost and effectiveness in both data processing and rendering. Too much is as bad as too little.
There have been several attempts in the academic field and the national standards field to address the issues of the appearance-based processing domain, and these studies show that developing such a comprehensive one-fits-all set is, if not impossible, impractically costly.
The studies also show that the taxonomies of ideograph variations are diverse because of their historical and geographical backgrounds. A taxonomy based on a particular set of historical documents may not be suitable for other historical documents written in different geographic regions and/or in different eras.
Requirements vary among the parties needing their own variations, and they often conflict with one another.

To resolve the problems and requirements above, it is essential to introduce a new mechanism into the Unicode Standard.

1.3 Amusing Example of needing different variation collections

A politician accepts donations made out under any possible variation of his name, yet tries to distinguish each variation of his name when criminal charges are brought under it.

1.4 Definition

Variation Collection

Variation collection identifier sequence

2. Modification of property

This proposal addresses the need to modify the properties of a range of variation selectors, VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 (U+E0170 to U+E01EF).

The Unicode standard states in Section 15.6, Variation Selectors that:

Only the variation sequences specifically defined in the file StandardizedVariants.txt in the Unicode Character Database are sanctioned for standard use; in all other cases the variation selector cannot change the visual appearance of the preceding base character from what it would have had in the absence of the variation selector.

We propose to modify the Unicode Standard so that this clause applies only to VARIATION-SELECTOR-1 through VARIATION-SELECTOR-128.
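
The selector numbering used in this section can be related to code points with a short sketch. It assumes the standard layout of the two variation selector blocks (VS-1 through VS-16 at U+FE00..U+FE0F, VS-17 through VS-256 at U+E0100..U+E01EF); the function names are hypothetical and are not part of this proposal.

```python
def variation_selector_codepoint(n: int) -> int:
    """Return the code point of VARIATION SELECTOR-n for 1 <= n <= 256."""
    if 1 <= n <= 16:
        return 0xFE00 + (n - 1)      # VS-1..VS-16 are in the BMP
    if 17 <= n <= 256:
        return 0xE0100 + (n - 17)    # VS-17..VS-256 are in plane 14
    raise ValueError("variation selector number out of range: %d" % n)


def is_ideograph_variation_selector(cp: int) -> bool:
    """True for the range this proposal re-purposes (VS-129..VS-256)."""
    return 0xE0170 <= cp <= 0xE01EF


print(hex(variation_selector_codepoint(129)))  # first selector of the proposed range
print(hex(variation_selector_codepoint(256)))  # last selector of the proposed range
```

The two printed values correspond to the range U+E0170 to U+E01EF quoted above.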

2.1 Ideograph variation selector

[Change of semantics]
VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 can also change the visual appearance of the preceding base character, but variation sequences involving those characters are not defined in the file StandardizedVariants.txt in the Unicode Character Database. Instead, the meaning of such a variation sequence is defined by an outside agreement between producers and consumers, called a variation collection. To facilitate this agreement, the protocol described in Section 2.2, Variation collection identifier, can be used to annotate a plain Unicode text document with variation collection data. As discussed in Section 5, the base character of a variation sequence involving VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 must have the Ideographic property. If the consumer of a document using such variation sequences does not conform to the protocol by which the variation collection is identified, it can treat those variation selectors as ignorable characters, that is, not change the visual appearance of the base characters. Such a consumer must nevertheless respect the semantics of the base character, and should preserve the variation selectors in its output.

2.2 Variation collection identifier

The variation collection identifier facilitates the outside agreement between producers and consumers that is called a variation collection. Unless the collection is specified in a higher level protocol, the plane-14 Ideograph variation collection identifier

U+E0002 IDEOGRAPH VARIATION COLLECTION TAG

is used, followed by the name of the variation collection spelled out in Tag Characters (also known as bump-up ASCII, U+E0020 through U+E007F).
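
As an illustration only, the identifier sequence described above can be built by shifting each ASCII character of the collection name into the Tag Character range. This is a minimal sketch: U+E0002 is the code point proposed in this document, not an assigned Unicode character, and the function name and the restriction to printable ASCII (0x20..0x7E) are assumptions made for the example.

```python
# Proposed in this document; not an assigned Unicode character.
IVC_TAG = 0xE0002


def encode_collection_identifier(name: str) -> str:
    """Return the proposed tag followed by `name` spelled in Tag Characters."""
    tagged = []
    for ch in name:
        cp = ord(ch)
        # Assumption: only printable ASCII is accepted in a collection name.
        if not 0x20 <= cp <= 0x7E:
            raise ValueError("not representable as a Tag Character: %r" % ch)
        tagged.append(chr(0xE0000 + cp))   # "bump-up ASCII"
    return chr(IVC_TAG) + "".join(tagged)


sequence = encode_collection_identifier("JAET")
print(" ".join("U+%04X" % ord(c) for c in sequence))
```

A name containing non-ASCII characters raises `ValueError` in this sketch, since the Tag Characters only mirror ASCII.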

3. Format and Protocol

3.1 Start tag

A document or a block of documentation that uses a particular Ideograph variation collection in plain text, without using a higher level protocol, designates the collection by the plane-14 Ideograph variation collection identifier followed by the name of the variation collection spelled out in Tag Characters.

3.2 Termination of variation name

No termination character is required for a variation collection identifier sequence. Any code point outside of U+E0020 through U+E007F terminates the identifier sequence.
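
The termination rule can be sketched as a small reader that consumes Tag Characters until the first code point outside U+E0020 through U+E007F. The function and constant names are hypothetical, and U+E0002 is the code point proposed in this document rather than an assigned character.

```python
from typing import Tuple

IVC_TAG = 0xE0002                         # proposed in this document
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007F    # Tag Character range


def read_collection_identifier(text: str, pos: int) -> Tuple[str, int]:
    """Read an identifier sequence starting at text[pos].

    Returns (collection name as ASCII, index of the first character after
    the sequence). An empty name means the tag stood alone.
    """
    if pos >= len(text) or ord(text[pos]) != IVC_TAG:
        raise ValueError("no variation collection identifier at this position")
    end = pos + 1
    # Any code point outside the Tag Character range ends the name.
    while end < len(text) and TAG_FIRST <= ord(text[end]) <= TAG_LAST:
        end += 1
    name = "".join(chr(ord(c) - 0xE0000) for c in text[pos + 1:end])
    return name, end


sample = "\U000E0002\U000E004A\U000E0041\U000E0045\U000E0054漢字"
print(read_collection_identifier(sample, 0))
```

Because U+E0002 itself lies outside the Tag Character range, a second identifier that follows immediately also terminates the first name.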

3.3 End tag

A document or a block of documentation that uses a particular Ideograph variation collection in plain text, without using a higher level protocol, can terminate the previously specified variation collection with a plane-14 Ideograph variation collection identifier that is not followed by any variation collection name.

3.4 Higher Level Protocol

The designation of an Ideograph variation collection by the plane-14 Ideograph variation collection identifier is itself a higher level protocol for plain text. In case of conflict, a designation of an ideograph variation collection in another higher level protocol, such as a markup language, supersedes the designation by the plane-14 Ideograph variation collection identifier.

3.5 Default Ignorable

Systems that do not recognize these variation selectors and variation collection identifier sequences can ignore them and use their system default behavior.
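
For display or comparison purposes, a system that does not recognize the mechanism could skip over the marks as sketched below. This is only an illustration with a hypothetical function name; it assumes U+E0002 as proposed in this document, and it is not a recommendation to drop the marks from stored text.

```python
import re

# Either the proposed tag (U+E0002) with its optional Tag Character name,
# or one of VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256.
_IVS_MARKS = re.compile(
    "\U000E0002[\U000E0020-\U000E007F]*"
    "|[\U000E0170-\U000E01EF]"
)


def strip_ivs_marks(text: str) -> str:
    """Return `text` without identifier sequences and ideograph selectors."""
    return _IVS_MARKS.sub("", text)


marked = "\U000E0002\U000E004A葛\U000E0170城\U000E0002"
print(strip_ivs_marks(marked))
```

Other variation selectors, such as U+FE00, are left untouched by this sketch.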

3.6 Protocol

A block of documentation designates a particular Ideograph variation collection in plain text. The block starts with the start tag and ends at EOF, at another start tag, or at an end tag. Any Ideograph variation selector inside this block may be interpreted as referring to the specified variation collection.

Documentation format
---------------------------------
[IVS SET start tag]+[Identifier name]
(any codepoint outside of U+E0020 through U+E007F terminates
 Identifier name)

    text


[EOF] or [Other [IVS SET start tag]+[Identifier name]] or [End tag]
------------------------------------------------------------
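
The protocol above can be sketched as a single pass that tracks which collection is in effect and attaches it to every ideograph variation selector it meets. The function name and the result layout are hypothetical, U+E0002 is the code point proposed in this document, and validation of the base character is deliberately omitted.

```python
from typing import List, Optional, Tuple

IVC_TAG = 0xE0002   # proposed in this document


def scan_variation_sequences(text: str) -> List[Tuple[str, int, Optional[str]]]:
    """Return (base character, selector number, collection name or None)."""
    collection = None          # no collection in effect at the start
    found = []
    i, n = 0, len(text)
    while i < n:
        cp = ord(text[i])
        if cp == IVC_TAG:
            # Start tag with a name, or end tag when the name is empty.
            i += 1
            start = i
            while i < n and 0xE0020 <= ord(text[i]) <= 0xE007F:
                i += 1
            name = "".join(chr(ord(c) - 0xE0000) for c in text[start:i])
            collection = name or None
            continue
        if 0xE0170 <= cp <= 0xE01EF and i > 0:
            # VS-129..VS-256: the preceding character is taken as the base.
            found.append((text[i - 1], cp - 0xE0170 + 129, collection))
        i += 1
    return found


text = ("\U000E0002\U000E0041" + "葛\U000E0170"    # collection "A"
        + "\U000E0002\U000E0042" + "辺\U000E0171"  # switched to "B"
        + "\U000E0002" + "葛\U000E0170")           # end tag, no collection
for item in scan_variation_sequences(text):
    print(item)
```

A selector that appears after an end tag is reported with no collection, leaving its interpretation to a higher level protocol or to the system default.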

4. Other Properties

The other properties for VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 would remain the same:

 general category Mn
 canonical combining class 0 
 bidirectional category NSM 
 no decomposition mapping
 no numeric value
 not mirrored
 no case mapping
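
These unchanged properties can be inspected with Python's standard `unicodedata` module, as in the following sketch. The helper name and the dictionary keys are hypothetical, and the values reported depend on the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata


def property_summary(cp: int) -> dict:
    """Collect the properties listed above for one code point."""
    ch = chr(cp)
    return {
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "bidi_category": unicodedata.bidirectional(ch),
        "decomposition": unicodedata.decomposition(ch),   # "" means none
        "numeric_value": unicodedata.numeric(ch, None),   # None means none
        "mirrored": bool(unicodedata.mirrored(ch)),
        "has_case_mapping": ch.upper() != ch or ch.lower() != ch,
    }


# Inspect the first and last selectors of the proposed range.
for cp in (0xE0170, 0xE01EF):
    print("U+%04X" % cp, property_summary(cp))
```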

5. Base Character

The definition of the base character was one of the biggest problems for the earlier versions of the Ideograph variation selector, because defining a single set of base characters is itself a controversial subject that many researchers study in the academic field. However, it is no longer essential to identify which characters deserve to be base characters. Allowing multiple variation collections, identified by variation collection identifiers, eliminates the dependency on a single definitive set of base characters and variation collections, which would require absolute accuracy and universal validity.

The base character in a variation sequence using VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 must be an ideograph (CJK UNIFIED IDEOGRAPH * and COMPATIBILITY IDEOGRAPH, including those in plane 2). NON-CHARACTER and RESERVED code points in plane 2 are excluded. The properties of the base character must be as follows:

;Lo;0;L;;;;;N;;;;;
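
A base character check along these lines can be sketched with Python's standard `unicodedata` module. The function name is hypothetical, and using the character name prefix to recognize ideographs is an assumption of this sketch (the module does not expose the Ideographic property directly). Results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

_IDEOGRAPH_NAME_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH-",
    "CJK COMPATIBILITY IDEOGRAPH-",
)


def is_valid_ivs_base(ch: str) -> bool:
    """True if `ch` is an assigned ideograph with properties Lo, 0, L."""
    # Unassigned code points and noncharacters have no name.
    name = unicodedata.name(ch, "")
    if not name.startswith(_IDEOGRAPH_NAME_PREFIXES):
        return False
    return (
        unicodedata.category(ch) == "Lo"
        and unicodedata.combining(ch) == 0
        and unicodedata.bidirectional(ch) == "L"
    )


for ch in ("葛", "\U00020000", "\uF900", "A", "\U0002FFFF"):
    print("U+%04X" % ord(ch), is_valid_ivs_base(ch))
```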

6. Registration Authority

There are several possibilities for registration authorities. For example, Unicode Inc. could specify none and leave registration authorities to be managed independently, or Unicode.org could specify one registration authority. However, such detail is not a part of this proposal. The discussion of the registration authority should be deferred until the principles of this proposal are accepted for advancement.

A Registration Authority works as an arbitrator of name space management as well as a central directory for finding variation collections. Those who do not need to disseminate their collections to the public, because use of the collections is limited to the closed membership of their group, may see no need to register a collection name. They may use their collections without registration by using their own URI as the collection name, which is one of the possible naming conventions described in Section 7, Naming Convention of Identifiers.

7. Naming Convention of Identifiers

Variation collection identifiers can take any form as long as they do not conflict with one another, and the registration authority should be able to define the convention. However, such detail is not a part of this proposal. Naming conventions should be discussed once this proposal is accepted and is on the standards track.

Note that there are several possibilities, such as the use of a URI, an assigned number, or a free string.

7.1 URI

Use of a URI is the only possible method that lets parties use their collections safely without registering with a registration authority. This suits parties who do not need to disseminate their collections to the public because use of the collections is limited to the closed membership of their group. The nature of the URI itself technically prevents name space conflicts with other collections even when they are not registered. Parties who wish to disseminate their collections to the public should register the collection name even if they use a URI as the naming convention.

7.2 Assigned Number

The merit of an assigned number is the efficiency of the name. There is no method to manage the name space other than registering the number with a Registration Authority.

7.3 Free String

The merit of a free string is the intuitiveness of the name. There is no method to manage the name space other than registering the string with a Registration Authority.

8. Sample Collections and sample identifiers in three forms

The following are sample collections, each with its variation collection
identifier shown as an assigned number, as a free string, and as a URI.

 Collection 1 Adobe-Japan1-5

  in assigned number:

    U+E0002 U+E0031

  in free string:

    U+E0002 U+E0041 U+E0064 U+E006f U+E0062 U+E0065 U+E002d 
    U+E004a U+E0061 U+E0070 U+E0061 U+E006e U+E0031 U+E002d
    U+E0035 

  in URI:

    U+E0002 [To be filled]

 Collection 2 Adobe-Japan1-6

  in assigned number:

    U+E0002 U+E0032

  in free string:

    U+E0002 U+E0041 U+E0064 U+E006f U+E0062 U+E0065 U+E002d 
    U+E004a U+E0061 U+E0070 U+E0061 U+E006e U+E0031 U+E002d
    U+E0036 

  in URI:

    U+E0002 [To be filled]

 Collection 3 IPSJ-TS0002:2002 (Jyoho shori gakkai - Information
	      Processing Society of Japan TS Character Shapes
	      Identification aka Mojikyo 文字鏡) 

  in assigned number:

    U+E0002 U+E0033

  in free string:

    U+E0002 U+E0049 U+E0050 U+E0053 U+E004a U+E002d U+E0054 
    U+E0053 U+E0030 U+E0030 U+E0030 U+E0032 U+E003a U+E0032 
    U+E0030 U+E0030 U+E0032

  in URI:

    U+E0002 [To be filled]

 Collection 4 JAET (Japan Association for East Asian Text Processing
	      - http://www.jaet.gr.jp Kanji Bunken Jyouhou Syori
	        Kenkyu-kai 漢字文献情報処理研究会 - research society
		for traditional Buddhist texts)

  in assigned number:

    U+E0002 U+E0034

  in free string:

    U+E0002 U+E004a U+E0041 U+E0045 U+E0054

  in URI:

    U+E0002 [To be filled]

 Collection 5 LASDEC-BRRNS (Local Authorities Systems Development Center
	      Basic Residential Registers Network System,
	      地方自治情報センター住民基本台帳ネットワークシステム)

  in assigned number:

    U+E0002 U+E0035

  in free string:

    U+E0002 U+E004c U+E0041 U+E0053 U+E0044 U+E0045 U+E0043 
    U+E002d U+E0042 U+E0052 U+E0052 U+E004e U+E0053 

  in URI:

    U+E0002 [To be filled]

 Collection 6 Tripitaka Koreana Woodblocks (高麗大蔵経木版)

  in assigned number:

    U+E0002 U+E0036

  in free string:

    U+E0002 U+E0054 U+E0072 U+E0069 U+E0070 U+E0069 U+E0074 
    U+E0061 U+E006b U+E0061 

  in URI:

    U+E0002 [To be filled]

Other examples...

 Collection 7 MOJJ-CIAB Ministry of Justice Japan Civil Affairs Bureau
 Collection 8 MOJJ-CRAB Ministry of Justice Japan Criminal Affairs Bureau
 Collection 9 MOJJ-IMAB Ministry of Justice Japan Civil Affairs Bureau
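
The code point listings in this section can be checked mechanically by mapping the Tag Characters back to ASCII. The sketch below uses a hypothetical function name and assumes the `U+XXXXX` notation used in the samples, with U+E0002 as proposed in this document.

```python
def decode_identifier_notation(notation: str) -> str:
    """Turn 'U+E0002 U+E004a ...' into the collection name it spells."""
    codepoints = [int(token[2:], 16) for token in notation.split()]
    if not codepoints or codepoints[0] != 0xE0002:
        raise ValueError("sequence does not start with U+E0002")
    name = []
    for cp in codepoints[1:]:
        if not 0xE0020 <= cp <= 0xE007F:
            raise ValueError("U+%04X is not a Tag Character" % cp)
        name.append(chr(cp - 0xE0000))   # undo the "bump-up"
    return "".join(name)


# Free-string and assigned-number forms of Collection 4.
print(decode_identifier_notation("U+E0002 U+E004a U+E0041 U+E0045 U+E0054"))
print(decode_identifier_notation("U+E0002 U+E0034"))
```

Decoding a listing and comparing the result with the intended name is a quick way to catch a mistyped code point.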

9. Pages from Sample Collection

The following are samples from two widely used collections. They illustrate how different the collections are from each other, and how controversial it would be for all parties, each needing a collection based on its own criteria, to agree on a single collection.

Adobe-Japan1-5 is a commercially very popular collection, widely deployed in the printing and publishing industry in Japan.

Samples from Adobe-Japan1-5 in pdf

The Mojikyo collection is another very popular collection, well known as one of the largest variation collections. It poses the quite controversial issue of how to cross-map between it and Unicode.

Samples from Mojikyo in pdf

10. Acknowledgements and modification history

[2003-8-28] The idea of using the existing variation selectors by changing their properties, instead of creating a separate set of ideograph variation selectors, is a contribution from Rick McGowan and Kenneth Whistler.

[2003-8-29] The idea of using an end tag, instead of leaving the end of an ideograph variation collection unspecified, is a contribution from Rick McGowan.

[2003-8-29] The idea of separating the registration authority discussion from this proposal is a contribution from Mike Ksar.

[2003-9-16] The idea of including the use of a URI as a normative part of this specification, instead of a separate specification, for the case where the registration authority is skipped, is a contribution from Takayuki K Sato.

[2003-10-23] Kenneth Whistler suggested spelling out clearly that the Variation collection identifier using Tag Characters is also a higher level protocol implemented in plain text.

[2003-10-23] Kenneth Whistler suggested splitting this proposal into two, one for the IVS semantics change only and one for the rest, in order to advance this proposal to UAX/UTS/UTR.

Ideograph Variation Selector ad-hoc group is:

Hideki Hiura, OpenI18N.org, Sun Microsystems
Tatsuo Kobayashi, Justsystem
Yasuo Kida, Apple Computer
Eric Muller, Adobe Systems
Ken Lunde, Adobe Systems
Michel Suignard, Microsoft Corp
John Jenkins, Apple Computer
Rick McGowan, Unicode Consortium
Kenneth Whistler, Sybase
Mike Ksar, Microsoft Corp
Richard Cook, UC Berkeley
Tom Bishop, Wenlin Institute, Inc.
Dirk Meyer, Adobe Systems
John Renner, Adobe Systems
Deborah Goldsmith, Apple Computer
Yasuhira Anan, Microsoft Corp
Cathy Wissink, Microsoft Corp
Jim DeLaHunt, Adobe Systems
Lee Collins, Apple Computer
