|Summary:||Proposal to change attribute of Variation Selector code range U+E0170 to U+E01EF and to add Ideograph variation collection identifier in plane 14|
|Authors:||Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins|
|Key Contributors:||Rick McGowan, Kenneth Whistler, Richard Cook, Tom Bishop, Michel Suignard, Takayuki K Sato|
|Latest Version:||0.7 2003-10-31 http://www.openi18n.org/spec/ivs/|
0.6 2003-09-20 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
0.5 2003-09-17 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
0.4 2003-09-01 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida, Eric Muller, Ken Lunde, John Jenkins
0.3 2003-08-28 Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida
0.2 1998-07-29 L2/98-??? Hideki Hiura, Tatsuo Kobayashi, Yasuo Kida
0.1 1997-12-01 L2/97-260 Hideki Hiura, Tatsuo Kobayashi
2003-08-23 L2/03-293 Eric Muller, Ken Lunde
|Feedback:||email@example.com (This address is only available during the open feedback period.)|
|Discussion List:||firstname.lastname@example.org (Only subscribers can post. To subscribe, send an empty email to email@example.com and follow the instructions emailed back.)|
|Status:||This document is a proposed update of a part of Unicode 4.0 Standard, with a proposed addition. This is an unstable document and may not be used as reference material or cited as a normative reference by other specifications.|
Although the concept of a character is clearly distinct from the concept of its glyphic representation, the same text stream often has to be handled in different operational domains, depending on the field of application. Two representative domains are:
Character-based processing domain: search and retrieval, sorting, natural language processing, machine translation.
Appearance-based processing domain: personal names, place names, geographical dictionaries, biographical dictionaries, historical and geographical descriptions.
Each field calls for its own methodology. However, the appearance-based processing domain is often misunderstood as a simple font typeface issue within the character-based processing domain. Its requirements cannot be satisfactorily met by ordinary font typeface settings in higher level protocols, and, worse, there remains a strong demand for appearance-based processing of plain text. The Variation Selectors of the Unicode Standard are a great step toward solving the issues of the appearance-based processing domain. However, because variant forms of Han ideographs are used so diversely in common practice, the requirements of this domain for Han ideographs call for a further enhancement of the Unicode Standard.
To address the problems and requirements above, it is essential to introduce a new mechanism into the Unicode Standard.
If we have to work with a single set of variations, as currently defined in the Unicode 4.0 Standard, then there needs to be a single comprehensive and definitive variation set and base character definition, maintained as a standard, that serves everyone who needs to resolve problems in the appearance-based processing domain. That includes the national bodies that use ideographs and include them in their national standards, as well as all kinds of research and study. On the other hand, a huge comprehensive variation collection does not suit those who need only a smaller, target-specific, but efficient set of variations, because of system cost and effectiveness in both data processing and rendering: too much is as bad as too little. There have been several attempts in the academic field and in national standards to address the issues of the appearance-based processing domain, and those studies show that developing such a comprehensive one-size-fits-all set is too costly in practice, if not impossible. The studies also show that taxonomies of ideograph variations are diverse because of their historical and geographical backgrounds. A taxonomy built on one particular set of historical documents is not necessarily suitable for historical documents written in other geographic regions or in other eras. Requirements vary among the parties that need their own variations, and they often conflict.
For example, a politician may accept donations made under any possible variation of his name, yet try to distinguish criminal charges brought under each variation of his name.
|Variation collection identifier sequence|
This proposal addresses the need to modify the property of a range of variation selectors, VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 (U+E0170 to U+E01EF).
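The correspondence between selector numbers and code points can be checked with a short sketch. It assumes the standard layout of the two variation selector blocks (VARIATION SELECTOR-1 through -16 at U+FE00 to U+FE0F, and VARIATION SELECTOR-17 through -256 at U+E0100 to U+E01EF); the function name `variation_selector` is only illustrative.

```python
def variation_selector(n: int) -> str:
    """Return the character VARIATION SELECTOR-n for 1 <= n <= 256."""
    if 1 <= n <= 16:
        # Variation Selectors block: U+FE00..U+FE0F
        return chr(0xFE00 + (n - 1))
    if 17 <= n <= 256:
        # Variation Selectors Supplement block: U+E0100..U+E01EF
        return chr(0xE0100 + (n - 17))
    raise ValueError(f"no VARIATION SELECTOR-{n}")


first = variation_selector(129)
last = variation_selector(256)
print(f"U+{ord(first):04X} .. U+{ord(last):04X}")  # U+E0170 .. U+E01EF
```

The range affected by this proposal therefore covers the upper 128 code points of the supplementary block.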
The Unicode standard states in Section 15.6, Variation Selectors that:
Only the variation sequences specifically defined in the file StandardizedVariants.txt in the Unicode Character Database are sanctioned for standard use; in all other cases the variation selector cannot change the visual appearance of the preceding base character from what it would have had in the absence of the variation selector.
We propose to modify the Unicode standard so that this clause applies only to VARIATION-SELECTOR-1 through VARIATION-SELECTOR-127.
[Change of semantics]
VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 can also change the visual appearance of the preceding base character, but variation sequences involving those characters are not defined in the file StandardizedVariants.txt in the Unicode Character Database. Instead, the meaning of such a variation sequence is defined by an outside agreement between producers and consumers, called a variation collection. To facilitate this agreement, the protocol described in Section 2.2, Variation collection identifier, can be used to annotate a plain Unicode text document with variation collection data. As discussed in Section 5, the base character of a variation sequence involving VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 must have the Ideographic property. If the consumer of a document using variation sequences involving VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 conforms to the protocol by which the variation collection is identified, it can treat those variation selectors as ignorable characters, that is, not change the visual appearance of the base characters. Such a consumer must nevertheless respect the semantics of the base character, and should preserve the variation selectors in its output.
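As a rough illustration of the base character constraint, the following sketch scans a string for selectors in the range U+E0170 to U+E01EF and reports whether each one follows an ideograph. The helper names are hypothetical. The ideograph test is an approximation based on character names (CJK UNIFIED IDEOGRAPH and CJK COMPATIBILITY IDEOGRAPH), because Python's `unicodedata` module does not expose the Ideographic property directly.

```python
import unicodedata

IVS_FIRST, IVS_LAST = 0xE0170, 0xE01EF  # VARIATION-SELECTOR-129..256


def is_ideograph(ch: str) -> bool:
    """Approximate the base character test by character name (an assumption)."""
    name = unicodedata.name(ch, "")
    return name.startswith(
        ("CJK UNIFIED IDEOGRAPH-", "CJK COMPATIBILITY IDEOGRAPH-")
    )


def find_variation_sequences(text: str):
    """Return (index, base, selector, base_is_valid) for each selector found."""
    found = []
    for i, ch in enumerate(text):
        if IVS_FIRST <= ord(ch) <= IVS_LAST:
            base = text[i - 1] if i > 0 else ""
            found.append((i, base, ch, bool(base) and is_ideograph(base)))
    return found


sample = "\u845B\U000E0170A\U000E0171"
for index, base, selector, ok in find_variation_sequences(sample):
    print(index, repr(base), f"U+{ord(selector):04X}", "valid" if ok else "invalid base")
```

A consumer that treats the selectors as ignorable would still pass them through unchanged in its output; this sketch only inspects the text and never removes anything.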
The variation collection identifier facilitates the outside agreement between producers and consumers, called a variation collection. Unless the collection is specified in a higher level protocol, it is designated by the plane-14 Ideograph variation collection identifier:
U+E0002 IDEOGRAPH VARIATION COLLECTION TAG
followed by the variation collection name spelled out in Tag Characters (also known as bumped-up ASCII, U+E0020 through U+E007F).
A document, or a block of a document, that uses a particular Ideograph variation collection in plain text without a higher level protocol designates the collection by the plane-14 Ideograph variation collection identifier followed by the variation collection name spelled out in Tag Characters.
No termination character is required for a variation collection identifier sequence. Any code point outside of U+E0020 through U+E007F terminates the identifier sequence.
A document, or a block of a document, that uses a particular Ideograph variation collection in plain text without a higher level protocol can terminate the previously designated variation collection with a plane-14 Ideograph variation collection identifier that is not followed by any variation collection name.
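A minimal sketch of how a producer might build these sequences is shown below. U+E0002 is the code point proposed in this document, not an assigned Unicode character, and the function names `designate` and `terminate` are hypothetical. The name is mapped character by character from ASCII to the Tag Character range by adding 0xE0000.

```python
COLLECTION_TAG = "\U000E0002"  # proposed IDEOGRAPH VARIATION COLLECTION TAG
TAG_OFFSET = 0xE0000           # ASCII 0x20..0x7F maps to U+E0020..U+E007F


def designate(name: str) -> str:
    """Return the identifier sequence that designates the named collection."""
    if not name:
        raise ValueError("a designation needs a collection name")
    for c in name:
        if not 0x20 <= ord(c) <= 0x7F:
            raise ValueError(f"{c!r} has no Tag Character equivalent")
    return COLLECTION_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in name)


def terminate() -> str:
    """Return the bare tag, which ends the previously designated collection."""
    return COLLECTION_TAG


sequence = designate("JAET")
print(" ".join(f"U+{ord(c):04X}" for c in sequence))
# U+E0002 U+E004A U+E0041 U+E0045 U+E0054
```

Because any code point outside U+E0020 through U+E007F ends the name, ordinary text can follow the designation immediately.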
The designation of an Ideograph variation collection by the plane-14 Ideograph variation collection identifier is itself a higher level protocol implemented in plain text. In case of conflict, a designation of an ideograph variation collection made in another higher level protocol, such as a markup language, supersedes the designation made by the plane-14 Ideograph variation collection identifier.
Systems that do not recognize these variation selectors and the variation collection identifier sequence can ignore them and use their system default behavior.
A block of a document designates a particular Ideograph variation collection in plain text. The block starts with the start tag and ends at EOF, at the next start tag, or at an end tag. Any text with an Ideograph variation selector inside this block may be interpreted according to the designated variation collection.
Documentation format
---------------------------------
[IVS SET start tag]+[Identifier name]
(any codepoint outside of U+E0020 through U+E007F terminates Identifier name)
text
[EOF] or [Other [IVS SET start tag]+[Identifier name]] or [End tag]
------------------------------------------------------------
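The block structure above can be sketched as a small scanner that splits plain text into runs, each paired with the collection name in effect. This is an illustration under the assumptions of this proposal (U+E0002 as the start tag, a bare U+E0002 as the end tag); the function name `split_by_collection` is hypothetical, and stray Tag Characters that do not follow a start tag are simply kept in the text.

```python
COLLECTION_TAG = 0xE0002                 # proposed start tag (bare = end tag)
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007F   # Tag Characters spelling the name


def split_by_collection(text: str):
    """Return a list of (collection name or None, run of text) pairs."""
    runs = []
    current = None   # collection currently in effect
    buffer = []      # characters of the run being collected
    i, n = 0, len(text)
    while i < n:
        if ord(text[i]) == COLLECTION_TAG:
            if buffer:
                runs.append((current, "".join(buffer)))
                buffer = []
            i += 1
            name = []
            # The name ends at the first code point outside the tag range.
            while i < n and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
                name.append(chr(ord(text[i]) - 0xE0000))
                i += 1
            current = "".join(name) or None  # empty name acts as the end tag
        else:
            buffer.append(text[i])
            i += 1
    if buffer:
        runs.append((current, "".join(buffer)))
    return runs


start = "\U000E0002" + "".join(chr(0xE0000 + ord(c)) for c in "JAET")
document = "a" + start + "\u845B\U000E0170" + "\U000E0002" + "b"
for name, run in split_by_collection(document):
    print(name, [f"U+{ord(c):04X}" for c in run])
```

Reaching the end of the string plays the role of EOF: the last run is emitted with whatever collection is still in effect.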
The other properties for VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 would remain the same:
general category Mn
canonical combining class 0
bidirectional category NSM
no decomposition mapping
no numeric value
not mirrored
no case mapping
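The listed properties can be compared with the Unicode Character Database bundled with a Python interpreter. Note the assumption: `unicodedata` reflects whichever Unicode version the interpreter ships with, not the text of this proposal, so the sketch only confirms that the property values quoted above match that database. The helper name `properties` is illustrative.

```python
import unicodedata

EXPECTED = {
    "general_category": "Mn",
    "combining_class": 0,
    "bidi_category": "NSM",
    "decomposition": "",
    "numeric_value": None,
    "mirrored": False,
    "case_mapped": False,
}


def properties(cp: int) -> dict:
    """Collect the properties listed above for one code point."""
    ch = chr(cp)
    return {
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "bidi_category": unicodedata.bidirectional(ch),
        "decomposition": unicodedata.decomposition(ch),
        "numeric_value": unicodedata.numeric(ch, None),
        "mirrored": bool(unicodedata.mirrored(ch)),
        "case_mapped": ch.upper() != ch or ch.lower() != ch,
    }


# VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256
mismatches = [cp for cp in range(0xE0170, 0xE01F0) if properties(cp) != EXPECTED]
print("code points that differ:", [f"U+{cp:04X}" for cp in mismatches])
```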
The definition of the base character was one of the biggest problems in earlier versions of the Ideograph variation selector proposal, because defining a single set of base characters is itself a controversial subject that many researchers study in the academic field. However, it is no longer essential to decide which characters deserve to be base characters. Allowing multiple variation collections, each identified by a variation collection identifier, removes the dependency on a single definitive set of base characters and variation collections, which would require absolute accuracy and universal validity.
The base character in a variation sequence using VARIATION-SELECTOR-129 through VARIATION-SELECTOR-256 must be an ideograph (CJK UNIFIED IDEOGRAPH * or COMPATIBILITY IDEOGRAPH, including those in Plane 2). NON-CHARACTER and RESERVED code points in Plane 2 are excluded. The property of the base character must be as follows:
There are several possibilities for registration authorities: for example, Unicode Inc. could specify none and leave registration authorities to be managed independently, or Unicode.org could specify a single registration authority. Such detail is not part of this proposal. The discussion of registration authorities should be deferred until the principles of this proposal are accepted for advancement.
A Registration Authority acts as an arbitrator for name space management and as the central directory for finding variation collections. Those who have no need to disseminate their collections to the public, because the collections are used only within a closed group, may see no need to register a collection name. They may use their own collections without registration by using their own URI as the collection name, which is one of the possible naming conventions described in Section 7, Naming Convention of Identifiers.
Variation collection identifiers can take any form as long as they do not conflict, and the registration authority should be able to define the convention. Such detail is not part of this proposal; naming conventions should be discussed once this proposal is accepted and on the standards track.
Note that there are several possibilities, such as the use of a URI, an assigned number, or a free string.
The following are sample variation collection identifiers, each shown as an assigned number, as a free string, and as a URI.
Collection 1 Adobe-Japan1-5
in assigned number: U+E0002 U+E0031
in free string: U+E0002 U+E0041 U+E0064 U+E006f U+E0062 U+E0065 U+E002d U+E004a U+E0061 U+E0070 U+E0061 U+E006e U+E0031 U+E002d U+E0035
in URI: U+E0002 [To be filled]
Collection 2 Adobe-Japan1-6
in assigned number: U+E0002 U+E0032
in free string: U+E0002 U+E0041 U+E0064 U+E006f U+E0062 U+E0065 U+E002d U+E004a U+E0061 U+E0070 U+E0061 U+E006e U+E0031 U+E002d U+E0036
in URI: U+E0002 [To be filled]
Collection 3 IPSJ-TS0002:2002 (Jyoho shori gakkai - Information Processing Society of Japan TS Character Shapes Identification aka Mojikyo 文字鏡)
in assigned number: U+E0002 U+E0033
in free string: U+E0002 U+E0049 U+E0050 U+E0053 U+E004a U+E002d U+E0054 U+E0053 U+E0030 U+E0030 U+E0030 U+E0032 U+E003a U+E0032 U+E0030 U+E0030 U+E0032
in URI: U+E0002 [To be filled]
Collection 4 JAET (Japan Association for East Asian Text Processing - http://www.jaet.gr.jp Kanji Bunken Jyouhou Syori Kenkyu-kai 漢字文献情報処理研究会 - research society for traditional Buddhist text)
in assigned number: U+E0002 U+E0034
in free string: U+E0002 U+E004a U+E0041 U+E0045 U+E0054
in URI: U+E0002 [To be filled]
Collection 5 LASDEC-BRRNS (Local Authorities Systems Development Center Basic Residential Registers Network System, 地方自治情報センター住民基本台帳ネットワークシステム)
in assigned number: U+E0002 U+E0035
in free string: U+E0002 U+E004c U+E0041 U+E0053 U+E0044 U+E0045 U+E0043 U+E002d U+E0042 U+E0052 U+E0052 U+E004e U+E0053
in URI: U+E0002 [To be filled]
Collection 6 Tripitaka Koreana Woodblocks (高麗大蔵経木版)
in assigned number: U+E0002 U+E0036
in free string: U+E0002 U+E0054 U+E0072 U+E0069 U+E0070 U+E0069 U+E0074 U+E0061 U+E006b U+E0061
in URI: U+E0002 [To be filled]
Other examples...
Collection 7 MOJJ-CIAB Ministry of Justice Japan Civil Affairs Bureau
Collection 8 MOJJ-CRAB Ministry of Justice Japan Criminal Affairs Bureau
Collection 9 MOJJ-IMAB Ministry of Justice Japan Civil Affairs Bureau
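Identifier sequences written in the U+XXXX notation used in these samples can be turned back into readable names with a short sketch like the following, which is handy for proofreading such listings. The function name `decode_identifier` is hypothetical, and U+E0002 is the tag code point proposed in this document.

```python
def decode_identifier(notation: str) -> str:
    """Decode 'U+E0002 U+E004a ...' into the collection name it spells."""
    tokens = notation.split()
    if any(not tok.upper().startswith("U+") for tok in tokens):
        raise ValueError("every token must look like U+XXXX")
    code_points = [int(tok[2:], 16) for tok in tokens]
    if not code_points or code_points[0] != 0xE0002:
        raise ValueError("sequence must start with U+E0002")
    for cp in code_points[1:]:
        if not 0xE0020 <= cp <= 0xE007F:
            raise ValueError(f"U+{cp:04X} is not a Tag Character")
    # Subtracting 0xE0000 maps each Tag Character back to ASCII.
    return "".join(chr(cp - 0xE0000) for cp in code_points[1:])


print(decode_identifier("U+E0002 U+E004a U+E0041 U+E0045 U+E0054"))  # JAET
print(decode_identifier("U+E0002 U+E0034"))                          # 4
```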
The following are samples from two widely used collections. They illustrate how different the collections are, and how controversial it would be for all parties, each needing a collection based on its own criteria, to agree on a single collection.
Adobe-Japan1-5 is a commercially very popular collection, widely deployed in the printing and publishing industry in Japan.
The Mojikyo collection is another very popular collection, well known as one of the largest variation collections. It raises the quite controversial issue of how to cross-map it to Unicode.
[2003-8-28] The idea of using the existing variation selectors by changing their property, instead of creating a separate set of ideograph variation selectors, is a contribution from Rick McGowan and Kenneth Whistler.
[2003-8-29] The idea of using an end tag, instead of leaving the end of an ideograph variation collection unspecified, is a contribution from Rick McGowan.
[2003-8-29] The idea of separating the registration authority discussion from this proposal is a contribution from Mike Ksar.
[2003-9-16] The idea of including the use of URIs as a normative part of this specification, rather than in a separate specification, for the case where the registration authority is skipped, is a contribution from Takayuki K Sato.
[2003-10-23] Kenneth Whistler suggested spelling out clearly that the variation collection identifier using Tag Characters is also a higher level protocol implemented in plain text.
[2003-10-23] Kenneth Whistler suggested splitting this proposal in two, one for the IVS semantics change only and one for the rest, in order to advance this proposal to UAX/UTS/UTR.
Hideki Hiura, OpenI18N.org, Sun Microsystems
Tatsuo Kobayashi, Justsystem
Yasuo Kida, Apple Computer
Eric Muller, Adobe Systems
Ken Lunde, Adobe Systems
Michel Suignard, Microsoft Corp
John Jenkins, Apple Computer
Rick McGowan, Unicode Consortium
Kenneth Whistler, Sybase
Mike Ksar, Microsoft Corp
Richard Cook, UC Berkeley
Tom Bishop, Wenlin Institute, Inc.
Dirk Meyer, Adobe Systems
John Renner, Adobe Systems
Deborah Goldsmith, Apple Computer
Yasuhira Anan, Microsoft Corp
Cathy Wissink, Microsoft Corp
Jim DeLaHunt, Adobe Systems
Lee Collins, Apple Computer