L2/01-194
Source: Eric Muller
Date: 05/07/2001 09:20:17 AM
Proposal: Formalizing the Unicode Private Use Area

I'd like to add the following to the UTC meeting agenda: Formalizing the Unicode Private Use Area. I have attached a draft proposal I wrote a while back. Given some recent feedback, let me preface this document a little bit:

I firmly believe in the idea of Unicode being a universal collection of characters, that every character will ultimately be part of that collection, and that the first order of business when encountering a not-yet-registered character is to give it a Unicode semantics (i.e. values of properties) and submit it for inclusion in Unicode.

We also have to recognize that this process is not instantaneous, and there is currently a void in the interim period. When the user community for the characters is small, well-connected and uses a small number of tools (e.g. a group of academics working on a new script), an informal agreement on the meaning of the characters is usually enough.

However, there are cases where the informal process breaks down. For example, the PRC is on track to mandate the support of GB 18030 for products sold in the PRC; this standard includes a (small) number of characters that are not part of Unicode 3.1, and defines mappings to the PUA. There are simply too many players for the informal agreement to occur.

I believe it would be advantageous to have a description of the PUA characters that gives them a complete Unicode semantics. If Unicode had a mechanism similar to the one proposed here, it would have encouraged the PRC to provide Unicode semantics for the not-yet-in-Unicode characters. Products could have mechanisms to extend their built-in Unicode database to those characters, helping in the interim period. As importantly, the proposal for inclusion of the characters in Unicode would be half done. And there would be something to help the processing of legacy documents after the inclusion of the characters.

Eric.
Table of Contents

1. Motivation
2. Terminology
3. Requirements
4. Overall Structure
5. Characters
6. Collections
7. Related work

1. Motivation

The Unicode standard is a constantly evolving character collection, and there may be times when one needs a character that is not yet part of the standard. Unicode recognizes this situation:

    [p23] A contiguous area of codes has been set aside for private use. Characters in this area will never be defined by the Unicode Standard. These codes can be freely used for characters of any purpose, but successful interchange requires an agreement between sender and receiver on their interpretation.

Indeed, a document that uses PUA code points does not have a meaning by itself, just like a document where the encoding is not specified has no meaning by itself.

First and foremost, this note provides a means to build those agreements. The idea is that a document could specify a semantics for the Private Use Area characters it contains, at the same level as Unicode specifies a semantics for the assigned characters (i.e. those that are part of the Unicode repertoire). Just as in Unicode, part of the semantics is formalized and represented in a machine-readable form, and part of it is informal.

2. Terminology

A gaiji character is a character that is not part of the Unicode repertoire and is encoded in the PUA. In this document, there is no intention to restrict gaiji characters to ideographs. Of course, this notion is relative to a particular version of the Unicode standard.

3. Requirements

The design goals are:

- Define a syntax to describe the formal part of the Unicode semantics of characters. By describing gaiji characters in that way, they can become full participants in Unicode processing. For example, one could indicate that a new character COMBINING REVERSE SOLIDUS OVERLAY is in combining class 1, and processors that deal with combinations would do the right thing on this character.
(By the way, this is not an innocent example: this character was accepted for inclusion in June 1999, but will be part of Unicode only after version 3.0; so right now, it's a gaiji.)

- Make that syntax extensible, so that additional properties can be attached. For example, there could be indications for Input Method Editors on how they should let the user input those characters.

- Define a syntax to organize character descriptions in collections and to combine collections. Consider the case where Alice's document uses one collection of private characters, Bob's document uses another one, and Charles creates a document that combines Alice's and Bob's documents. While this example may seem contrived, replace persons by machines and it suddenly looks a lot more real.

- Make that syntax extensible, so that additional properties can be attached. For example, a collection could indicate where an appropriate font could be found.

- Make these descriptions human-legible and easy to process by programs. Practically, this means that descriptions can be built using simple tools such as text editors, yet they can be incorporated in sophisticated document processing systems.

- Allow two collections to overlap (i.e. to assign the same value to one code point) to avoid central administration, and provide a mechanism to reconcile them. What if Alice and Bob both used U+E732 in their collections?

- Support the naming and referencing of character collections, in particular over the Internet. Clearly, there will be collections of gaiji characters that will be used in a number of documents. Repeating all the character descriptions in all the documents would be a logistical nightmare. In addition, it would make it difficult to know if the code value U+E732 represents the same character in two different documents. At least, if both documents reference the same collection (or more precisely, if the code value was assigned by the same subcollection), this guarantee can be given.
- Define mechanisms to incorporate or attach references to character collections to documents.

4. Overall Structure

The goals dealing with extensibility, human readability, and machine processing are easily satisfied by using XML. This document describes a DTD.

Open: Should we go directly to XML Schema instead?

Open: Usual questions about DTDs: what characters should we use in element names (-, _, camelCase)? Elements or attributes?

Open: Use namespaces for the extensions?

5. Characters

The unicode-name element encloses the Unicode name of that character. It is not applicable to gaiji characters.

The name element is used for non-Unicode characters. Exactly one of unicode-name and name must be present.

The unicode-1.0-name element encloses the Unicode 1.0 name of the character, if it exists.

The alternative-names element encloses a set of alternative-name elements, which in turn enclose alternative names for this character.

The code element encloses the Unicode code value of the character, using the U+xxxx syntax.

The char element contains a single character, which is the character itself.

The cross-references element encloses a set of cross-ref elements. Each cross-ref element contains a code element and a name element for the character which is referenced. The cross-ref element has a role attribute which can take the values inequal or other. The default value for that attribute is other.

The compatibility-decomposition element contains the sequence of characters into which the character being described can be compatibly decomposed.

The canonical-decomposition element contains the sequence of characters into which the character being described is canonically decomposed.

case can have the values UPPERCASE, TitleCase or lowercase.

combining-class encloses the combining class (in its numeric form).

directionality encloses the directionality property.

jamo-short-name encloses the Jamo short name property. It can be present only for Unicode conjoining Hangul jamo characters.
general-category encloses the general category property.

numeric-values is present if the character is a number. It encloses the numeric value as recorded in section 4.6 of the Unicode Standard. In addition, the attribute value is the numeric value represented as a decimal number, without ',' to separate the digit groups. The attribute decimal can take the values yes or no.

mirrored is present for those characters that have the mirrored property.

mathematical is present for characters that have the mathematical property.

Open: Look at the other properties on the Unicode cdrom in proplist.txt.

The informative-note element contains an informative note.

Open: What should be the DTD in there? A fragment of docbook? The itsy bitsy dtd?

Finally, these elements are assembled in a character element.

Here are some examples:

    unicode-name:       LATIN CAPITAL LETTER A
    code:               U+0041
    char:               A
    directionality:     LR

    name:               COMBINING REVERSE SOLIDUS OVERLAY
    code:               U+E000
    combining-class:    1

    unicode-name:       DOLLAR SIGN
    alternative-names:  milreis, escudo
    code:               U+0024
    char:               $
    directionality:     LR
    cross-references:   currency sign, U+00A4
    informative-note:   Glyph may have one or two vertical bars.
    informative-note:   other currency symbol characters: 20A0 ₠ - 20AF ₯

6. Collections

Collections are formed by grouping characters and by combining collections.

A collection is well-formed iff:

- No two characters have the same name, where the name of a character is defined as the value of the unicode-name element or the value of the name element, whichever is present.

- No two characters have the same code.

An enumerated-collection is just a set of character elements.

A ref-collection references an external collection (that is, external to the resource in which this reference occurs). It must have a system identifier, a URI, which may be used to retrieve the referenced collection. Relative URIs are relative to the location of the resource within which the ref-collection occurs. In addition, there may be a public identifier. A processor attempting to retrieve the referenced collection may use the public identifier to try to generate an alternative URI. If the processor is unable to do so, it must use the URI specified in the system identifier.
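The well-formedness rule and the treatment of relative system identifiers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: characters are modelled as plain dictionaries keyed by the element names used in this note, the function names `name_of`, `is_well_formed` and `resolve_system_id` are hypothetical, and the `example.org` URIs are placeholders. Relative resolution uses the standard library function `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin


def name_of(character):
    # The name of a character is the value of unicode-name or of name,
    # whichever is present (exactly one of them is expected).
    return character.get("unicode-name") or character.get("name")


def is_well_formed(characters):
    # Well-formed iff no two characters share a name
    # and no two characters share a code.
    names = [name_of(c) for c in characters]
    codes = [c["code"] for c in characters]
    return len(set(names)) == len(names) and len(set(codes)) == len(codes)


def resolve_system_id(containing_resource_uri, system_id):
    # A relative system identifier is resolved against the location of
    # the resource in which the ref-collection occurs.
    return urljoin(containing_resource_uri, system_id)


# Hypothetical sample data, modelled on the examples in this note.
latin_a = {"unicode-name": "LATIN CAPITAL LETTER A", "code": "U+0041"}
overlay = {"name": "COMBINING REVERSE SOLIDUS OVERLAY", "code": "U+E000"}
logo = {"name": "Adobe Logo", "code": "U+E000"}

ok = is_well_formed([latin_a, overlay])
clash = is_well_formed([latin_a, overlay, logo])  # two characters at U+E000
resolved = resolve_system_id("http://example.org/chc/doc.chc", "other.chc")
```

Here `ok` holds because names and codes are all distinct, while `clash` does not hold because two characters claim U+E000. Modelling characters as dictionaries keeps the sketch independent of any particular XML parser.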
A union-collection groups the characters of multiple collections. If the set-wise union of those collections is not well-formed, the conflicting characters of the later collections are removed from the union.

A subsetted-collection removes some of the characters of a base collection. The characters to remove are identified by their code value.

A remapped-collection reassigns new code points to the characters of a base collection.

A simple-map just lists pairs of code points. Characters which are not listed as the source of a pair are mapped to their original code point. No two pairs should map from the same character. The map should not assign two different characters to the same code point.

A shift-map adds an offset (positive or negative) to each code point. By construction, it preserves well-formedness.

Simple maps and shift maps are the only maps. This completes the means of constructing collections.

Here are some examples. A first collection:

    name:               COMBINING REVERSE SOLIDUS OVERLAY
    code:               U+E000
    combining-class:    1

Here is another collection that uses the same PUA code point, but defines it differently:

    name:               Adobe Logo
    code:               U+E000
    combining-class:    1

Let's assume that our first collection is accessible via the URI http://atm.corp.adobe.com/chc/eric.chc and the second is accessible via the URI http://oranda.corp.adobe.com/chc/adobecorp.chc. Just forming the union of those collections will drop one of the two PUA characters (the one in the collection mentioned second).

A collection can be built for documents that need both PUA characters, by remapping the first collection to a different code point before forming the union. In documents that use this collection, the code point U+E000 refers to the Adobe Logo character, and the code point U+E001 refers to the COMBINING REVERSE SOLIDUS OVERLAY character.

7. Related work

The first source of inspiration is the XML world. In an XML document, the element names that are used have no particular meaning by themselves, just like the PUA code points have no meaning. But in the XML world, this is the norm rather than the exception, and mechanisms have been designed to cope with that.
In fact, these were a major source of inspiration: DTDs and XML schemas are similar to character collections, namespaces correspond to the collection bases, and the collection naming and referencing is based on DTD naming and referencing.

The W3C NOTE A Notation for Character Collections for the WWW by Martin Dürst is an XML DTD to describe sets of character code values. The main objective is to be able to answer the question "Is this character code in this collection?". Particular attention is paid to supporting efficient implementation when the set descriptions are resources on the network. While this is useful when the sets are made of standard characters, it is not enough to deal with private use characters, as it does not attach a meaning to them.

The ConScript Unicode Registry by John Cowan and Michael Everson is a registry of Private Use Area uses. The goal of this effort is to have a centralized allocation of the private use area. It does not attempt to record the semantics of the characters.

Copyright (c) 2001 Adobe Systems Inc.
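As a closing illustration of the collection operations in section 6, here is a minimal Python sketch of union-collection, shift-map and simple-map. It is one possible reading of the rules, not the note's reference behaviour: collections are modelled as lists of dictionaries with integer codes, every gaiji uses the `name` key, and the function names `union`, `shift_map` and `simple_map` are hypothetical. The union keeps the first character that claims a name or a code and drops later characters that conflict with it.

```python
def union(*collections):
    # Group the characters of several collections; a character from a
    # later collection is dropped if its name or code is already taken.
    kept, names, codes = [], set(), set()
    for collection in collections:
        for ch in collection:
            if ch["name"] in names or ch["code"] in codes:
                continue
            kept.append(ch)
            names.add(ch["name"])
            codes.add(ch["code"])
    return kept


def shift_map(collection, offset):
    # Add a (positive or negative) offset to every code point.
    return [{**ch, "code": ch["code"] + offset} for ch in collection]


def simple_map(collection, pairs):
    # pairs is a list of (source code, target code); unlisted characters
    # keep their original code point.
    sources = [src for src, _ in pairs]
    if len(set(sources)) != len(sources):
        raise ValueError("two pairs map from the same character")
    mapping = dict(pairs)
    remapped = [{**ch, "code": mapping.get(ch["code"], ch["code"])}
                for ch in collection]
    codes = [ch["code"] for ch in remapped]
    if len(set(codes)) != len(codes):
        raise ValueError("two characters mapped to the same code point")
    return remapped


# The two example collections from section 6.
eric = [{"name": "COMBINING REVERSE SOLIDUS OVERLAY", "code": 0xE000}]
adobe = [{"name": "Adobe Logo", "code": 0xE000}]

plain = union(eric, adobe)                   # the second character is dropped
combined = union(adobe, shift_map(eric, 1))  # both characters survive
by_code = {ch["code"]: ch["name"] for ch in combined}
```

In `combined`, U+E000 is the Adobe Logo and U+E001 is COMBINING REVERSE SOLIDUS OVERLAY, matching the outcome described in section 6. The same result could be obtained with `simple_map(eric, [(0xE000, 0xE001)])` in place of the shift.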