L2/01-194


Source:  Eric Muller <emuller@Adobe.COM>
Date:  05/07/2001 09:20:17 AM

Proposal:  Formalizing the Unicode Private Use Area 

I'd like to add the following to the UTC meeting agenda: Formalizing
the Unicode Private Use Area. I have attached a draft proposal I wrote
a while back. Given some recent feedback, let me preface this document
a little bit:


I firmly believe in the idea of Unicode being a universal collection
of characters, that every character will ultimately be part of that
collection, and that the first order of business when encountering a
not-yet registered character is to give it a Unicode semantics (i.e.
values of properties) and submit it for inclusion in Unicode.

We also have to recognize that this process is not instantaneous, and
there is currently a void in the interim period. When the user
community for the characters is small, well-connected and uses a small
number of tools (e.g. a group of academics working on a new script),
an informal agreement on the meaning of the characters is usually
enough. However, there are cases where the informal process breaks
down. For example, the PRC is on track to mandate the support of GB
18030 for products sold in the PRC; this standard includes a (small)
number of characters that are not part of Unicode 3.1, and defines
mappings to the PUA. There are simply too many players for the
informal agreement to occur.

I believe it would be advantageous to have a description of the PUA
characters that gives them a complete Unicode semantics. If Unicode
had a mechanism similar to the one proposed here, it would have
encouraged the PRC to provide Unicode semantics for the
not-yet-in-Unicode characters. Products could have mechanisms to
extend their built-in Unicode database to those characters, helping in
the interim period. As importantly, the proposal for inclusion of the
characters in Unicode would be half done. And there would be something
to help the processing of legacy documents after the inclusion of the
characters.


Eric.


Table of Contents

  1. Motivation
  2. Terminology
  3. Requirements
  4. Overall Structure
  5. Characters
  6. Collections
  7. Related work

1. Motivation


The Unicode standard is a constantly evolving character collection, and
there may be times when one needs a character that is not yet part of the
standard. Unicode recognizes this situation:

[p23] A contiguous area of codes has been set aside for private use.
Characters in this area will never be defined by the Unicode Standard.
These codes can be freely used for characters of any purpose, but
successful interchange requires an agreement between sender and receiver on
their interpretation.

Indeed, a document that uses PUA code points does not have a meaning by
itself, just like a document where the encoding is not specified has no
meaning by itself.

First and foremost, this note provides a means to build those agreements.
The idea is that a document could specify a semantics for the Private Use
Area characters it contains, at the same level as Unicode specifies a
semantics for the assigned characters (i.e. those that are part of the
Unicode repertoire). Just like Unicode, part of the semantics is formalized
and represented in a machine readable form, and part of it is informal.


2. Terminology


A gaiji character is a character that is not part of the Unicode repertoire
and is encoded in the PUA. In this document, there is no intention to
restrict gaiji characters to ideographs. Of course, this notion is relative
to a particular version of the Unicode standard.


3. Requirements


The design goals are:

   Define a syntax to describe the formal part of the Unicode semantics of
   characters. By describing gaiji characters in that way, they can become
   full participants in Unicode processing. For example, one could indicate
   that a new character COMBINING REVERSE SOLIDUS OVERLAY is in combining
   class 1 and processors that deal with combinations would do the right
   thing on this character. (By the way this is not an innocent example:
   this character was accepted for inclusion in June 1999, but will be part
   of Unicode only after version 3.0; so right now, it's a gaiji.)

   Make that syntax extensible, so that additional properties can be
   attached. For example, there could be indications for Input Method
   Editors on how they should let the user input those characters.

   Define a syntax to organize character descriptions in collections and to
   combine collections. Consider the case where Alice's document uses one
   collection of private characters, and Bob's document uses another one,
   and Charles creates a document that combines Alice's and Bob's documents.
   While this example may seem contrived, replace the persons with machines
   and it suddenly looks a lot more real.

   Make that syntax extensible, so that additional properties can be
   attached. For example, a collection could indicate where an appropriate
   font could be found.

   Make these descriptions human-legible and easy to process by programs.
   Practically this means that descriptions can be built using simple tools
   such as text editors, yet they can be incorporated in sophisticated
   document processing systems.

   Allow two collections to overlap (i.e. to assign different characters to
   the same code point) to avoid central administration, and provide a
   mechanism to reconcile them. What if Alice and Bob both used U+E732 in
   their collections?

   Support the naming and referencing of character collections, in
   particular over the Internet. Clearly, there will be collections of
   gaiji characters that will be used in a number of documents. Repeating
   all the character descriptions in all the documents would be a
   logistical nightmare. In addition, it would make it difficult to know if
   the code value U+E732 represents the same character in two different
   documents. At least, if both documents reference the same collection (or
   more precisely, if the code value was assigned by the same
   subcollection), this guarantee can be given.

   Define mechanisms to incorporate or attach references to character
   collections to documents.


4. Overall Structure


The goals dealing with extensibility, human readability, and machine
processing are easily satisfied by using XML. This document describes a DTD.

                                                                           
Open: Should we go directly to XML Schema instead?

Open: Usual questions about DTDs: what characters should we use in element
names (-, _, camelCase)? Elements or attributes?

Open: Use namespaces for the extensions?
                                                                           
                                                                           
5. Characters


The unicode-name element encloses the Unicode name of that character. It is
not applicable to gaiji characters.

The name element is used for non-Unicode characters.

Exactly one of unicode-name and name must be present.

The unicode-1.0-name element encloses the Unicode 1.0 name of the
character, if it exists.

The alternative-names element encloses a set of alternative-name elements,
which in turn enclose alternative names for this character.

<!ELEMENT unicode-name (#PCDATA)>

<!ELEMENT unicode-1.0-name (#PCDATA)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT alternative-names (alternative-name)*>
<!ELEMENT alternative-name  (#PCDATA)>

The code element encloses the Unicode code value of the character, using
the U+xxxx syntax.

The char element contains a single character, which is the character
itself.

<!ELEMENT code (#PCDATA)>

<!ELEMENT char (#PCDATA)>
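
As a rough illustration of the U+xxxx syntax, the sketch below parses the
text of a code element and tests whether the value lies in a private use
range. The function names parse_code and is_private_use are hypothetical,
and accepting 4 to 6 hexadecimal digits is an assumption; this note does not
fix the number of digits.

```python
import re

# Assumption: "U+" followed by 4 to 6 hexadecimal digits.
_CODE_RE = re.compile(r"U\+([0-9A-Fa-f]{4,6})")

# Private use ranges: the BMP Private Use Area and planes 15 and 16.
_PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def parse_code(text):
    """Convert a string such as 'U+E732' to the integer 0xE732."""
    match = _CODE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a U+xxxx code value: %r" % text)
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError("beyond the Unicode code space: %r" % text)
    return value


def is_private_use(code_point):
    """True if the code point falls in one of the private use ranges."""
    return any(low <= code_point <= high for low, high in _PUA_RANGES)
```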

The cross-references element encloses a set of cross-ref elements. Each
cross-ref element contains a code element and a name element for the
character which is referenced. The cross-ref element has a role attribute
which can take the values inequal or other. The default value for that
attribute is other.

<!ELEMENT cross-references (cross-ref)*>

<!ELEMENT cross-ref (code, name)>
<!ATTLIST cross-ref role (inequal|other) "other">

The compatibility-decomposition element contains a sequence of characters
into which the character being described can be compatibly decomposed.

The canonical-decomposition element contains the characters into which the
character being described is canonically decomposed.

<!ELEMENT compatibility-decomposition (#PCDATA)>

<!ELEMENT canonical-decomposition (#PCDATA)>

case can have the values UPPERCASE, TitleCase or lowercase.

combining-class encloses the combining class (in its numeric form).

directionality encloses the directionality property.

jamo-short-name encloses the Jamo short name property. It can be present
only for Unicode conjoining Hangul jamo characters.

general-category encloses the general category property.

numeric-values is present if the character is a number. It encloses the
numeric value as recorded in section 4.6. In addition, the attribute value
is the numeric value represented as a decimal number, without ',' to
separate the character groups. The attribute decimal can take the values
yes or no.

mirrored is present for those characters that have the mirrored property.

mathematical is present for characters that have the mathematical property.

                                                                           
Open: Look at the other properties on the Unicode cdrom in proplist.txt.
                                                                           
                                                                           
<!ELEMENT case (#PCDATA)>

<!ELEMENT combining-class (#PCDATA)>

<!ELEMENT directionality (#PCDATA)>

<!ELEMENT jamo-short-name (#PCDATA)>

<!ELEMENT general-category (#PCDATA)>

<!ELEMENT numeric-values (#PCDATA)>
<!ATTLIST numeric-values value   CDATA    #IMPLIED
                         decimal (yes|no) #IMPLIED>

<!ELEMENT mirrored EMPTY>

<!ELEMENT mathematical EMPTY>

The informative-note element contains an informative note.

                                                                           
Open: What should be the DTD in there? A fragment of docbook? The itsy bitsy
dtd?
                                                                           
                                                                           
<!ELEMENT informative-note ?>

Finally, these elements are assembled in a character element:

<!ELEMENT character ((unicode-name | name), unicode-1.0-name?,
          alternative-names?, code, char, cross-references?,
          compatibility-decomposition?, canonical-decomposition?,
          case?, combining-class?, directionality?, jamo-short-name?,
          general-category?, numeric-values?, mirrored?, mathematical?,
          informative-note?)>

Here are some examples:
<character>
  <unicode-name>LATIN CAPITAL LETTER A</unicode-name>
  <code>U+0041</code>
  <char>A</char>
  <directionality>LR</directionality>
</character>

<character>
  <name>COMBINING REVERSE SOLIDUS OVERLAY</name>
  <code>U+E000</code>
  <char>&#xe000;</char>
  <combining-class>1</combining-class>
</character>

<character>
  <unicode-name>DOLLAR SIGN</unicode-name>
  <alternative-names>
    <alternative-name>milreis</alternative-name>
    <alternative-name>escudo</alternative-name>
  </alternative-names>
  <code>U+0024</code>
  <char>$</char>
  <cross-references>
    <cross-ref><code>U+00A4</code><name>currency sign</name></cross-ref>
  </cross-references>
  <directionality>LR</directionality>
  <informative-note>Glyph may have one or two vertical bars. Other
  currency symbol characters: 20A0 ₠ - 20AF ₯</informative-note>
</character>
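
To suggest how a product might extend its built-in Unicode database from
such a description, here is a minimal sketch that reads one character
element with Python's standard xml.etree.ElementTree module. The function
read_character and the shape of the returned dictionary are hypothetical;
element names follow the DTD above, and treating an absent combining-class
as 0 is an assumption.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<character>
  <name>COMBINING REVERSE SOLIDUS OVERLAY</name>
  <code>U+E000</code>
  <char>&#xe000;</char>
  <combining-class>1</combining-class>
</character>"""


def read_character(xml_text):
    """Return a dict of the properties found in one character element."""
    element = ET.fromstring(xml_text)
    # Exactly one of unicode-name and name must be present.
    names = [element.findtext(tag) for tag in ("unicode-name", "name")]
    present = [n for n in names if n is not None]
    if len(present) != 1:
        raise ValueError("expected exactly one of unicode-name and name")
    combining = element.findtext("combining-class")
    return {
        "name": present[0],
        # Strip the leading "U+" and read the rest as hexadecimal.
        "code": int(element.findtext("code")[2:], 16),
        "char": element.findtext("char"),
        # Assumption: an absent combining-class is treated as 0.
        "combining_class": int(combining) if combining is not None else 0,
    }


info = read_character(SAMPLE)
```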

6. Collections


Collections are formed by grouping characters and by combining collections.
A collection is well-formed iff:

   No two characters have the same name, where the name of a character is
   defined as the value of the unicode-name element or the value of the name
   element, whichever is present.

   No two characters have the same code.
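
The two conditions above can be checked mechanically. The following is a
minimal sketch, assuming a collection has already been reduced to a list of
(name, code) pairs; the function name is_well_formed and that representation
are hypothetical.

```python
from collections import Counter


def is_well_formed(characters):
    """characters: iterable of (name, code) pairs for one collection."""
    characters = list(characters)
    name_counts = Counter(name for name, _ in characters)
    code_counts = Counter(code for _, code in characters)
    # Well-formed iff no name and no code occurs more than once.
    return (all(count == 1 for count in name_counts.values())
            and all(count == 1 for count in code_counts.values()))
```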

An enumerated-collection is just a set of character elements.

<!ELEMENT enumerated-collection (character)*>

A ref-collection references an external collection (that is, external to the
resource in which this reference occurs). It must have a system identifier,
a URI, which may be used to retrieve the referenced collection. Relative
URIs are relative to the location of the resource within which the
ref-collection occurs. In addition, there may be a public identifier. A
processor attempting to retrieve the referenced collection may use the
public identifier to try to generate an alternative URI. If the processor
is unable to do so, it must use the URI specified in the system identifier.

<!ELEMENT ref-collection EMPTY>
<!ATTLIST ref-collection
                  systemID CDATA #REQUIRED
                  publicID CDATA #IMPLIED>
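
The lookup order described above might be sketched as follows. The function
resolve_ref and the catalog argument (a mapping from public identifiers to
URIs) are hypothetical; this note does not say how an alternative URI is
generated from a public identifier. Relative system identifiers are resolved
with urljoin from Python's standard urllib.parse module.

```python
from urllib.parse import urljoin


def resolve_ref(base_uri, system_id, public_id=None, catalog=None):
    """Return the URI a processor would try for a ref-collection.

    base_uri is the location of the resource containing the reference.
    """
    # Try the public identifier first, if the processor knows it.
    if public_id is not None and catalog and public_id in catalog:
        return catalog[public_id]
    # Otherwise the system identifier must be used; a relative URI is
    # resolved against the location of the containing resource.
    return urljoin(base_uri, system_id)
```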

A union-collection groups the characters of multiple collections. If the
set-wise union of those collections is not well-formed, the conflicting
characters of the later collections are removed from the union.

<!ELEMENT union-collection (%collection;)*>
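
One possible reading of this rule, as a minimal sketch: walk the collections
in order and keep a character only if neither its name nor its code has been
seen. The function name union and the (name, code) pair representation are
hypothetical, and each input collection is assumed to be well-formed.

```python
def union(*collections):
    """Combine lists of (name, code) pairs, earlier collections winning."""
    result = []
    seen_names, seen_codes = set(), set()
    for collection in collections:
        for name, code in collection:
            # A character that would break well-formedness is dropped.
            # Collections are visited in order, so the dropped character
            # always comes from the later collection. A character that
            # is identical in both is kept once, as in a set union.
            if name in seen_names or code in seen_codes:
                continue
            seen_names.add(name)
            seen_codes.add(code)
            result.append((name, code))
    return result
```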

A subsetted-collection removes some of the characters of a base collection.
The characters to remove are identified by their code value.

<!ELEMENT subsetted-collection (%collection;, code*)>
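
Subsetting is a simple filter on code values. A minimal sketch, using a
hypothetical function subset over (name, code) pairs:

```python
def subset(base, codes_to_remove):
    """Return base without the characters whose code is listed."""
    removed = set(codes_to_remove)
    return [(name, code) for name, code in base if code not in removed]
```

Removing a code that does not occur in the base collection is treated here
as harmless; that choice is an assumption.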

A remapped-collection reassigns new code points to the characters of a base
collection.

<!ELEMENT remapped-collection (%collection;, %map;)>

A simple-map just lists pairs of code points. Characters which are not
listed as the source of a pair are mapped to their original code point. No
two pairs should map from the same character. The map should not assign two
different characters to the same code point.

<!ELEMENT simple-map (replace)*>
<!ELEMENT replace EMPTY>
<!ATTLIST replace from CDATA #REQUIRED
                  to   CDATA #REQUIRED>
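
The constraints on a simple-map can be enforced while it is applied. The
sketch below is one way to do so; the function apply_simple_map is
hypothetical, and raising an error on a violation is an assumption, since
this note only says what a map "should" not do.

```python
def apply_simple_map(base, pairs):
    """base: list of (name, code); pairs: list of (from_code, to_code)."""
    mapping = {}
    for source, target in pairs:
        # No two pairs should map from the same character.
        if source in mapping:
            raise ValueError("two pairs map from U+%04X" % source)
        mapping[source] = target
    # Characters not listed as a source keep their original code point.
    remapped = [(name, mapping.get(code, code)) for name, code in base]
    codes = [code for _, code in remapped]
    # The map should not assign two characters to the same code point.
    if len(set(codes)) != len(codes):
        raise ValueError("map assigns two characters to one code point")
    return remapped
```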

A shift-map adds an offset (positive or negative) to each code point. By
construction it preserves well-formedness.

<!ELEMENT shift-map (#PCDATA)>  <!-- really, an integer-->
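
A shift keeps distinct code points distinct, which is why well-formedness
is preserved. A minimal sketch, with a hypothetical function apply_shift_map;
the check that results stay within the Unicode code space is an assumption
not stated in this note.

```python
def apply_shift_map(base, offset):
    """Add offset to the code of every (name, code) pair in base."""
    shifted = [(name, code + offset) for name, code in base]
    for name, code in shifted:
        # Assumption: a shift must not leave the Unicode code space.
        if not 0 <= code <= 0x10FFFF:
            raise ValueError("shift moves %r outside the code space" % name)
    return shifted
```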

These are the only maps:

<!ENTITY % map "(simple-map|shift-map)">

And this completes the means of constructing collections:

<!ENTITY % collection "(enumerated-collection | union-collection |
      ref-collection | subsetted-collection | remapped-collection)">

Here are some examples.
<collection>
  <union-collection>
    <ref-collection
      publicID="-//Unicode Consortium//CHC Unicode v3.0"
      systemID="ftp://ftp.unicode.org/data/chc/v3.0"/>

    <enumerated-collection>
      <character>
        <name>COMBINING REVERSE SOLIDUS OVERLAY</name>
        <code>U+E000</code>
        <char>&#xe000;</char>
        <combining-class>1</combining-class>
      </character>
    </enumerated-collection>
  </union-collection>
</collection>

Here is another collection that uses the same PUA code point, but defines
it differently:
<collection>
  <union-collection>
    <ref-collection
      publicID="-//Unicode Consortium//CHC Unicode v3.0"
      systemID="ftp://ftp.unicode.org/data/chc/v3.0"/>

    <enumerated-collection>
      <character>
        <name>Adobe Logo</name>
        <code>U+E000</code>
        <char>&#xe000;</char>
        <combining-class>1</combining-class>
      </character>
    </enumerated-collection>
  </union-collection>
</collection>

Let's assume that our first collection is accessible via the URI
http://atm.corp.adobe.com/chc/eric.chc and the second is accessible via the
URI http://oranda.corp.adobe.com/chc/adobecorp.chc. Just forming the union
of those collections will drop one of the two PUA characters (the one in
the collection mentioned second). The following collection can be built
for documents that need both PUA characters:
<collection>
  <union-collection>
    <remapped-collection>
      <ref-collection
        systemID="http://atm.corp.adobe.com/chc/eric.chc"/>
      <simple-map>
        <replace from="U+E000" to="U+E001"/>
      </simple-map>
    </remapped-collection>
    <ref-collection
      systemID="http://oranda.corp.adobe.com/chc/adobecorp.chc"/>
  </union-collection>
</collection>

In documents that use this collection, the code point U+E000 refers to the
Adobe Logo character, and the code point U+E001 refers to the COMBINING
REVERSE SOLIDUS OVERLAY character.
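
The effect of the remapping can be traced with a small sketch. Collections
are modelled here as dictionaries from code point to name, the helper
functions remap and union are hypothetical, and the shared reference to the
Unicode 3.0 collection is left out for brevity.

```python
eric = {0xE000: "COMBINING REVERSE SOLIDUS OVERLAY"}
adobe = {0xE000: "Adobe Logo"}


def remap(collection, mapping):
    """Move characters to new code points; unlisted codes stay put."""
    return {mapping.get(code, code): name
            for code, name in collection.items()}


def union(first, second):
    """Union in which conflicting characters of `second` are dropped."""
    merged = dict(first)
    names = set(first.values())
    for code, name in second.items():
        if code not in merged and name not in names:
            merged[code] = name
            names.add(name)
    return merged


# A plain union loses the character of the collection mentioned second.
plain = union(eric, adobe)

# Remapping the first collection before the union keeps both characters.
combined = union(remap(eric, {0xE000: 0xE001}), adobe)
```

Here plain holds only the COMBINING REVERSE SOLIDUS OVERLAY character at
U+E000, while combined holds it at U+E001 and the Adobe Logo at U+E000.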


7. Related work


The first source of inspiration is the XML world. In an XML document, the
element names that are used have no particular meaning by themselves, just
like the PUA code points have no meaning. But in the XML world, this is the
norm rather than the exception and mechanisms have been designed to cope
with that. Those mechanisms carry over directly: DTDs and XML schemas are
similar to character collections, namespaces correspond to the collection
bases, and the collection naming and referencing is based on DTD naming and
referencing.

The W3C NOTE A Notation for Character Collections for the WWW by Martin
Dürst is an XML DTD to describe sets of character code values. The main
objective is to be able to answer the question "Is this character code in
this collection?". Particular attention is paid to support efficient
implementation when the set descriptions are resources on the network.
While this is useful when the sets are made of standard characters, it's
really not enough to deal with private use characters, as it does not
attach a meaning to them.

The ConScript Unicode Registry by John Cowan and Michael Everson is a
registry of Private Use Area uses. The goal of this effort is really to
have a centralized allocation of the private use area. It does not attempt
to record semantics of the characters.

Copyright (c) 2001 Adobe Systems Inc.