L2/09-346

Title: A Framework for Unicode Metaproperty Maintenance
Author: Ken Whistler
Date: October 16, 2009
Action: For consideration by UTC

Background

The UTC has been developing and extending character properties for the Unicode Standard for nearly two decades now. The first machine-readable form of these was UnicodeData-1.1.5.txt, which became available in 1995 in support of implementations of Unicode 1.1, but the development of properties goes back even further to tables that were printed as part of Unicode 1.0.

Over the years, the addition of properties has been a rather ad hoc process, driven more by year-by-year assessment of new implementation needs than by any overall framework for how to proceed in this area.

In this document I want to make the case for stepping up to a more rigorous framework for documenting and maintaining metaproperties for Unicode character properties. I contend that this will help guide the continuing documentation of those properties that we already publish and maintain, but also will help guide the UTC to better decisions when introducing new character properties or when changing the status of any existing properties in the UCD.

History

As the number of properties that the UTC maintains has grown, and as the importance of the meaning and stability of those properties for Unicode implementations has also grown, more and more of an explanatory and documentary framework has also developed around the properties. Some of the historical milestones in this growing framework include:

July 1996: Publication of ReadMe-2.0.14.txt
This was the first attempt at systematic explanation of the content and structure of UnicodeData.txt, and is the founding character properties document that has continuously evolved now for over 13 years.

Sept 1999: Publication of UnicodeCharacterDatabase-3.0.0.html, UnicodeData-3.0.0.html, and NamesList-1.html
These were the first HTML versions of the documentation files, released for Unicode 3.0.
They expanded the scope of documentation to attempt to explain the growing number of data files in the Unicode Character Database.

March 2001: Publication of DerivedProperties-3.1.0.html and PropList-3.1.0.html
Unicode 3.1 was the first version that included derived extracted character property information in separate data files. The DerivedProperties-3.1.0.html file explained those data files and the way they presented character property information. PropList-3.1.0.html extended the information about character properties to try to establish status for the additional character properties maintained in PropList.txt.

March 2002: Publication of PropertyAliases-3.2.0.txt and PropertyValueAliases-3.2.0.txt
These data files, while ostensibly aimed at the problem of providing aliases for character properties for use in regex, also surfaced serious questions about the status and structure of all the existing character properties. These were the first machine-readable lists that attempted to explicitly list all of the properties and which also attempted to explicitly list all of the valid values for each property. In essence, these data files represented the first rigorous attempt at a machine-readable representation of some of the metaproperties of the Unicode character properties.

April 2003: Publication of UCD-4.0.0.html
For Unicode 4.0, the information about character properties separately available in UnicodeCharacterDatabase.html, UnicodeData.html, DerivedProperties.html, and PropList.html was consolidated into a single large documentation file, UCD.html. This consolidation brought together a lot of separate status information about properties, and required some effort at making that information more systematic and consistent for presentation.

July 2004: Publication of UTR #23, "The Unicode Character Property Model"
This UTR by Asmus Freytag attempted to put Unicode character properties in a systematic framework. It discussed the status and types of properties.
As such, it was the first attempt at an explicit model that would include metaproperties for Unicode character properties. Much of the content of this UTR was later incorporated in updates for Section 3.5, Properties, and Chapter 4, Character Properties, in Version 5.0 of the Unicode Standard itself.

March 2005: Publication of Unihan.html for Unicode 4.1
This documentation file was the first step in extracting systematic documentation about the Unihan Database from the comments that were formerly maintained at the top of Unihan.txt.

April 2008: Publication of UAX #38, UAX #42, and UAX #44 for Unicode 5.1
UAX #42, "Unicode Character Database in XML", forced a rigorous analysis of Unicode character properties, so that they could be completely expressed in XML. The publication of UAX #42 and the commitment to release each version of the UCD in a parallel XML version also introduced a series of process steps that now have to be accounted for whenever new character properties are added.
UAX #38, "Unicode Han Database (Unihan)", superseded Unihan.html and updated the documentation for properties maintained in the Unihan Database.
UAX #44 started the process of upgrading the status of the rest of the UCD documentation to become a formal, normative annex of the standard.

Oct 2009: Publication of UAX #44 for Unicode 5.2
UAX #44, "Unicode Character Database", for Version 5.2 completed the superseding of the older documentation file, UCD.html. In the process of updating UAX #44, the information about character properties in the annex was systematically filled out with information that had been scattered about or only implied elsewhere. The annex now comes much closer to being complete documentation of Unicode character properties and their status.

======================================================================

Framework for Metaproperties

What I am proposing is that the UTC decide on a framework for maintaining metaproperties for all Unicode character properties.
Getting there would consist essentially of 3 steps.

First, we need to nail down the exact list of metaproperties that the UTC considers important for proper maintenance of character properties. This is partly a matter of simply collecting together metaproperties that we already define, document, and more or less track explicitly -- although we don't actually call them all "metaproperties". And partly it is a matter of analyzing the way we treat character properties in more detail and then abstracting out metaproperties that reflect that treatment, so that they can be explicitly recognized and maintained, instead of just being seat-of-the-pants judgements we make -- sometimes after the fact -- when deciding to approve new properties or to modify existing properties.

Second, for all of the character properties that the UTC currently recognizes and maintains, we need to specify all of the metaproperty values for each in detail. This is a matter of simply carrying through on the somewhat inchoate and half-completed project to "document" Unicode properties: PropertyAliases.txt now spells out exact long and short names for each property, for example, and UAX #44 explicitly lists property status (normative/informative/contributory), gives default values for properties, and so on.

Third, once the list of metaproperties and a full set of their values are in place, we need to decide exactly how to publish those values in machine-readable form, and what process to put in place to ensure that all decisions to approve new properties or to modify existing properties get properly reflected into the listing of metaproperty information.

This document is my attempt to start the first two steps.

The listing below is my candidate list of metaproperties. I divide it into 4 major parts: Names for properties, Type attributes of properties, Status attributes of properties, and Documentation information about properties. Each *'d item in the list is a candidate metaproperty.
In cases where we already have a related formal definition in Section 3.5, "Properties", in the standard, I have called out the relevant definition number for reference.

Accompanying this document is a spreadsheet that contains my tentative assignment of metaproperty values for all Unicode character properties (except for Provisional Unihan properties). I've also tossed in a few properties that are not formally recognized in PropertyAliases.txt, but which are treated either in the UCD or in the Unicode Standard *as if* they were properties. What we should do with such additional properties is open to interpretation, of course, and is a matter for further discussion.

I would like the UTC to take further steps towards formally recognizing these metaproperties and exploring various alternatives for including this framework (and updated exact values) as a part of Unicode 6.0.

------------------------------------------------------------------------

1. Names

* Name (= Long Name)
* Abbreviated Name (= Short Name)
* Aliases (= Other Aliases)

This information is all tracked in the PropertyAliases.txt file.
Property alias is already defined in D47 in Section 3.5. Currently, Property alias is defined without distinction as any of the labels in PropertyAliases.txt, but it is clear that the Long Name and the Short Name have special status and need to be separately identified and tracked in some way.

------------------------------------------------------------------------

2. Type Information

* Type
This is the major structural category for the property.
Values for type include: Numeric, String, Miscellaneous, Catalog, Enumerated, Binary.
Enumerated is already defined in D27 in Section 3.5.
Boolean is already defined in D29 in Section 3.5 (but is inconsistent with the usage of "Binary" in the categorization of properties in UAX #44).
Numeric is already defined in D30 in Section 3.5.
String-valued is already defined in D31 in Section 3.5 (but IMO requires updating, because it is mixed up with the data type).
Catalog is already defined as D32 in Section 3.5.

* Class
This is the semantic class of the property. That is, it identifies what kind of content the property contains, rather than its structural type.
Values for class include: Attribute, Mapping, Age, Radical/Stroke, Annotation, Name, Order, Block.

* Scope of Use
This refers to the general area of the standard the property is relevant to, including major applicability in algorithms.
Examples of scope of use include: Numbers, Bidi, Casing, Normalization, CJK Ordering, Script & Regex, Display, Segmentation, Shaping, etc.

* Data Type
This refers to the data type that would be returned by an API implementing the property.
Values for data type include: Numeric Literal, Code Point, Code Point Sequence, Version String Literal, Numeric Tuple, String Literal, Enumerated String Literal, Name String Literal, Enumerated Symbol, Boolean, Ternary.
For many data types, particularly the enumerated types, the exact values are given aliases in PropertyValueAliases.txt.

* Data Value Range
This further characterizes the acceptable values for the data type, and would provide information as input to regex validation strings for property values.
Examples for data value ranges would include: ASCII UDH + SPACE, Unicode code space, Positive integers, etc.

* Default Value
A defined default value specified for most properties, and required for any enumerated property.
Default property value is already defined in D25 in Section 3.5.

* Number of Data Values
This is relevant to properties with enumerated data types, including Boolean and Ternary properties.

* Closed
For properties with enumerated data types, the number of possible values can either be closed or not. All Boolean properties are, by definition, closed.
Closed is already defined in D28 in Section 3.5.
* Code Point
A binary attribute of a property: True for code point properties, False for other properties.
Code point property is already defined in D20 in Section 3.5.

------------------------------------------------------------------------

3. Status Information

* Status
This is the main status classification for a property.
Values for status are: Normative, Informative, Contributory, Provisional.
Normative is already defined as D33 in Section 3.5.
Informative is already defined as D35 in Section 3.5.
Contributory is already defined as D35a in Section 3.5.
Provisional is already defined as D36 in Section 3.5.

* Contributory To
For Contributory properties, this identifies the other property whose definition they contribute to.

* Derivation Status
Values for derivation status are: Simple, Derived, Mixed. For Derived and Mixed, the actual derivations are typically listed in the relevant data file as a comment for the property.
Simple is already defined by D45 in Section 3.5.
Derived is already defined by D46 in Section 3.5.
The term "Mixed" refers to a minority of properties that have part of their values defined by explicit lists and part by algorithmic derivation. "Name" is such a property.

* Overridable (= Tailorable)
A binary attribute of a property.
Overridable is already defined as D34 in Section 3.5.

* Stability
Values for stability are: Immutable, Fixed, N/A.
Fixed is already defined by D41 in Section 3.5.
Immutable is already defined by D42 in Section 3.5.

* Obsolete
A binary attribute of a property.

* Deprecated
A binary attribute of a property.
Deprecated is already defined by D44 in Section 3.5.

* Stabilized
A binary attribute of a property.
Stabilized is already defined by D43 in Section 3.5.

For the Obsolete, Deprecated, and Stabilized attributes, an Age value is also required to be tracked for each property when it acquires that attribute.

------------------------------------------------------------------------

4. Documentation Information

* XML Schema
Section number of attribute documentation in UAX #42.

* Data File
The data file in the UCD which defines the property (including the field number, if pertinent).

* Documentation
UAX, book, or other locations for primary documentation of the property.

* Age
Version number of the standard when the property was first defined.
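
As an illustration only, the candidate list above could be carried as one record per property. The following is a minimal Python sketch, not a proposed publication format: the class and field names (Metaproperties, PropType, Status, check, parse_alias_line) are hypothetical, only a subset of the candidate metaproperties is modeled, and the parser assumes that each data line of PropertyAliases.txt has the shape "short name ; long name ; other aliases" with "#" starting a comment. The check function encodes only constraints stated in the list above, and it treats the "Binary" type as equivalent to "Boolean", which is itself an assumption given the inconsistency noted under Type.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class PropType(Enum):
    NUMERIC = "Numeric"
    STRING = "String"
    MISCELLANEOUS = "Miscellaneous"
    CATALOG = "Catalog"
    ENUMERATED = "Enumerated"
    BINARY = "Binary"


class Status(Enum):
    NORMATIVE = "Normative"
    INFORMATIVE = "Informative"
    CONTRIBUTORY = "Contributory"
    PROVISIONAL = "Provisional"


@dataclass
class Metaproperties:
    """Hypothetical record for a subset of the candidate metaproperties."""
    name: str                               # 1. Names: Long Name
    abbreviated_name: str                   # 1. Names: Short Name
    aliases: List[str] = field(default_factory=list)
    type: Optional[PropType] = None         # 2. Type Information
    default_value: Optional[str] = None
    closed: Optional[bool] = None
    status: Optional[Status] = None         # 3. Status Information
    contributory_to: Optional[str] = None
    deprecated: bool = False
    deprecated_age: Optional[str] = None    # version when it became deprecated
    age: Optional[str] = None               # 4. Documentation Information


def parse_alias_line(line: str) -> Optional[Metaproperties]:
    """Parse one line, assuming 'short ; long ; other aliases...' fields."""
    data = line.split("#", 1)[0].strip()    # drop any trailing comment
    if not data:
        return None                         # blank or comment-only line
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError(f"expected short and long names: {line!r}")
    return Metaproperties(name=fields[1], abbreviated_name=fields[0],
                          aliases=[f for f in fields[2:] if f])


def check(mp: Metaproperties) -> List[str]:
    """Return violations of constraints stated in the candidate list."""
    problems = []
    if mp.type is PropType.BINARY and mp.closed is not True:
        problems.append("Binary (Boolean) properties are closed by definition")
    if mp.type is PropType.ENUMERATED and mp.default_value is None:
        problems.append("enumerated properties require a default value")
    if mp.contributory_to and mp.status is not Status.CONTRIBUTORY:
        problems.append("Contributory To applies only to Contributory properties")
    if mp.deprecated and mp.deprecated_age is None:
        problems.append("a Deprecated property must record the Age of deprecation")
    return problems


if __name__ == "__main__":
    ws = parse_alias_line("WSpace ; White_Space ; space")
    ws.type = PropType.BINARY               # illustrative values only,
    ws.status = Status.NORMATIVE            # not taken from the spreadsheet
    print(ws.abbreviated_name, ws.name, ws.aliases)
    print(check(ws))                        # 'closed' is unset, so one problem
```

Keeping the constraints in a separate check function, rather than in the record itself, lets partially filled records exist while values are still being assigned, and lets the same checks run whenever a new property is approved or an existing one is modified.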