L2/10-052

Title: Towards Unicode Metaproperty Definition
Author: Ken Whistler
Date: January 28, 2010
References: L2/09-346, L2/09-347
Action: For consideration by UTC

Background

For the background and history of this topic, please refer to
L2/09-346. For the related spreadsheet of suggested metaproperty
values, refer to L2/09-347.

======================================================================

Framework for Metaproperties

This document updates somewhat the details of the framework for
metaproperties first developed in L2/09-346. I have tried to address
some of the concerns raised during the discussion at the November,
2009 UTC meeting. The description of each of the proposed
metaproperties has been extended, with more examples given, and each
metaproperty is labelled now with an alphanumeric catalog number, to
simplify reference.

I would like the UTC to take further steps towards formally
recognizing these metaproperties and exploring various alternatives
to include this framework (and updated exact values) as a part of
Unicode 6.0.

------------------------------------------------------------------------

Section 1. Names of Properties

These are formal, normative labels used to refer to character
properties.

* Metaproperty N1: Name (= Long Name)

* Metaproperty N2: Abbreviated Name (= Short Name)

* Metaproperty N3: Aliases (= Other Aliases)

This information is all tracked in the PropertyAliases.txt file.

Property alias is already defined in D47 in Section 3.5.

Currently Property alias is defined indifferently as just any of the
labels in PropertyAliases.txt, but it is clear that the Long Name and
the Short Name have special status and need to be separately
identified and tracked in some way. Each property in the UCD now has
a single long name and a single short name. In a few cases they have
the same values.

------------------------------------------------------------------------

Section 2. Types of Properties

These metaproperties refer to how we classify the character
properties.

* Metaproperty T1: Type

This is the major structural category for the property.

Values for type include:

  Numeric, String, Miscellaneous, Catalog, Enumerated, Binary.

We already use this classification extensively in the documentation
of character properties, so I don't see any reason not to recognize
this as a metaproperty.

Enumerated is already defined in D27 in Section 3.5.
Boolean is already defined in D29 in Section 3.5 (but is inconsistent
with the usage of "Binary" in the categorization of properties in
UAX #44).
Numeric is already defined in D30 in Section 3.5.
String-valued is already defined in D31 in Section 3.5 (but IMO
requires updating, because it is mixed up with the data type).
Catalog is already defined as D32 in Section 3.5.

* Metaproperty T2: Class

This is the semantic class of the property. I.e. what kind of content
does it contain, rather than what is its structural type.

Values for class include:

  Attribute, Mapping, Age, Radical/Stroke, Annotation, Name, Order,
  Block

The class of properties is not clearly documented anywhere, but is
implicitly recognized, because properties have to be documented and
enumerated differently, depending on which of these classes they
belong to.

* Metaproperty T3: Scope of Use

This refers to the general area of the standard the property is
relevant to, including major applicability in algorithms.

Examples of scope of use include:

  Numbers, Bidi, Casing, Normalization, CJK Ordering, Script & Regex,
  Display, Segmentation, Shaping, etc.

This metaproperty is not officially recognized, but again is
implicitly used already in grouping properties for documentation. It
also is relevant to specifying which particular properties are needed
for support of one or another of the major Unicode algorithms.

* Metaproperty T4: Data Type

This refers to the formal data type that would be returned by an API
implementing the property.
Values for data type include:

  Numeric Literal, Code Point, Code Point Sequence, Version String
  Literal, Numeric Tuple, String Literal, Enumerated String Literal,
  Name String Literal, Enumerated Symbol, Boolean, Ternary

For many data types, particularly the enumerated types, the exact
values are given aliases in PropertyValueAliases.txt.

For many of the particular values of Data Type, there are other
features of the property that are interesting to list explicitly.
These are not all entirely independent of Data Type -- many can be
deduced directly from the Data Type of a property. But for tracking
properties it makes sense to provide explicit values where feasible.

* Metaproperty T4.1: Data Value Range

For some Data Types, explicit valid ranges for data values can be
specified. Others, such as properties with a String Literal data
type, may not have ranges, but may be limited to use of certain
characters or following certain patterns. Specification of these
ranges provides information that can be used as input to regex
validation strings for property values. This is already done, in
part, in Section 5.9, Validation, of UAX #44.

Examples for data value ranges would include:

  ASCII UDH + SPACE, Unicode code space, Positive integers, etc.

* Metaproperty T4.2: Number of Data Values

For enumerated data types, for any given version of the standard, the
exact number of defined data values can be specified. Of course, for
Boolean and Ternary properties, the exact number of defined data
values is fixed by the data type itself.

* Metaproperty T4.3: Closed

For properties with enumerated data types, the number of possible
values can either be closed or not. All Boolean properties are, by
definition, closed.

Closed is already defined in D28 in Section 3.5.

Closed is itself a binary attribute of properties.
In a way, Closed verges on being a Status metaproperty, rather than a
Type metaproperty, because it is essentially a guarantee that no new
data values will be added to a property in the future. However, since
it is closely related to the number of data values, it probably makes
most sense to document the two together.

* Metaproperty T4.4: Default Value

A defined default value is specified for most properties, and is
required for any enumerated property.

Default property value is already defined in D25 in Section 3.5.

* Metaproperty T4.5: Maximum Value

For a few numeric character properties, it is possible to specify a
maximum value, which can be helpful for optimization of
implementations. The obvious example is Canonical_Combining_Class.

* Metaproperty T5: Code Point

A binary attribute of a property: True for code point properties,
False for other properties.

Code point property is already defined in D20 in Section 3.5.

------------------------------------------------------------------------

Section 3. Status of Properties

* Metaproperty S1: Status

This is the main status classification for a property. It has to do
with how the UTC thinks the property in question articulates with the
definition of the standard, its associated algorithms, and other
properties, and what kind of commitment the UTC has made to ongoing
maintenance of the property.

Current values for status are:

  Normative, Informative, Contributory, Provisional

Normative is already defined as D33 in Section 3.5.
Informative is already defined as D35 in Section 3.5.
Contributory is already defined as D35a in Section 3.5.
Provisional is already defined as D36 in Section 3.5.

* Metaproperty S2: Derivational Status

Values for derivational status are:

  Simple, Derived, Mixed.

For Derived and Mixed, the actual derivations are typically listed in
the relevant data file as a comment for the property.

Simple is already defined by D45 in Section 3.5.
Derived is already defined by D46 in Section 3.5.
The term "Mixed" refers to a minority of properties that have part of
their values defined by explicit lists and part by algorithmic
derivation. "Name" is such a property.

* Metaproperty S2.1: Derivation

For Derived and Mixed properties, the derivation is the actual
statement of the details of the derivation. For the most part this is
now explicitly spelled out in a comment field in the relevant files
documenting derived properties, most notably DerivedCoreProperties.txt
and DerivedNormalizationProps.txt.

* Metaproperty S2.2: Contributory To

For Contributory properties, this simply identifies the other
property whose definition this particular property contributes to.
For example, Other_Alphabetic is a Contributory property, and it is
Contributory To the definition of the Derived property, Alphabetic.

* Metaproperty S2.3: Primary Status

Some Derived properties are considered Primary properties, definitive
of some basic character property. The derivation is provided in the
UCD mainly as an aid to definition of the property and for stability
in maintenance. An example is Lowercase, which is primarily what is
being defined for the use in question. Other_Lowercase, the
Contributory property used to define Lowercase, is merely a
convenient means to itemize exceptions.

On the other hand, some Derived properties are merely convenience
listings of complicated derivations for specialized purposes.
Examples of such Derived properties include Changes_When_Casefolded.

* Metaproperty S3: Overridable (= Tailorable)

This is a binary attribute of a property. It augments the regular
status of a property with the information that the intent is that the
property values may be freely overridden by users to produce certain
effects in implementation.

Overridable is already defined as D34 in Section 3.5.

* Metaproperty S4: Stability

Values for stability are:

  Immutable, Fixed, N/A.

Stability establishes certain rules for what the UTC will not change
about character properties.
Fixed is already defined by D41 in Section 3.5.
Immutable is already defined by D42 in Section 3.5.

* Metaproperty S5: Obsolete

This is a binary attribute of a property. Obsolete status simply
means that for some reason a property still in the UCD has come to be
viewed by the UTC as obsolete for whatever it was originally defined
for. Perhaps it has been replaced by a different property, or is
simply no longer considered important.

Obsolete could in principle be considered another major status value,
but it is cleaner to treat it as a separate binary attribute. It is
possible to have obsolete informative properties, like Hyphen, or
obsolete contributory properties, like Grapheme_Link, or obsolete
provisional properties, like kMandarin, for example.

* Metaproperty S6: Deprecated

This is a binary attribute of a property. This represents a formal
discouragement from use for a property by the UTC.

Deprecated is already defined by D44 in Section 3.5.

* Metaproperty S7: Stabilized

This is a binary attribute of a property. This indicates that a
property in the UCD will no longer be applied to newly introduced
characters, or otherwise be maintained.

Stabilized is already defined by D43 in Section 3.5.

For the Obsolete, Deprecated, and Stabilized attributes, an Age value
is also required to be tracked for each property when it acquires
that attribute.

------------------------------------------------------------------------

Section 4. Documentation Information

These are documentation-related attributes for UCD properties. They
essentially represent the bookkeeping attributes for tracking
information about each property.

* Metaproperty D1: XML Schema Id

Section number for the formal attribute documentation in UAX #42.

* Metaproperty D2: Data File

The data file in the UCD which defines the property (including the
field number, if pertinent). This information is already provided,
for most character properties, in either UAX #44 or UAX #38.

* Metaproperty D3: Documentation Location

UAX, book, or other locations for primary documentation of the
property. This information is also often mentioned in UAX #44, or in
Chapter 4 of the core specification.

* Metaproperty D4: Age

The version number of the standard when the property was first
defined. This is the property analog of the *character* property Age.
Just as each character gets an Age value according to when it was
first assigned in the standard, so it is possible to also define an
Age for each property when it was first defined for the standard.
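
As an informal illustration of how a few of the metaproperties above
might be tracked together, the following is a minimal sketch in
Python. It is not part of the proposal and not a UCD format. The
class name PropertyRecord and its field names are hypothetical. The
allowed value sets are copied from the lists given for T1, S1, and
S2. The short names and the type and status values in the two sample
records are illustrative and should be checked against
PropertyAliases.txt and UAX #44. The rule that Contributory To (S2.2)
is only filled in for Contributory properties follows the wording of
S2.2, but enforcing it in code is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Value sets copied from the metaproperty descriptions above.
TYPES = {"Numeric", "String", "Miscellaneous", "Catalog",
         "Enumerated", "Binary"}                           # T1
STATUSES = {"Normative", "Informative", "Contributory",
            "Provisional"}                                 # S1
DERIVATIONAL_STATUSES = {"Simple", "Derived", "Mixed"}     # S2


@dataclass(frozen=True)
class PropertyRecord:
    """Hypothetical record holding a subset of the metaproperties."""
    long_name: str                         # N1
    short_name: str                        # N2
    prop_type: str                         # T1
    status: str                            # S1
    derivational_status: str = "Simple"    # S2
    contributory_to: Optional[str] = None  # S2.2
    obsolete: bool = False                 # S5

    def __post_init__(self) -> None:
        # Reject values outside the lists given in the text.
        if self.prop_type not in TYPES:
            raise ValueError(f"unknown type: {self.prop_type}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        if self.derivational_status not in DERIVATIONAL_STATUSES:
            raise ValueError(
                f"unknown derivational status: {self.derivational_status}")
        # Assumption of this sketch: S2.2 applies only to
        # Contributory properties.
        if self.contributory_to is not None and self.status != "Contributory":
            raise ValueError("contributory_to requires Contributory status")


# Sample records based on the S2.2 example (values illustrative).
other_alphabetic = PropertyRecord(
    long_name="Other_Alphabetic", short_name="OAlpha",
    prop_type="Binary", status="Contributory",
    contributory_to="Alphabetic")
alphabetic = PropertyRecord(
    long_name="Alphabetic", short_name="Alpha",
    prop_type="Binary", status="Informative",
    derivational_status="Derived")
```

Keeping the binary attributes (such as Obsolete) as separate fields,
rather than folding them into the status value, mirrors the treatment
suggested under S5.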