L2/12-075R
Subject: Overridable Properties
From: Mark Davis
Date: 2012-02-07

We have the following text in the standard:

"Thus the decomposition of Unicode characters is both normative and not overridable; no higher-level protocol may override these values, because to do so would result in non-interoperable results for the normalization of Unicode text. Other normative properties, such as case mapping, are overridable by higher-level protocols, because their intent is to provide a common basis for behavior. Nevertheless, they may require tailoring for particular local cultural conventions or particular implementations.

D34 Overridable property: A normative property whose values may be overridden by conformant higher-level protocols.
• For example, the Canonical_Decomposition property is not overridable. The Uppercase property can be overridden."

However, this is broken. Without the following changes, defining "D34" is pointless, and we'd be better off without it.

We must make it clear precisely which properties are overridable and which are not (the text gives only that one example)! If this is to be useful, it must also be in a machine-readable file. Note that we may need to have more than a binary status {Yes, No, Partial}; some properties like General Category may allow some values to be overridden and some not.

We must make it clear what the implications of being Overridable are. A reader of the standard must be able to clearly understand, for example, which of the following are conformant and which are not.

Examples

A1. Override getter
    // overrides Unicode's (arguably incorrect) GC value for {# & @ % ‰ ‱ * † ‡ ※}
    SymbolOther == getGeneralCategory('@')

A2. Override getter for private use
    // overrides the GC for the Apple Logo (private use)
    SymbolOther == getGeneralCategory('\uF8FF') // overrides Unicode's (bad IMO) definition

A3. Override getter, with alternative "unmodified function"
    getUnicodeGeneralCategory(x) // returns the unmodified Unicode values
    getGeneralCategory(x) // returns the modified Unicode values

A4. Override getter, with alternative "modified function"
    getGeneralCategory(x) // returns the unmodified Unicode values
    getXGeneralCategory(x) // returns the modified Unicode values

B1. Don't override property, but override dependent algorithm results
    y = getPropertyX(x) // returns UCD property value for X(x)
    doUnicodeAlgorithm(string_with_x) // the algorithm behaves as if X(x) == z (some other value), *not* y (UCD value)

B2. Override property, but not dependent algorithm results
    z = getPropertyX(x) // doesn't return UCD property value for X(x), which would be y
    doUnicodeAlgorithm(string_with_x) // the algorithm uses UCD property X, as if X(x) == y (UCD value)

E. Override property, and dependent algorithm results
    z = getPropertyX(x) // doesn't return UCD property value for X(x), which would be y
    doUnicodeAlgorithm(string_with_x) // the algorithm uses property X, as if X(x) == z, *not* y

F. Override property, but not derived property
    LetterOther == getGeneralCategory('?')
    false == isIdentifierStart('?')

etc.

We need to look at these with a variety of particular properties and algorithms in mind. We do have a list of properties in http://unicode.org/reports/tr44/#Property_Index

We don't really have a comprehensive list of algorithms and their conformance clauses, but we should look at:
    BIDI
    isIdentifier
    LineBreak
    WordBreak
    Regex
    ...
http://unicode.org/faq/specifications.html

We should also look at the conformance clauses in the core spec, the UAXes, and the UTSes.
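
To make pattern A3 and the {Yes, No, Partial} status a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a proposal for the actual data file or API: the `Overridable` enum, the `OVERRIDABLE_STATUS` table, the `GC_OVERRIDES` tailoring, and both function names are hypothetical. The status given for General_Category is an assumption; only the Canonical_Decomposition and Uppercase entries come from the D34 example. The unmodified values are taken from Python's standard `unicodedata` module, which reflects whichever UCD version the interpreter ships with.

```python
import unicodedata
from enum import Enum


class Overridable(Enum):
    """Hypothetical three-way status, as floated in the note above."""
    YES = "Yes"
    NO = "No"
    PARTIAL = "Partial"


# Hypothetical status table. Only the first two entries are taken from the
# D34 example text; the General_Category entry is an assumption.
OVERRIDABLE_STATUS = {
    "Canonical_Decomposition": Overridable.NO,
    "Uppercase": Overridable.YES,
    "General_Category": Overridable.PARTIAL,
}

# Hypothetical tailoring in the spirit of A1 and A2: treat '@' and the
# private-use code point U+F8FF as Symbol, Other (So).
GC_OVERRIDES = {"@": "So", "\uF8FF": "So"}


def get_unicode_general_category(ch: str) -> str:
    """A3 'unmodified function': return the UCD General_Category value."""
    return unicodedata.category(ch)


def get_general_category(ch: str) -> str:
    """A3 'modified' getter: return the tailored value if one exists."""
    if OVERRIDABLE_STATUS["General_Category"] is Overridable.NO:
        # A non-overridable property must always report the UCD value.
        return get_unicode_general_category(ch)
    return GC_OVERRIDES.get(ch, get_unicode_general_category(ch))
```

With this sketch, `get_unicode_general_category('@')` reports the UCD value while `get_general_category('@')` reports the tailored one, so a caller can always tell which of the two it is getting. Whether such a pair of getters is conformant is exactly the question the standard would need to answer.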