L2/12-075R

Subject: Overridable Properties
From: Mark Davis
Date: 2012-02-07

We have the following text in the standard:

    "Thus the decomposition of Unicode characters is both normative and not overridable; no higher-level protocol may override these values, because to do so would result in non-interoperable results for the normalization of Unicode text. Other normative properties, such as case mapping, are overridable by higher-level protocols, because their intent is to provide a common basis for behavior. Nevertheless, they may require tailoring for particular local cultural conventions or particular implementations.

    D34 Overridable property: A normative property whose values may be overridden by conformant higher-level protocols.

    • For example, the Canonical_Decomposition property is not overridable. The Uppercase property can be overridden."


However, this is broken. Without the following changes, defining "D34" is pointless, and we'd be better off without it.

    We must make clear precisely which properties are overridable and which are not; at present the text says nothing beyond that one example. If this is to be useful, the information must also be in a machine-readable file.
        Note that we may need more than a binary status, for example {Yes, No, Partial}; some properties, like General Category, may allow some values to be overridden and others not.
    We must make clear what the implications of being overridable are. A reader of the standard must be able to clearly understand, for example, which of the following are conformant and which are not.
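
As a sketch of the first point only (separate from the conformance examples listed under Examples), a machine-readable status file could follow the usual semicolon-delimited UCD layout. The file contents, the field layout, and the names `Overridable` and `parse_overridable` below are all hypothetical. The first two statuses come from the text quoted above; the `Partial` entry is an assumption for illustration, not a proposal.

```python
from enum import Enum


class Overridable(Enum):
    YES = "Yes"
    NO = "No"
    PARTIAL = "Partial"


# Hypothetical file contents; no such data file exists today.
SAMPLE = """\
# Property ; Overridable
Canonical_Decomposition ; No
Uppercase               ; Yes
General_Category        ; Partial  # assumed status, for illustration only
"""


def parse_overridable(text):
    """Parse 'property ; status' lines, ignoring '#' comments and blank lines."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, status = (field.strip() for field in line.split(";"))
        table[name] = Overridable(status)  # raises ValueError on an unknown status
    return table


STATUS = parse_overridable(SAMPLE)
```

A three-valued field like this would still leave open which values of a `Partial` property may be overridden; that would need either further fields or a separate file.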


Examples

A1. Override getter

// overrides Unicode's (arguably incorrect) GC value for {# & @ % ‰ ‱ * † ‡ ※}
SymbolOther == getGeneralCategory('@')

A2. Override getter for private use

// Overrides the GC for the Apple Logo (private use)
SymbolOther == getGeneralCategory('\uF8FF') // overrides Unicode's (bad IMO) definition

A3. Override getter, with alternative "unmodified function"

getUnicodeGeneralCategory(x) // returns the unmodified Unicode values
getGeneralCategory(x) // returns the modified Unicode values

A4. Override getter, with alternative "modified function"

getGeneralCategory(x) // returns the unmodified Unicode values
getXGeneralCategory(x) // returns the modified Unicode values
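
To make the A1 through A4 distinctions concrete, here is a minimal Python sketch of the A3 arrangement. It assumes the standard library's `unicodedata.category` as the source of unmodified values; the override table and the function names (Python-style versions of the names above) are hypothetical.

```python
import unicodedata

# Hypothetical tailoring table (illustrative; not part of any standard).
GC_OVERRIDES = {
    "@": "So",       # A1: treat '@' as Symbol, Other instead of the UCD value
    "\uF8FF": "So",  # A2: give a private-use code point a symbol category
}


def get_unicode_general_category(ch: str) -> str:
    """Return the unmodified General_Category from the platform's UCD."""
    return unicodedata.category(ch)


def get_general_category(ch: str) -> str:
    """Return the tailored value if one is defined, else the UCD value."""
    return GC_OVERRIDES.get(ch, unicodedata.category(ch))
```

Here `get_unicode_general_category("@")` gives `"Po"` while `get_general_category("@")` gives `"So"`. Under A4 the same two behaviors exist, but the plain name is bound to the unmodified function and a distinct name to the tailored one. The question for the standard is whether that naming difference matters for conformance.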


B1. Don't override property, but override dependent algorithm results

y = getPropertyX(x) // returns UCD property value for X(x)
doUnicodeAlgorithm(string_with_x) // the algorithm behaves as if X(x) == z, some value other than y (the UCD value)

B2. Override property, but not dependent algorithm results

z = getPropertyX(x) // doesn't return UCD property value for X(x), which would be y.
doUnicodeAlgorithm(string_with_x) // the algorithm uses UCD property X, as if X(x) == y (UCD value)

E. Override property, and dependent algorithm results

z = getPropertyX(x) // doesn't return UCD property value for X(x), which would be y.
doUnicodeAlgorithm(string_with_x) // the algorithm uses property X as if X(x) == z, *not* y
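
The three cases named in the headings B1, B2, and E differ only in which property lookup the public getter uses and which lookup the algorithm uses internally. The Python sketch below makes that explicit. `count_punctuation` is a toy stand-in for a real Unicode algorithm, and the tailoring table and all names are hypothetical.

```python
import unicodedata


def ucd_category(ch):
    """Unmodified General_Category from the platform's UCD."""
    return unicodedata.category(ch)


TAILORED = {"@": "So"}  # hypothetical override


def tailored_category(ch):
    """Tailored value if one is defined, else the UCD value."""
    return TAILORED.get(ch, ucd_category(ch))


def count_punctuation(text, category):
    """Toy algorithm whose result depends on General_Category."""
    return sum(1 for ch in text if category(ch).startswith("P"))


# Case name -> (public getter, lookup used inside the algorithm)
CASES = {
    "B1": (ucd_category, tailored_category),      # property intact, algorithm tailored
    "B2": (tailored_category, ucd_category),      # property tailored, algorithm intact
    "E": (tailored_category, tailored_category),  # both tailored
}


def run_case(name, ch, text):
    """Return what a caller observes: the getter result and the algorithm result."""
    getter, internal = CASES[name]
    return getter(ch), count_punctuation(text, internal)
```

For the input `"a@b!"`, B1 and B2 each leave the getter and the algorithm disagreeing about `'@'`, which is the observable inconsistency the conformance text would need to permit or forbid.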

F. Override property, but not derived property

LetterOther == getGeneralCategory('?')
false == isIdentifierStart('?')

etc.
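
Case F can be sketched the same way. The Python below assumes a simplified derivation of identifier-start from General_Category alone; the real ID_Start derivation also adds Other_ID_Start and removes Pattern_Syntax and Pattern_White_Space code points. The override table and function names are hypothetical.

```python
import unicodedata

# Simplified: General_Category values that contribute to ID_Start.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

GC_OVERRIDES = {"?": "Lo"}  # hypothetical override, as in case F


def get_general_category(ch):
    """Tailored General_Category."""
    return GC_OVERRIDES.get(ch, unicodedata.category(ch))


def is_identifier_start(ch):
    """Case F: derived from the UCD value, so the override has no effect."""
    return unicodedata.category(ch) in ID_START_CATEGORIES


def is_identifier_start_rederived(ch):
    """Alternative: recompute the derived property from the tailored value."""
    return get_general_category(ch) in ID_START_CATEGORIES
```

Both derivations are easy to implement, so the standard would have to say which one (if either) a conformant implementation may expose under the Unicode property name.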


We need to look at these with a variety of particular properties and algorithms in mind. We do have a list of properties in

http://unicode.org/reports/tr44/#Property_Index

We don't really have a comprehensive list of algorithms and their conformance clauses, but we should look at

BIDI
isIdentifier
LineBreak
WordBreak
Regex
...
http://unicode.org/faq/specifications.html

We should also look at the conformance clauses in the core spec, the UAXes, and the UTSes.