Re: TR #23 Comments, and Normative Properties
I had a number of comments on TR #23 and related issues. Most of these came up during the editorial meeting, but require thought by the UTC. They also relate to conformance issues that I know Cathy/Michael/Sandra are eager to start working on.
I am slightly troubled by some aspects of the interaction of algorithms and properties in this document.
PD10. Normative Property
A Unicode character property whose values are required for conformance to the standard.
I realize we say this in the book, but this is way off the mark. The values are not required for conformance. I don't have to implement all such values to be conformant.
Note: A normative process that depends in a normative and testable way on a property causes the property to be normative. For example, the interpretation of the bidirectional class is precisely defined in [Bidi].
This is also too strong. We may have tests that use informative properties;
that doesn't automagically make them normative.
If an algorithm is normative, it doesn't matter what internal properties are used, as long as it produces the right results. Somehow we have to reflect that. What we could say is that IF an application claims to follow a normative algorithm but does not produce the right results, THEN the application is being bad (well, for lack of a better term, and since some people don't want to say 'non-conformant').
We also do not make sufficiently clear what the difference is between an overridable normative property and an informative property in an algorithm. One stab at the difference is that a program must document that it is overriding a normative property, whereas it doesn't need to document that it is overriding an informative property. But we need more discussion.
The other is that anyone can call a character property of his/her own 'White_Space'. As long as s/he doesn't purport that it represents the Unicode normative property, there should be no conformance implications. We can't trademark all those words!
The draft TR has the following:
- PD14. Stable Property
The statement of this does not capture the intent. According to the strict statement of the definition, if canonical combining classes are stable, then "changes in the assignment of property values produce no changes". That is patently untrue for canonical combining classes. Their relative order must be maintained, which is not captured in the definition.
The TR has:
4.1 Conformance Requirements
In Chapter 3, Conformance, The Unicode Standard [Unicode] states that "A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded character representation." The semantics of a character are established by taking its coded representation, character name and representative glyph in context and are further defined by its normative properties and behavior. Neither character name nor representative glyphs can be relied upon absolutely; a character may have a broader range of use than the most literal interpretation of its character name, and the representative glyph is only indicative of one of a range of typical glyphs representing the same character.
Interestingly, this paragraph (and the book) logically implies that someone can't override normative properties, since they are part of semantics, and TUS requires the interpretation according to the semantics. That is, of course, not what we intend.
The text has:
6.3 Undetermined Property Values
For many archaic scripts (as well as for not yet fully implemented modern ones) essential characteristics of many characters may not be knowable at the time of their publication. In these cases the proper assignments of property values for newly encoded characters cannot be reliably determined at the time the characters are first added to the Unicode Standard, or for a new property, when the property is first added to the Unicode Character Database. In these cases, and where the property is a required property, it might be given a value of 'undetermined', or 'unknown at time of publication'.
Currently no property has been given such values and the conditions under which they would be applied, or in which form, have not yet been defined.
The UTC has taken no such action, so I feel it is very misleading to put this in. Moreover, I think the whole notion of "undetermined values" is not, well, well-determined. They would have to be worked into any algorithms that use them very carefully, so that they do not produce bad results. And it is unclear that they would be of benefit. For example, suppose that we had them in BIDI; what should they be treated like? L? AL? R? N? They have to have some meaning to be something, and it is unclear that any particular value suffices.
It is almost as if it is an orthogonal attribute: that this BIDI value (whatever it is!!) on this character is provisional.
I'll repeat a comment by Ken:
Lack of clarity about the property assignments is, in my mind, a big red warning flag that there are holes or misconceptions in the encoding model for the script. And that usually means before we go ahead and approve the BIZORTIUM GLYPH FLIPPER we had best cool off and examine the whole script yet again.
Here are some thoughts about more precise definitions for some of the properties, especially leading up to foldings, which are an important case. Note that this deals with strings both as a series of code points, and as a series of code units, which reflects practice. These are not fully formed yet, but should be useful for discussion.
An offset into a Unicode string is a number from 0 to n, where n is the length of the string, and indicates a position that is logically between Unicode code units (or at the very front or end in the case of 0 or n respectively).
X[a, b] is the substring of X that includes all code units after offset a and before offset b. For example, if X is "abc", then X[1,2] is "b".
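As a sanity check, this offset/substring convention maps directly onto Python string slicing (a sketch, not part of the TR; the function name substring is mine):

```python
# Offsets fall *between* code units, from 0 to len(s) inclusive.
# X[a, b] as defined above is exactly Python's s[a:b].
def substring(s, a, b):
    # all code units after offset a and before offset b
    return s[a:b]

X = "abc"
print(substring(X, 1, 2))        # "b", matching the example in the text
print(substring(X, 0, len(X)))   # the whole string, "abc"
```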
A string function is a function whose input is a sequence of code units within a substring, e.g. value = f(string, start_offset, end_offset). If the start and end offsets are 0 and length respectively, we write value = f(string).
A text boundary function is a string function whose output is a boolean, and that is only defined where start_offset equals end_offset. The default line, word, grapheme cluster, and sentence boundaries are text boundary functions.
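A toy illustration of the shape of such a function (assuming a naive space-based word boundary, purely for the sake of the signature; real default boundaries are specified by UAX #29):

```python
# A text boundary function: boolean-valued, defined only where
# start_offset == end_offset. This one uses a naive space-based rule,
# NOT the real UAX #29 default word boundaries.
def is_word_boundary(string, start_offset, end_offset):
    if start_offset != end_offset:
        raise ValueError("boundary functions are defined only at a single offset")
    i = start_offset
    if i == 0 or i == len(string):
        return True  # the ends of the text are always boundaries
    return string[i - 1] == " " or string[i] == " "

print(is_word_boundary("ab cd", 3, 3))  # just after the space: a boundary
print(is_word_boundary("ab cd", 1, 1))  # inside "ab": not a boundary
```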
A context-independent string function is a string function that doesn't depend on anything before the start_offset or after the end_offset. In other words, for all strings x, y, z, f(y,0,len(y)) = f(x+y+z, len(x), len(x+y)).
A context-dependent string function is a string function that is not context-independent. In other words, there are some x, y, z such that f(y,0,len(y)) != f(x+y+z, len(x), len(x+y)). Thus the text boundary functions cited above are context dependent, as are bidi reordering, shaping, etc.
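The identity above can be checked mechanically. A sketch (my own example functions): a function that only looks at the slice itself satisfies the identity, while one that peeks outside the offsets does not:

```python
def upper_sub(string, start, end):
    # context-independent: only looks at the slice itself
    return string[start:end].upper()

def char_after_space(string, start, end):
    # context-dependent: peeks at the code unit before start_offset
    return start > 0 and string[start - 1] == " "

x, y, z = "ab ", "cd", " ef"
# the identity f(y,0,len(y)) == f(x+y+z, len(x), len(x)+len(y)) holds...
assert upper_sub(y, 0, len(y)) == upper_sub(x + y + z, len(x), len(x) + len(y))
# ...but fails for the context-dependent function
assert char_after_space(y, 0, len(y)) != char_after_space(x + y + z, len(x), len(x) + len(y))
```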
A code point function is a string function that is defined only where the start_offset and end_offset mark the start and end (respectively) of the same code point. The General Category is a context-independent code point property.
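For instance, the General Category can be written as a code point function (a sketch; note that in Python each element of a str is a whole code point, so the one-code-point check is simply end - start == 1):

```python
import unicodedata

# A code point function: defined only when [start, end) spans exactly
# one code point; returns the General Category of that code point.
def general_category(string, start, end):
    if end - start != 1:
        raise ValueError("must span exactly one code point")
    return unicodedata.category(string[start])

print(general_category("a1", 0, 1))  # "Ll" (lowercase letter)
print(general_category("a1", 1, 2))  # "Nd" (decimal digit)
```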
A character function is a code point function that is defined only on assigned characters. That is, applying the function to a code point whose General Category is Cn will produce an error.
A folding function is an idempotent context-independent string function. Idempotent means that the output of the function is a string, and that repeated applications of the same function produce the same output: f(f(x)) = f(x), for all x.
Every folding establishes a set of equivalence classes that partitions all strings, where x ≡ y if and only if f(x) = f(y). Normalization is an example of a folding.
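Both properties are easy to demonstrate with NFC normalization and case folding (a sketch using Python's unicodedata and str.casefold):

```python
import unicodedata

# Foldings are idempotent: f(f(x)) == f(x); and they induce equivalence
# classes: x is equivalent to y iff f(x) == f(y).
def nfc(x):
    return unicodedata.normalize("NFC", x)

x = "e\u0301"   # "e" followed by combining acute accent
y = "\u00e9"    # precomposed é
assert nfc(nfc(x)) == nfc(x)   # idempotent
assert nfc(x) == nfc(y)        # x and y are in the same equivalence class
# Case folding is another folding:
assert "Straße".casefold() == "STRASSE".casefold()
```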
A count-preserving string function is a string function whose result is a string, where the length of the input in code points is identical to the length of the output in code points. For example, the simple case mappings are count-preserving; the full case mappings are not.
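The full-case-mapping case is easy to see with U+00DF ß (a sketch; Python's str.upper() applies the full case mappings):

```python
# Full case mappings are not count-preserving: ß (one code point)
# uppercases to "SS" (two code points) under the full mapping.
s = "straße"
assert len(s) == 6
assert s.upper() == "STRASSE"
assert len(s.upper()) == 7  # the string grew by one code point
# The *simple* uppercase mapping of ß is ß itself, so a simple
# (count-preserving) mapping would leave the length unchanged.
```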
A length-preserving string function, for a given encoding form, is a string function whose result is a string with len(value) = end_offset - start_offset. Thus such a function neither grows nor shrinks a string. Note that the simple case mappings are not necessarily length-preserving for UTF-8 or UTF-16: we do not guarantee that the result is always of the same length in code units; a 2-byte UTF-8 character might have a 3-byte UTF-8 case variant.
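A concrete instance of that last point (assuming the case pair U+0250/U+2C6F, which I believe is in the UCD): a one-code-point-to-one-code-point mapping that still changes the UTF-8 length.

```python
# U+0250 LATIN SMALL LETTER TURNED A takes 2 bytes in UTF-8,
# but its uppercase U+2C6F takes 3 bytes: count-preserving in
# code points, yet not length-preserving in UTF-8 code units.
lower = "\u0250"
upper = lower.upper()
assert upper == "\u2c6f"
assert len(lower) == len(upper) == 1          # same count of code points
assert len(lower.encode("utf-8")) == 2        # but 2 bytes...
assert len(upper.encode("utf-8")) == 3        # ...versus 3 bytes
```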