L2/03-134

Re: TR #23 Comments, and Normative Properties
From: Mark Davis
Date: 2003-04-29

I had a number of comments on TR #23 and related issues. Most of these came up during the editorial meeting, but require thought by the UTC. They also relate to conformance issues that I know Cathy/Michael/Sandra are eager to start working on.

Interaction of Properties and Algorithms

I am slightly troubled by some aspects of the interaction of algorithms and properties in this document.

PD10. Normative Property
A Unicode character property whose values are required for conformance to the
standard.

I realize we say this in the book, but this is way off the mark. The values are not required for conformance. I don't have to implement all such values to be conformant.

Note: A normative process that depends in a normative and testable way on a property, causes the property to be normative. For example, the interpretation of the bidirectional class is precisely defined in [Bidi].

This is also too strong. We may have tests that use informative properties; that doesn't automagically make them normative.

If an algorithm is normative, it doesn't matter what internal properties are used, as long as it produces the right results. Somehow we have to reflect that. What we could say is that if

  1. those properties are externalized AND
  2. they purport to match Unicode normative properties AND
  3. either the properties are not overridable or the application does not document that it overrides them

THEN the application is being bad (well for lack of a better term and since some people don't want to say 'non-conformant').
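
As a rough sketch only, the three conditions above could be written as a predicate. Everything here is hypothetical: the ExposedProperty record, its field names, and in particular the differs_from_unicode flag, which is my assumption that the rule only matters when the application's values actually deviate from the Unicode ones.

```python
from dataclasses import dataclass


@dataclass
class ExposedProperty:
    """Hypothetical description of one property as an application uses it."""
    externalized: bool          # condition 1: visible outside the implementation
    claims_unicode_match: bool  # condition 2: purports to be the Unicode normative property
    overridable: bool           # condition 3a: may this property be overridden at all?
    override_documented: bool   # condition 3b: does the application document its override?
    differs_from_unicode: bool  # assumption: the rule only bites when values deviate


def is_bad(p: ExposedProperty) -> bool:
    """Return True when all three conditions from the list above hold."""
    if not p.differs_from_unicode:
        return False
    condition_3 = (not p.overridable) or (not p.override_documented)
    return p.externalized and p.claims_unicode_match and condition_3


# A purely internal table never triggers the rule, whatever values it holds.
internal_only = ExposedProperty(False, True, False, False, True)
# An exposed, deviating, undocumented "Unicode" property does.
silent_override = ExposedProperty(True, True, True, False, True)
# The same override, documented, on an overridable property does not.
documented_override = ExposedProperty(True, True, True, True, True)
```

The point of the sketch is only that the test is on what the application externalizes and claims, not on what it uses internally.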

We also do not make sufficiently clear what the difference is between an overridable normative property, and an informative property, in an algorithm. One stab at the difference is that a program must document that it is overriding a normative property, whereas it doesn't need to document that it is overriding an informative property. But we need more discussion.

The other issue is that anyone can define a character property of his/her own and call it White_Space. As long as s/he doesn't purport that this represents the Unicode normative property, it should have no conformance implications. We can't trademark all those words!

Stability

The draft TR has the following:

PD14. Stable Property
A property is stable with respect to a particular algorithm or process, if changes in the assignment of property values produce no changes in the outcome of the process or algorithm.
 
For example, while the absolute values of the canonical combining classes are not guaranteed to be the same between versions of the Unicode Standard, their relative values will be maintained. As a result, they are stable with respect to the Normalization Forms as defined in [Normal].

The statement of this does not capture the intent. According to the strict statement of the definition, if canonical combining classes are stable, then "changes in the assignment of property values produce no changes". That is patently untrue for canonical combining classes. Their relative order must be maintained, which is not captured in the definition.
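
To make the distinction concrete, here is a minimal sketch in Python. It uses the standard unicodedata module for the current combining classes; canonical_reorder is a simplified reordering step written for illustration (not the full algorithm in [Normal]), and shifted_ccc and collapsed_ccc are hypothetical reassignments that appear in no version of the standard.

```python
import unicodedata


def canonical_reorder(text, ccc=unicodedata.combining):
    """Stable-sort each run of non-starters (class != 0) by combining class."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


def shifted_ccc(ch):
    """Hypothetical reassignment: new absolute values, same relative order."""
    c = unicodedata.combining(ch)
    return 0 if c == 0 else 2 * c + 1


def collapsed_ccc(ch):
    """Hypothetical reassignment that loses the relative order."""
    return 0 if unicodedata.combining(ch) == 0 else 1


# a + COMBINING ACUTE ACCENT (class 230) + COMBINING DOT BELOW (class 220)
sample = "a\u0301\u0323"

baseline = canonical_reorder(sample)
same_order = canonical_reorder(sample, shifted_ccc)
lost_order = canonical_reorder(sample, collapsed_ccc)
```

Here same_order equals baseline even though every nonzero value changed, while lost_order does not. So "changes in the assignment of property values" clearly can change the outcome; what the definition needs to say is that only order-preserving changes are permitted.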

Conformance Requirements

The TR has:

4.1 Conformance Requirements

In Chapter 3, Conformance, The Unicode Standard [Unicode] states that "A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded character representation."  The semantics of a character are established by taking its coded representation, character name and representative glyph in context and are further defined by its normative properties and behavior. Neither character name nor representative glyphs can be relied upon absolutely; a character may have a broader range of use than the most literal interpretation of its character name, and the representative glyph is only indicative of one of a range of typical glyphs representing the same character.

Interestingly, this paragraph (and the book) logically implies that someone can't override normative properties, since they are part of semantics, and TUS requires the interpretation according to the semantics. That is, of course, not what we intend.

Undetermined Property Values

The text has:

6.3 Undetermined Property Values

For many archaic scripts (as well as for not yet fully implemented modern ones) essential characteristics of many characters may not be knowable at the time of their publication. In these cases the proper assignments of property values for newly encoded characters cannot be reliably determined at the time the characters are first added to the Unicode Standard, or for a new property, when the property is first added to the Unicode Character Database. In these cases, and where the property is a required property, it might be given a value of 'undetermined', or 'unknown at time of publication'.

Currently no property has been given such values and the conditions under which they would be applied, or in which form, have not yet been defined.

The UTC has taken no such action, so I feel it is very misleading to put this in. Moreover, I think the whole notion of "undetermined values" is not, well, well-determined. They would have to be worked very carefully into any algorithms that use them, so that they do not produce bad results. And it is unclear that they would be of benefit. For example, suppose that we had them in BIDI; what should they be treated like? L? AL? R? N? They have to have some meaning to be something, and it is unclear that any particular value suffices.
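
A small sketch of the problem, under loud assumptions: UNDETERMINED is a made-up value that does not exist in the standard, U+E000 is just a private-use stand-in for a character carrying it, and first_strong_direction is a much-simplified "first strong character" heuristic, not the algorithm in [Bidi]. The neutral case is represented by the class ON.

```python
import unicodedata

UNDETERMINED = "??"  # hypothetical value; no such bidirectional class exists

# Hypothetical override table: pretend U+E000 was published as 'undetermined'.
OVERRIDES = {"\ue000": UNDETERMINED}


def bidi_class(ch):
    """Bidirectional class, with the hypothetical override applied first."""
    return OVERRIDES.get(ch, unicodedata.bidirectional(ch))


def first_strong_direction(text, treat_undetermined_as):
    """Simplified heuristic: direction of the first strong character."""
    for ch in text:
        cls = bidi_class(ch)
        if cls == UNDETERMINED:
            cls = treat_undetermined_as  # the algorithm is forced to pick something
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return "ltr"  # default when no strong character is found


text = "\ue000\u05d0"  # 'undetermined' character followed by HEBREW LETTER ALEF

results = {choice: first_strong_direction(text, choice)
           for choice in ("L", "R", "AL", "ON")}
```

The result for this string depends on the choice: treating the value as L gives "ltr", while R, AL, or a neutral all give "rtl". Whatever the algorithm substitutes is, in effect, the property value.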

It is almost as if it is an orthogonal attribute: that this BIDI value (whatever it is!!) on this character is provisional.

I'll repeat a comment by Ken:

Lack of clarity about the property assignments is, in my mind, a big red warning flag that there are holes or misconceptions in the encoding model for the script. And that usually means before we go ahead and approve the BIZORTIUM GLYPH FLIPPER we had best cool off and examine the whole script yet again.

Mathy Stuff

Here are some thoughts about more precise definitions for some of the properties, especially leading up to foldings, which are an important case. Note that this deals with strings both as a series of code points, and as a series of code units, which reflects practice. These are not fully formed yet, but should be useful for discussion.