L2/12-041R

Title: Overriding Default Properties for PUA -- Core Specification Text Issues

Source: Ken Whistler and Editorial Committee

Date: February 6, 2012

Action: For consideration by the UTC


During text editing and cleanup of the Core Specification for Unicode 6.1,
the editorial committee ran across a potential inconsistency in what the
standard claims about overriding default properties for PUA characters.
I am bringing the details of the text issues to the attention of the UTC, for discussion
and decision.

Current Text for Unicode 6.1

In the current draft of the Core Specification for Unicode 6.1, the relevant
text in Chapter 3, Section 3.5 states (on p. 72):

Default property values are also provided for private-use characters. Because the
interpretation of private-use characters is subject to private agreement between the
parties which exchange them, the default property values for those characters
are overridable by higher-level protocols, to match the agreed-upon semantics
for the characters. See Section 16.5, Private-Use Characters.

[Note that this is new text added to the draft of Unicode 6.1, in an attempt to
clarify the issue of default property values and PUA. So this isn't already
published text from Unicode 6.0.]

The relevant text in the current draft of the Core Specification for Unicode 6.1,
Section 16.5 states (on p. 555):

The General_Category value of private-use characters in the Unicode Standard
is Private_Use (gc=Co). This value is normatively defined and cannot be changed
by private agreement. This means that no private agreement can change which
character codes are reserved for private use. However, many Unicode algorithms
use character properties which are derived by reference to the General_Category
property. Private agreements may override such derivations for private-use
characters, except where overriding is expressly disallowed in the conformance
statement for a specific algorithm. In other words, private agreements may define
which private-use characters should be treated like spaces, digits, letters,
punctuation, and so on, by all parties to those private agreements.

For all properties other than General_Category and the normalization-related
properties, the Unicode Character Database provides default values
for private-use characters. These default property values should be considered
informative...

[That text is also proposed text new for Unicode 6.1. It is not already published text
from Unicode 6.0, which only has a very short statement about default property
values for PUA characters.]
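
As a rough illustration of the distinction the Section 16.5 draft text draws, the
following Python sketch keeps private-use identity fixed by code point range while
letting a private agreement choose how individual private-use characters behave.
The names AGREEMENT, is_private_use, and behaves_as are hypothetical, introduced
only for this example; the agreement table is invented, and nothing here is taken
from the standard's text or from any existing implementation.

```python
import unicodedata

# Private-use ranges: the BMP private-use area plus planes 15 and 16.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)

# Hypothetical private agreement (an assumption for illustration only):
# U+E000 behaves like a decimal digit, U+E001 like an uppercase letter.
AGREEMENT = {0xE000: "Nd", 0xE001: "Lu"}


def is_private_use(ch: str) -> bool:
    """Identity test by code point range; no agreement can change this."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def behaves_as(ch: str, agreement: dict = AGREEMENT) -> str:
    """Return the category-like class that drives behavior for ch.

    Only private-use characters consult the agreement; every other
    character keeps the value from the Unicode Character Database.
    """
    default = unicodedata.category(ch)
    if is_private_use(ch):
        return agreement.get(ord(ch), default)
    return default
```

In this sketch, unicodedata.category still reports Co for U+E000 and the range
test still identifies it as private use, while behaves_as reports the class agreed
upon by the parties. An entry in the agreement for a character outside the
private-use ranges is simply ignored.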

Text Issues

The basic text issue raised in the editorial committee is that the draft text in
Section 3.5 is not entirely consistent with the more extensive statement in
Section 16.5. The issue is for General_Category, in particular.

A secondary issue was also raised about the text in Section 16.5: The claim
is that the sentence, "This value is normatively defined and cannot be changed
by private agreement." confuses normativity with overridability. Personally, I
disagree with that assessment, but I can see how the sentence might be read that
way, and so concur that it could use an editorial improvement.

Because the basic text issue concerns the conformance text of Chapter 3, because
the UTC has, in principle, already reviewed and agreed upon the text of Chapter 3
for Unicode 6.1, and because the issue is inherently a little tricky, the editorial
committee deemed it advisable to bring the issue to the attention of the UTC
for discussion and resolution.

Suggested Text Changes

To make the discussion a little easier, I'll provide textual emendation suggestions
here, which I think may address the problems noted.

First, for Section 3.5, the main issue is that General_Category default values
are not all overridable by private agreement, and there are important
caveats spelled out in more detail in Section 16.5. My suggested emendation
of the text, then, would be:

Default property values are also provided for private-use characters. Because the
interpretation of private-use characters is subject to private agreement between the
parties which exchange them, most default property values for those characters
are overridable by higher-level protocols, to match the agreed-upon semantics
for the characters. There are important exceptions for a few properties.
See Section 16.5, Private-Use Characters.

Then a textual correction for the problematical current text in Section 16.5 could be:

No private agreement can change which character codes are reserved for
private use. However, many Unicode algorithms use the General_Category
property or properties which are derived by reference to the General_Category
property. Private agreements may override the General_Category or
derivations based on it, except where overriding is expressly disallowed in
the conformance statement for a specific algorithm. In other words,
private agreements may define which private-use characters should be
treated like spaces, digits, letters, punctuation, and so on, by all
parties to those private agreements. In particular, when a private agreement
overrides the General_Category of a private-use character from the default
value of gc=Co to some other value such as gc=Lu or gc=Nd, such a change
does not change its inherent identity as a private-use character, but
merely specifies its intended behavior according to the private agreement.

For all other properties the Unicode Character Database also provides default values
for private-use characters. Except for normalization-related properties
these default property values should be considered informative...

Questions

Do those minimal text changes correctly reflect the intention of the UTC regarding this issue
of overriding default property values for PUA characters?

If so, should the editorial committee proceed with those text changes for the Unicode 6.1
Core Specification text?

Are there other textual suggestions for how to solve these issues in a different way?