L2/05-197

 
Re: Suggested changes in the Character Property Model (#23)
From: Mark Davis
Date: 2007.07.28

I suggest the following changes in the Character Property Model (http://www.unicode.org/reports/tr23/).

1. To make the definitions more accessible, supply an example for each one. This can be very short, just a sentence.

2. Make the following additional changes. Some of these are corrections, some clarifications, and some additional definitions that are useful in discussing Unicode properties and implementations.

 

Current Suggested addition
PD7. Single Valued (Boolean) Property
A closed enumerated property whose set of values is limited to 'true' and 'false'.
For a given boolean property P, the phrase "the P code points" denotes the set of all code points whose property value for P is 'true'. For example, the Pattern_Whitespace code points are those with the Pattern_Whitespace property value 'true'. Similarly, for a given property P and value V, the phrase "the P V code points" denotes the set of all code points whose property value for P is V. For example, the Line Break Alphabetic code points are all code points for which the Line Break property value is Alphabetic.
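The "P V code points" phrasing can be illustrated with a short Python sketch. Python's standard unicodedata module does not expose Pattern_Whitespace or Line_Break, so this example assumes General_Category and Bidi_Class as stand-ins; the helper name code_points_with is hypothetical.

```python
import sys
import unicodedata

def code_points_with(prop, value):
    """Return the set of code points cp for which prop(chr(cp)) == value."""
    return {cp for cp in range(sys.maxunicode + 1) if prop(chr(cp)) == value}

# "The General_Category Lu code points": every code point whose
# General_Category value is Lu (stand-in property, see lead-in).
gc_lu = code_points_with(unicodedata.category, "Lu")

# "The Bidi_Class AL code points": every code point whose
# Bidi_Class value is AL (Arabic_Letter).
bidi_al = code_points_with(unicodedata.bidirectional, "AL")
```

The resulting sets depend on the version of the Unicode Character Database bundled with the Python interpreter in use.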

 

Current Suggested replacement
PD16. Context-independent Property
A property that applies to a code point in isolation.
 
PD17. Context-dependent Property
A property that applies to a code point in the context of a longer code point sequence. 

See also PD33: Context-dependent String Function.
PD16. Context-Dependent Property
A property whose value may vary depending on the context surrounding the character.

For example, the character property Final_Sigma used in Table 3-13 depends on characters before and after the character in question.

PD17. Context-Independent Property
A property that is not context-dependent.
 
For example, the general category code point property does not depend on surrounding code points.
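The contrast between the two definitions can be observed in a small Python sketch. It assumes that the interpreter's str.lower() applies the final-sigma rule (CPython 3 does); that is an observation about one implementation, not part of the proposed definitions.

```python
import unicodedata

SIGMA = "\u03a3"                    # GREEK CAPITAL LETTER SIGMA
word = "\u039f\u0394\u039f\u03a3"   # a word ending in capital sigma

# Context-dependent: the lowercase form of sigma depends on its neighbors.
lower_in_word = word.lower()[-1]    # sigma preceded by cased letters, at word end
lower_alone = SIGMA.lower()         # sigma with no surrounding context

# Context-independent: General_Category is the same wherever sigma occurs.
gc_in_word = unicodedata.category(word[-1])
gc_alone = unicodedata.category(SIGMA)
```

Here lower_in_word is U+03C2 (final sigma) while lower_alone is U+03C3, whereas both General_Category lookups return the same value.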
 
Current Suggested addition
PD21. Immutable Property
A fixed property that is also subject to a stability guarantee preventing any change in the published listing of property values.

An immutable property is trivially stable with respect to all context-free algorithms. Example of immutable properties are the code point and Unicode character name.
[Add note:]

Note: An immutable character property is different from an immutable code point property. For example, Pattern_Syntax is an immutable code point property: it encompasses a fixed set of code points that will never change. However, it is not an immutable character property: an unassigned code point in that set may be allocated as a character in the future.


 

Current Suggested addition
PD27. Property Value Alias
A unique identifier for a particular enumerated value for a particular property. [D10a]

It is only unique within the set of values for that property, not across properties. Thus AL is the Alphabetic value for the Line Break property, but is the Arabic Letter value for the Bidi Class property.
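A minimal sketch of why an alias has to be resolved together with its property. The two-entry table is a hand-written excerpt for illustration, and the names VALUE_ALIASES and resolve are hypothetical.

```python
# Aliases are keyed by (property, alias), because the alias alone is ambiguous.
VALUE_ALIASES = {
    ("Line_Break", "AL"): "Alphabetic",
    ("Bidi_Class", "AL"): "Arabic_Letter",
}

def resolve(prop, alias):
    """Return the long value name for an alias within one property."""
    return VALUE_ALIASES[(prop, alias)]
```

A lookup keyed on "AL" alone could not distinguish the two values, which is the point of the clarification above.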

 

Current Suggested addition

3.7 Classification of String Functions

  • Properties of strings and string functions extend to code points. For example, a code point is Inert with respect to a transform if and only if the string containing that code point is Inert with respect to that transform.

    Current Suggested replacement [plus reordering, as above]
    PD34. Context-dependent String Function
    Given a string S, and offsets a and b, a context-dependent string function is any string function F for which F(S,a,b) depends on the content of S before a and after b.
    PD34. Context-independent String Function
    A context-independent string function is a string function that is not context-dependent.
     
    Current Suggested replacement
    PD35. Idempotent String Function
    A string function F whose output F(S) is itself a string, with the property that repeated applications of the same function F produce the same output: F(F(S)) = F(S) for all input strings S.
     
    PD36. Folding Function
    A folding function is an idempotent string function that establishes a set of equivalence classes that partitions all strings, where X ≡ Y if and only if F(X) = F(Y). For each equivalence class, a folding defines a target member. Applying the folding replaces the input by the target member.
     
    A well known example of a folding function is case folding. For case folding, the equivalence class consists of all case variations, including upper, lower, title case and mixed case. The target member is often chosen to be the lower case.
     
    Folding functions may be context dependent. Normalization is an example of a context dependent folding.
    PD35a. String Transform
    A string-valued string function
     
    PD35b. Idempotent String Transform
    A string transform F, with the property that repeated applications of the same transform F produce the same output: F(F(S)) = F(S) for all input strings S. Such a string transform is also called a folding.
     
    A folding establishes an equivalence relation, whereby X ≡ Y if and only if F(X) = F(Y). This equivalence relation partitions the set of all strings into the set of equivalence classes for the relation.
    Conversely, any partition of strings can be used to generate a folding, by choosing one element of each partition to be the "target member" that the members of that partition map to. For examples, see PDx4 Closed.
     
    It is common to use the syntax toX(s) for the folding, and isX(s) for the corresponding binary function, defined such that isX(s) if and only if toX(s) = s. For example, toNFC() is the folding that converts to NFC format, while isNFC() is the test for whether a string is in that format.
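The toX / isX pairing can be sketched in Python using the standard unicodedata.normalize function. The names to_nfc and is_nfc are illustrative spellings of toNFC() and isNFC(), not an existing API.

```python
import unicodedata

def to_nfc(s):
    """The folding: map any string to its NFC form."""
    return unicodedata.normalize("NFC", s)

def is_nfc(s):
    """The corresponding binary function: isX(s) iff toX(s) == s."""
    return to_nfc(s) == s

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
composed = to_nfc(decomposed)

# Idempotence: applying the folding a second time changes nothing.
idempotent = to_nfc(composed) == composed
```

Defining is_nfc in terms of to_nfc mirrors the definition directly; a production implementation would normally use a faster quick-check.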

     

    Suggested additions (not necessarily in this order)
    PDx0. Preservation
    A transform T preserves a property P when for all strings S, P(S) if and only if P(T(S)). A transform T preserves a relation R when for all strings S1 and S2, R(S1, S2) = R(T(S1), T(S2)).
     
    For example, concatenation does not preserve normalization form, nor collation order. However, the substring operation does preserve normalization: if S is normalized, then S[x,y] is normalized.
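As a small illustration of the concatenation case, the following Python sketch builds two strings that are each in NFC but whose concatenation is not. The helper is_nfc is a hypothetical name defined here for the example.

```python
import unicodedata

def is_nfc(s):
    """True when s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s

a = "e"          # LATIN SMALL LETTER E
b = "\u0301"     # COMBINING ACUTE ACCENT on its own

both_nfc = is_nfc(a) and is_nfc(b)   # each operand is normalized
concat_nfc = is_nfc(a + b)           # but a + b composes to U+00E9
```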

    Under certain conditions, strings and boundaries are "inert" with respect to a given transform. This property can often be used in optimizing code, by skipping over characters or detecting conditions where fast paths can be taken in code.

    PDx1. Inert String
    A string S is inert with respect to a string transform T when the string is always unchanged by the transform, and never affects the results for the surrounding characters. More formally, S is inert w.r.t. T when for all strings x and y, T(x) + T(S) + T(y) = T(x + S + y).

    Examples: with respect to NFD, the character 'a' is inert. The <combining diaeresis> is not, since

    toNFD(<combining diaeresis> + <combining cedilla>) ≠ toNFD(<combining diaeresis>) + toNFD(<combining cedilla>).

    Implementations can often use tests for inert characters in optimizing.
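The formal condition can be spot-checked in Python. The function below tests the equation only for a supplied list of sample contexts, so a True result is evidence rather than proof of inertness, while a False result is a genuine counterexample. The names looks_inert and contexts are hypothetical.

```python
import unicodedata

def to_nfd(s):
    return unicodedata.normalize("NFD", s)

def looks_inert(s, contexts):
    """Check T(x) + T(s) + T(y) == T(x + s + y) for each sample (x, y)."""
    return all(
        to_nfd(x) + to_nfd(s) + to_nfd(y) == to_nfd(x + s + y)
        for x, y in contexts
    )

DIAERESIS = "\u0308"   # COMBINING DIAERESIS, combining class 230
CEDILLA = "\u0327"     # COMBINING CEDILLA, combining class 202

# A few sample contexts; a real test would need far broader coverage.
contexts = [
    ("", ""),
    ("e\u0301", DIAERESIS + CEDILLA),
    ("", CEDILLA),
    (DIAERESIS, ""),
]

a_result = looks_inert("a", contexts)
diaeresis_result = looks_inert(DIAERESIS, contexts)
```

The diaeresis fails because canonical reordering moves a following cedilla in front of it, which cannot happen when the two are normalized separately.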

    PDx2. Inert Boundary

    A text boundary property P is inert with respect to a string transform T when the boundary is unchanged by the transform. More formally, P is inert w.r.t. T when for all strings x and y such that T(x) + T(y) = T(x + y), P(x + y, length(x)) = P(T(x + y), length(T(x))).
     
    For example, grapheme cluster boundaries are inert with respect to all of the normalization forms. Line break boundaries, however, are not.
     
     
    PDx3. Final-Inert String
    A string S is final-inert with respect to a string transform T when the string is always unchanged by the transform, and never affects the results for following characters. More formally, S is final-inert w.r.t. T when for all strings x and y, T(x + S) + T(y) = T(x + S + y).
    PDx1. Initial-Inert String
    A string S is initial-inert with respect to a string transform T when the string is always unchanged by the transform, and never affects the results for preceding characters. More formally, S is initial-inert w.r.t. T when for all strings x and y, T(x) + T(S + y) = T(x + S + y).

    For example, these properties can be used for an optimized normalization concatenation. Normal string concatenation does not preserve normalization. Thus the concatenation of two normalized strings A and B is not guaranteed to be normalized. However, it is easy to write an optimized normalized concatenation by breaking A into two parts A' and A" (where A' ends with the last final-inert character in A), and breaking B into two parts B' and B" (where B" starts with the first initial-inert character in B), then returning A' + normalize(A" + B') + B".
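The optimized concatenation can be sketched in Python. This sketch uses NFD rather than NFC because a conservative inertness test is easy to state for NFD: it assumes that a character with canonical combining class 0 that is its own NFD form is both initial-inert and final-inert. That test, and the names is_nfd_inert and concat_nfd, are assumptions made for this illustration.

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize("NFD", s)

def is_nfd_inert(ch):
    """Conservative test (an assumption of this sketch): ch is a starter
    and is unchanged by NFD."""
    return unicodedata.combining(ch) == 0 and nfd(ch) == ch

def concat_nfd(a, b):
    """Concatenate two strings that are already in NFD, renormalizing
    only the stretch around the seam."""
    # A' ends with the last inert character in a (empty if there is none).
    cut_a = 0
    for i in range(len(a) - 1, -1, -1):
        if is_nfd_inert(a[i]):
            cut_a = i + 1
            break
    # B'' starts with the first inert character in b (empty if there is none).
    cut_b = len(b)
    for i, ch in enumerate(b):
        if is_nfd_inert(ch):
            cut_b = i
            break
    # A' + normalize(A'' + B') + B''
    return a[:cut_a] + nfd(a[cut_a:] + b[:cut_b]) + b[cut_b:]

left = "a\u0308"    # a + COMBINING DIAERESIS (already NFD)
right = "\u0327b"   # COMBINING CEDILLA + b (already NFD)
joined = concat_nfd(left, right)
```

In this example only the two combining marks at the seam are passed to the normalizer; the 'a' and 'b' on either side are copied through unchanged.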

    PDx4. Closed
    A set S is closed under a relation R if, for all elements x and y, whenever x is in S and x R y, then y is in S.
     
    The closure of a set of strings is often useful in implementations. For example, in implementing collation it is useful to pre-generate collation weights for the closure of each of the tailored strings under NFD; that makes it unnecessary to normalize the text at runtime in most cases.
     
    A relation can be used to generate a partition of elements where each partition is a minimal set of elements closed under R. That partition can then be used to generate a folding. For example, the closure of the relation x = toUppercase(y) OR x = toLowercase(y) OR x = toTitlecase(y) is used to generate the data for the Unicode case folding. Each partition contains all possible case variations of a string including upper, lower, title case and mixed case. The target member is chosen to be all lowercase.
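A rough Python sketch of generating one partition by closure and choosing its target member. It is a simplification: it applies Python's str.upper, str.lower and str.title in the forward direction only, starting from a seed, so it does not reach every mixed-case variant and is not how the published case folding data is actually derived. The names case_closure and target are hypothetical.

```python
def case_closure(seed):
    """Smallest set containing seed that is closed under the three
    case mappings (forward direction only)."""
    closed = {seed}
    todo = [seed]
    while todo:
        s = todo.pop()
        for variant in (s.upper(), s.lower(), s.title()):
            if variant not in closed:
                closed.add(variant)
                todo.append(variant)
    return closed

def target(members):
    """Pick the all-lowercase member as the target of the folding."""
    return next(m for m in sorted(members) if m == m.lower())

# U+01C6 has distinct lowercase, titlecase and uppercase forms.
dz_class = case_closure("\u01c6")
dz_target = target(dz_class)
```

For the seed U+01C6 the closure contains U+01C4, U+01C5 and U+01C6, and the target is the lowercase U+01C6.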