Unicode Technical Report #23

CHARACTER PROPERTY MODEL

Revision 0d3
Authors Ken Whistler and Asmus Freytag (asmus@unicode.org)
Date 2000-07-17
This Version http://www.unicode.org/unicode/reports/tr23-0d3.html
Previous Version none
Latest Version none

Summary

This report presents a character property model for the Unicode Standard.

Status of this document

This is the third working draft by the authors, submitted for review by the UTC for approval as a "Proposed Draft". There are still several loose ends, as well as color markings that are meaningful only to one of the authors (AF).

This is in response to a work item assigned to Ken and myself two UTC meetings ago.

<Make real once this is approved as a proposed draft>

Contents

<Make real>

1 Scope
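2 Overview
2.1 Partitions - General Category
2.2 Notes on "General Category"
2.3 Normative Properties
2.4 Informative Properties
2.5 Partition sets (Required properties)
2.6 Subdivision
3 Definitions
4 Conformance
5 Updating Properties and Extending the Standard
6 Table of Character Properties
7 References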
Acknowledgements
Revisions

1. Scope

This report intends to address the following aspects of properties

2. Overview

The Unicode Standard views character properties as inherent to the definition of a character. The Standard therefore supplies a rich set of character properties for each character contained in it. For other character set standards, it is left to the implementer, or to unrelated secondary standards, to assign character properties to characters. Character properties for the Unicode Standard are not so much assigned, as "discovered".

2.1 Partitions - General Category

The General Category in the Unicode Character Database is defined as a partition. It has a certain number of values, and every Unicode character must have exactly one of those values. That is just the nature of its definition. But partitions can be very procrustean if one attempts to make them convey distinctions they were not designed for. Therefore, each partition of the Unicode character space must be limited to a single, well-defined purpose. Because of that restriction, there are now many partitions of the Unicode character space, among them:

The values of the bidi-class are frozen by the bidi algorithm. One could reassign a character to a new value from the existing set, if a truly mistaken assignment was somehow found, but adding a new value to the set would invalidate the existing algorithm.

In a similar manner, a new Line Breaking Property would require additions to the set of Line Breaking Rules contained in [Line Breaking]. However, since the line breaking rules in [Line Breaking] are not normative in the way the bidi algorithm is, and because line breaking rules do get customized in implementations, it is quite conceivable that, as more scripts are added or as more is learned about handling certain scripts, additional line breaking properties will be found to be required. In this case, subdividing the partition would be the right answer.

The General Category is special in the sense that some of its values are better defined than others. Letters, digits, combining marks, spaces, and formatting characters are some of the well-defined values. The subdivision of the remainder into subclasses is more problematic. The problem with subdividing punctuation into opening and closing is not only that this cannot be done uniquely (see the quotation marks), but also that this distinction does not belong in the "general" category; the line breaking partition is a better place for this information.

The UTC decision not to subdivide the General Category further ought to be paired with deprecating some of the existing sub-categories, with pointers to the other partitions or property files where these classifications belong. In other words, the General Category could be better off if it contained only 'P' for punctuation, without the overly detailed sub-types, since they tend to give only a partial answer.

All in all, this is really no different from the use of 1:1 case transforms in [UnicodeData] and the existence of a separate SpecialCasing.txt to give the complete answer.

The math property is another good example. It is not derivable from the General Category values now, even though "Sm" is a subtype of the General Category. The only way to get the math property is to separately list it for all characters; and in fact there is such a listing provided now in PropList.txt. (PropList.txt also lists many other character properties which are not derivable from the General Category.)
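The point can be illustrated with a minimal sketch that compares the General Category of a few characters with a separately listed property. The set MATH_SAMPLE and the helper names are hypothetical and chosen for illustration only; the set is a hand-picked sample, not the actual content of PropList.txt.

```python
import unicodedata

# Hypothetical, hand-picked sample of characters treated as mathematical.
# This is an assumption for illustration, not the contents of PropList.txt.
MATH_SAMPLE = {"+", "^", "\u2032"}  # PLUS SIGN, CIRCUMFLEX ACCENT, PRIME


def is_math(ch: str) -> bool:
    """Look the property up in the explicit list, not in the General Category."""
    return ch in MATH_SAMPLE


def math_not_sm(chars):
    """Return the listed characters whose General Category is not Sm."""
    return sorted(c for c in chars
                  if is_math(c) and unicodedata.category(c) != "Sm")
```

In this sample only PLUS SIGN has the General Category value Sm; the other two characters would be missed by any test that relies on Sm alone.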

Letter and Symbol are not "just varieties of the same specific property." The term "Symbol" is defective in the General Category partition. It is the left-over junk category, and the way the General Category values are assigned, with an implicit, unstated hierarchy of properties, means that "Symbol" is not coherent in any obvious way. Furthermore, there are instances of letters functioning as symbols (e.g., the letterlike symbols we are talking about), and instances of symbols functioning as letters (e.g., some of the modifier letters).

2.2 Notes on "General Category"

The "general category" combines several properties under one umbrella. 

2.3 Normative Properties

What does it mean for a property to be normative? By making a property normative, the Unicode Standard guarantees that conformant implementations can rely on the fact that other conformant implementations will interpret the character in the same way. This is most useful for those properties where the Unicode Standard provides precise rules for the interpretation of characters based on their properties. An example is the set of bidirectional properties and their use by the bidirectional algorithm. For some character properties, for example the General Category, the implications of a character having a particular property value are less well defined, reducing the degree to which communicating implementations can rely on specific interpretations for the same character. In the absence of a well-defined interpretation it is meaningless to claim that a character property is normative.

Issues with Normative Properties: There is a peculiar issue with partitions for which some values are normative and others are not. For the General Category, the distinction becomes hard to follow. We are really not clear on this. There are many processes that we have not defined well (unlike bidi and collation). We assert the general property, but we don't make clear what model of processing we intend and what the required consequences are of a character being "Letter Other" as opposed to "Symbol Other".

Alternatives: Where only a few characters have normative properties, such as no-break characters for line breaking, explicitly duplicate that property in a separate normative list of (individual, non-partition) properties.

2.4 Informative Properties

Properties may be informative for two main reasons.

  1. The nature of the property or the precise set of characters to which it applies is not well known, and it is therefore too early to assign a normative property. Even if there were a precise description of how to interpret such a property, the fact that it is subject to a (planned) revision makes it less attractive for communicating implementations to rely on the specified behavior.
  2. Existing implementations show a range of behaviors for the same character, many or all of which may be equally useful choices on the part of their designers. Assigning a normative property would imply an unwarranted restriction on existing and established practice.

2.5 Partition sets (Required properties)

Some operations require that a unique property value from a given category is applied to all characters in the Unicode Standard. In other words, these values form a partition over the set of characters. Examples are the general categories, and the bidirectional classes.

Issues with Partitions: If a partition is cobbled together from non-overlapping...

But partitions can be very procrustean, indeed, if you attempt to make them convey distinctions they were not designed for. Either you keep subdividing and subdividing them into smaller subsets, using more values (which the UTC has decided not to do for the General Category), or you find another way.

2.6 Subdivision

Multi-valued properties where we now have partitions: The General Category is the worst offender, since it does not limit its model (unlike bidi or line breaking). Therefore it becomes hard to determine what is appropriate when controls are both controls and also whitespace (that was where we got started with the UTC discussion).

Perhaps we formally redesignate the General Category as the 'primary' category of a character and duplicate *all* of its values in a set of binary properties à la PropList.txt, but with possibly overlapping values. On the face of it, this seems more helpful to the programmers who have to solve the 'isxxx()' question in a process-dependent and very concrete model, not the average case presupposed by the General Category.
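A minimal sketch of such overlapping binary properties follows. The function names are hypothetical, and the whitespace test simply delegates to Python's str.isspace(), which is one possible definition and not one taken from this report.

```python
import unicodedata


def is_control(ch: str) -> bool:
    """Binary property derived from the General Category value Cc."""
    return unicodedata.category(ch) == "Cc"


def is_whitespace(ch: str) -> bool:
    """Binary property as Python's str.isspace() defines it (an assumption)."""
    return ch.isspace()


def binary_properties(ch: str) -> set:
    """Collect every binary property that holds; the results may overlap."""
    checks = {"control": is_control, "whitespace": is_whitespace}
    return {name for name, check in checks.items() if check(ch)}
```

With these definitions, binary_properties("\t") contains both "control" and "whitespace", which a single partition value cannot express.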

3. Definitions

PD1. Character Property
A character property is a set of values that can be applied to some or all Unicode characters.
 
PD2. Property Value
One of the set of values associated with a character property.

For example, the [East Asian Width] property has the possible values "Narrow", "Wide", "Halfwidth", "Fullwidth", "Ambiguous" and "Neutral".
 
PD3. Universal (Required) Property
A universal property applies to all Unicode characters.
 
PD4. Limited Property
Sometimes a given property does not make sense for a large number of characters. For example, case and case mapping information is not needed for unicameral scripts. Such information can in principle be left 'undefined' for the characters to which it does not apply. Such a property is called a limited property; it applies to only a subset of Unicode characters, usually one script or a related family of scripts. It is trivially possible to turn a limited property into a universal one by giving a special 'does-not-apply' value to all characters to which the limited property does not apply. This is sometimes done to turn a limited property into a partition.
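The conversion from a limited to a universal property can be sketched as follows. The mapping table, the sentinel DOES_NOT_APPLY and the function names are all hypothetical and serve only to illustrate the idea.

```python
# Hypothetical limited property: a 1:1 lowercase mapping known only for a few
# cased characters. The table contents are illustrative, not normative data.
LOWER_MAP = {"A": "a", "B": "b", "\u0391": "\u03b1"}  # last entry: GREEK ALPHA

DOES_NOT_APPLY = "does-not-apply"  # hypothetical sentinel value


def lower_limited(ch):
    """Limited property: undefined (None) outside its domain."""
    return LOWER_MAP.get(ch)


def lower_universal(ch):
    """Universal property: every character gets exactly one value."""
    return LOWER_MAP.get(ch, DOES_NOT_APPLY)
```

For a character from a unicameral script, such as a CJK ideograph, lower_limited() yields no value at all, while lower_universal() yields the sentinel.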
 
PD5. Partition
A universal, enumerated property creates a partition over the set of Unicode characters: each character is associated with precisely one value from a fixed set of property values.
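As an illustration, the following sketch tallies the General Category over a range of code points using Python's standard unicodedata module. The helper name partition_counts is an assumption, and the data reflects whatever version of the Unicode Character Database ships with the Python interpreter in use.

```python
import unicodedata
from collections import Counter


def partition_counts(code_points):
    """Count characters per General Category value.

    unicodedata.category() returns exactly one value for every code point,
    including unassigned ones (reported as 'Cn'), so the counts always add
    up to the number of code points examined.
    """
    return Counter(unicodedata.category(chr(cp)) for cp in code_points)
```

Because each code point contributes to exactly one bucket, the sum of the counts equals the size of the range, which is the defining feature of a partition.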
 
PD6. Single Valued (Boolean) Property
A single-valued property can be represented by a Boolean property with 'false' assigned to all characters to which the property does not apply. Essentially, the presence or absence of the property is the important information.
 
PD7. Enumerated Property
A property with a fixed set of values. As characters are added to the Unicode Standard, the set of values may need to be extended in the future, but it is advantageous to think of enumerated properties as having a fixed set of possible values.
 
PD8. Unconstrained Property
An unconstrained property can take on any integer or real value. An example is the numeric value property. There is no implied limit to the number of possible distinct values for the property, short of the limitations of representing integers or real numbers in computers.
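A small sketch of reading such a value with Python's standard unicodedata module is given below. The helper numeric_value and the choice to return an exact fraction (with denominators assumed to stay below 1000) are illustrative assumptions, not part of the property definition.

```python
import unicodedata
from fractions import Fraction


def numeric_value(ch):
    """Return the numeric value of ch as an exact fraction, or None."""
    value = unicodedata.numeric(ch, None)
    if value is None:
        return None
    # unicodedata.numeric() returns a float; recover a small exact fraction.
    return Fraction(value).limit_denominator(1000)
```

The result is not drawn from a fixed list of values: a digit yields an integer, while VULGAR FRACTION ONE HALF (U+00BD) yields one half.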
 
PD9. Normative Property
A normative property has conformance implications; see Section 4, Conformance. A normative character property must be paired with a precise description of how to interpret characters with each normative property value in a conformant way.

For example, the interpretation of the bidirectional class is precisely defined in [Bidirectional Algorithm].
 
PD10. Informative Property
An informative property is provided as helpful information to implementers. It places no requirements on implementations of the Unicode Standard.

4. Conformance

Character properties may be either normative or informative. Normative means that implementations that claim conformance to the Unicode Standard (at a particular version) and that make use of a particular property or field must follow the specifications of the standard for that property or field in order to be conformant. The term normative, when applied to a property or field of the Unicode Character Database, does not mean that the value of that field will never change. Corrections and extensions to the standard in the future may require minor changes to normative values, even though the Unicode Technical Committee strives to minimize such changes. An informative property or field is strongly recommended, but a conformant implementation is free to use or change such values as it may require while still being conformant to the standard. Particular implementations may choose to override the properties and mappings that are not normative. In that case, it is up to the implementer to establish a protocol to convey that information.
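One way an implementation might keep its overrides separate from the published values is sketched below. The class PropertyOverrides, the DEFAULTS table and the decision to reject overrides of normative properties are all assumptions made for illustration; the normative or informative status of the two sample properties follows the table in Section 6.

```python
# Hypothetical property table: name -> ("N" or "I", {character: value}).
# The sample values are illustrative only.
DEFAULTS = {
    "Bidirectional Class": ("N", {"A": "L"}),
    "Line Breaking Property": ("I", {"A": "AL"}),
}


class PropertyOverrides:
    """Allow local overrides for informative properties only."""

    def __init__(self):
        self._overrides = {}

    def override(self, prop, ch, value):
        status, _ = DEFAULTS[prop]
        if status == "N":
            raise ValueError(f"{prop} is normative and cannot be overridden")
        self._overrides[(prop, ch)] = value

    def lookup(self, prop, ch):
        _, values = DEFAULTS[prop]
        # A local override wins; otherwise fall back to the published value.
        return self._overrides.get((prop, ch), values.get(ch))
```

Keeping the overrides in their own layer also makes it straightforward to enumerate them when a protocol is needed to convey them to another implementation.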

5. Updating Properties and Extending the Standard 

5.1 Guarantees

We need to make clear when and to what degree these guarantees apply.
Some of them are a function of the character (and how well understood it is) and less a function of the normative/informative distinction.

5.2 Properties that are Unknown

For many archaic scripts (as well as for modern ones that are not yet fully implemented), essential characteristics of many characters may not be knowable at the time of their publication. We need an 'unknown at time of publication' value to avoid having to override assignments all the time.

5.3 Handling Preliminary Assignments

A binary property "preliminary" could also be used to alert implementers that property assignments are not final for some combination of property and character.
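The two ideas above, an 'unknown' value and a 'preliminary' marker, could be modeled as a status attached to each combination of character and property. The following is a minimal sketch; the names Assignment and PropertyRecord are hypothetical and do not correspond to any existing data file.

```python
from dataclasses import dataclass
from enum import Enum


class Assignment(Enum):
    """Hypothetical status of one (character, property) assignment."""
    FINAL = "final"
    PRELIMINARY = "preliminary"
    UNKNOWN = "unknown at time of publication"


@dataclass(frozen=True)
class PropertyRecord:
    value: object                      # None when the value is unknown
    status: Assignment = Assignment.FINAL

    def is_reliable(self) -> bool:
        """Only final assignments should be relied upon."""
        return self.status is Assignment.FINAL
```

An implementation could then treat anything that is not final as subject to change, without having to override the published value.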

5.4 File syntax

All files need a (formal) syntax description at least as detailed as the one NamesList.txt has. It is probably useful to set up three or four templates for this now, so that we do not get a new format for each proposed property.

(In all but UnicodeData, names should be strictly optional, and we should have utilities that can strip these fields.)

For UnicodeData we could either do the same, or pop a new file for the new planes with either 5 digits or an understood plane offset. Splitting that file would be nice since it will get truly big in no time.

5.5 Types of files

There may be a need to formally discuss 'descriptive files' where small sets of complex exceptional cases are discussed in what amounts to non-machine readable text.

6. Table of Character Properties

The following table attempts to list all character properties defined by the Unicode Standard and associated Unicode Standard Annexes, Unicode Technical Standards, or Unicode Technical Reports.

Name                             Where Specified          N/I  Type            Notes
Alias                            [NamesList]              I    Text            see [NamesList-Format]
Alphabetic                       [Unicode], Section 4.10  I    Boolean
Arabic and Syriac Shaping Class  [ArabicShaping]          N    Enum
Bidirectional Class              [UnicodeData], Field 4   N    Partition       see [Bidirectional Algorithm]
Canonical Decomposition          [UnicodeData], Field 5   N    Code Sequence   no prefix, see also Compatibility Mapping
Case                             Derived                  N    Enum            General Category = {Lu, Ll, Lt}
Case Folding                     [CaseFolding]            I    Code value      see [Case Mapping]
Case Mapping (Lower)             [UnicodeData], Field 13  N    Code value      only 1:1 mappings that are locale independent
Case Mapping (Upper)             [UnicodeData], Field 12  N    Code value      only 1:1 mappings that are locale independent
Case Mapping (Title)             [UnicodeData], Field 14  N    Code value      only 1:1 mappings that are locale independent
Case Mapping (Special)           [SpecialCasing]          I    Code Sequence   see [Case Mapping]
Code Value                       [UnicodeData], Field 0   N    Code value
Combining Class                  [UnicodeData], Field 3   N    0..255
Comments                         [NamesList]              I    Text            see [NamesList-Format]
Compatibility Mapping            [UnicodeData], Field 5   I    Code Sequence*  * includes prefix indicating mapping type
Control                          Derived                  N    Boolean         General Category = Cc
Cross Mappings                   Various                  I                    see [Character Mapping Tables]
Dashes                           [Unicode], Table 6-2     N    Boolean         207B, 208B, 2212 | General Category = Pd
Default Sort Weight              Various                                       see [Collation]
Digit, Decimal                   Derived                  N    Boolean         General Category = Nd
Digit, Decimal Value             [UnicodeData], Field 6   N    0..9
Digit, Value                     [UnicodeData], Field 7   N    Integer
East Asian Width                 [EastAsianWidth]         I    Enum            see [East Asian Width]
General Category                 [UnicodeData], Field 2   N/I  Enum            see [UnicodeData-Format]
Identifier Extend                Derived                  I    Boolean         General Category = Mn | Mc | Nd | Pc | Cf
Identifier Start                 Derived                  I    Boolean         General Category = Lu | Ll | Lt | Lm | Lo | Nl
Ideographic                      Derived                  I    Boolean
ISO comments                     [UnicodeData], Field 11  N    Text
Jamo Short Name                  [Jamo]                   N    Text
Letter                           Derived                  I    Boolean         see [Unicode], Section 4.10
Line Breaking Property           [LineBreak]              I    Enum
Mathematical Property            [Unicode], Section 4.9   I    Boolean
Mirrored                         [UnicodeData], Field 9   N    Boolean
Name                             [UnicodeData], Field 1   N    Text
Numeric Value                    [UnicodeData], Field 8   I    Real
Ideographs, Primary Numeric      [Unicode], Table 4-7     I    Integer
Ideographs, Accounting Numbers   [Unicode], Table 4-8     I    Integer
Private Use                      Derived                  N    Boolean         General Category = Co
Related characters               [NamesList]              I    Text
Script                           [ScriptNames]            I                    see [Script Names]
Space                            Derived                  N    Boolean         General Category = Zs
Surrogates                       Derived                  N    Text            General Category = Cs
Unicode 1.0 Name                 [UnicodeData], Field 10  N    Text
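
The field numbers in the table refer to the semicolon-separated fields of a record in [UnicodeData]. The following minimal sketch splits one such record into named fields. The function name and the dictionary keys are hypothetical labels derived from the table rows, not identifiers defined by the standard.

```python
# Keys are listed in field order (0 to 14), following the field numbers
# given in the table. The key names themselves are illustrative labels.
FIELDS = [
    "code_value", "name", "general_category", "combining_class",
    "bidirectional_class", "decomposition", "decimal_digit_value",
    "digit_value", "numeric_value", "mirrored", "unicode_1_name",
    "iso_comment", "uppercase", "lowercase", "titlecase",
]


def parse_unicode_data_line(line: str) -> dict:
    """Split one semicolon-separated record into named fields."""
    parts = line.rstrip("\n").split(";")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))
```

For example, the record "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;" yields the General Category "Lu" in field 2 and the lowercase mapping "0061" in field 13.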

Notes on the table

Derived Properties

Character properties that are noted as derived can be inferred from other character properties. For practical and historical reasons, the [UnicodeData] file is considered the primary source of information in determining the direction of such 'derivation'.
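
As a minimal sketch, two of the derived properties can be computed from the General Category using the rules given in the Notes column of the table. The function names are assumptions, and Python's unicodedata module stands in for a lookup in [UnicodeData].

```python
import unicodedata

# Derivation rules taken from the Notes column of the table.
IDENTIFIER_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
IDENTIFIER_EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}


def is_identifier_start(ch: str) -> bool:
    """Derived Boolean property: General Category is a letter or Nl."""
    return unicodedata.category(ch) in IDENTIFIER_START


def is_identifier_extend(ch: str) -> bool:
    """Derived Boolean property: mark, decimal digit, connector or format."""
    return unicodedata.category(ch) in IDENTIFIER_EXTEND
```

Because both functions consult only the General Category, they need no data of their own, which is what makes the properties derived.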

Multiply specified properties

Some properties are maintained in or may be inferred from more than one location. As a result, several correlations between properties are true by design:

Decimal Digit Value present :=: General category = Nd
Decimal Digit Value present :=: Decimal Digit Value = Numeric Value
Digit Value present :=: Digit Value = Numeric Value
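
These correlations can be checked mechanically. The sketch below reads each line as "the left side implies the right side", which is an interpretation of the ':=:' notation, and uses Python's unicodedata module as a stand-in for the fields of [UnicodeData]. The function name is hypothetical.

```python
import unicodedata


def correlation_violations(chars):
    """Return the characters that break any of the three correlations."""
    bad = []
    for ch in chars:
        decimal = unicodedata.decimal(ch, None)
        digit = unicodedata.digit(ch, None)
        numeric = unicodedata.numeric(ch, None)
        if decimal is not None and unicodedata.category(ch) != "Nd":
            bad.append(ch)   # decimal digit value without category Nd
        elif decimal is not None and decimal != numeric:
            bad.append(ch)   # decimal digit value differs from numeric value
        elif digit is not None and digit != numeric:
            bad.append(ch)   # digit value differs from numeric value
    return bad
```

Running the check over a handful of characters such as "0", ARABIC-INDIC DIGIT THREE (U+0663), SUPERSCRIPT TWO (U+00B2), "A" and VULGAR FRACTION ONE HALF (U+00BD) should report no violations.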

 

7. References

[ArabicShaping]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/ArabicShaping.txt>
[Bidirectional Algorithm]
Mark Davis, Unicode Standard Annex #9: The Bidirectional Algorithm, <http://www.unicode.org/unicode/reports/tr9>
[CaseFolding]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt>
[Case Mapping] 
Mark Davis, Unicode Technical Report #21: Case Mapping, <http://www.unicode.org/unicode/reports/tr21>
[Character Mapping Tables]
Mark Davis, Unicode Technical Report #22: Character Mapping Tables, <http://www.unicode.org/unicode/reports/tr22>
[Collation]
Mark Davis, Unicode Technical Report #10: Collation, <http://www.unicode.org/unicode/reports/tr10>
[EastAsianWidth]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt>
[East Asian Width]
Asmus Freytag, Unicode Standard Annex #11: East Asian Width, <http://www.unicode.org/unicode/reports/tr11>
[Jamo]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/Jamo.txt>
[LineBreak]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/LineBreak.txt>
[Line Breaking]
Asmus Freytag, Unicode Standard Annex #14: Line Breaking Properties, <http://www.unicode.org/unicode/reports/tr14>
[NamesList]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/NamesList.txt>
[NamesList-Format]
Readme file <ftp://ftp.unicode.org/Public/UNIDATA/NamesList.html>
[SpecialCasing]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt>
[Unicode]
The Unicode Standard, Version 3.0, Addison Wesley Longman, 2000.
[UnicodeCharacterDatabase]
Readme file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html>
[UnicodeData]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>
[UnicodeData-Format]
Readme file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html>

 

Changes from previous drafts

First draft submitted to UTC.


Copyright © 2000-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.