
 L2/00-401

{Proposed Draft} Unicode Technical Report #23

CHARACTER PROPERTY MODEL

Revision 0d6
Authors Ken Whistler and Asmus Freytag (asmus@unicode.org)
Date 2000-11-08
This Version http://www.unicode.org/unicode/reports/tr23-0d6.html
Previous Version none
Latest Version none

Summary

This report presents a character property model for the Unicode Standard.

Status of this document

This is the authors' sixth working draft. This version is presented to UTC#85 for discussion. The outcome of this discussion could be consensus on:
  1. the definitions
  2. the stability guarantees
  3. the maintenance process
  4. the status of and rules governing default properties

With this consensus, the authors can then proceed to draft the necessary introductory and explanatory language, as well as to complete the table to encompass any new properties recently added to the Unicode Character Database.

<sections marked like this contain reminders by the authors to themselves>

[Editorial notes for the benefit of reviewers are indicated like this.]

<Replace this section with the formal status language once this is approved as a proposed draft>

Contents

<Make real once the material has stabilized>

1 Scope
2 Property Model
3 Definitions
4 Conformance related considerations
5 Table of Character Properties
6 Notes on Particular Properties
7 Updating Properties and Extending the Standard
8 Special Property Values
9 Data Management and Distribution
10 References
Acknowledgements
Changes from previous drafts

1. Scope

The Unicode Standard views character semantics as inherent to the definition of a character.  The assignment of character semantics for the Unicode Standard is based on character behavior. For other character set standards, it is left to the implementer, or to unrelated secondary standards, to assign character semantics to characters. In contrast, the Unicode  Standard supplies a rich set of character attributes and properties for each character contained in it.  Many properties are specified in relation to processes or algorithms that interpret them, in order to implement the discovered character behavior.

This report specifically covers formal character properties, which are those attributes of characters that are specified according to the definitions set forth in this report. 

[The intention is to address the following aspects of properties...]

<complete or remove this list>

2. Property Model

[These are the points from the whiteboard  for the discussion at UTC#84 in Bedford]

raison d'être

character behavior

interpreted by process => formal property

property lists

property types

attrib to char/ code point

default values / does not apply / status unclear

[This is the text that I wrote with the above points in mind (AF). It's pretty rough still... AND it needs to change again, since the definitions got reworked]

2.1 Origin of Character Properties

In the Unicode Character Property Model, properties are inherent in the character. When modeling character behavior with computer processes, formal character properties are assigned in order to achieve the expected results. 

2.2 Formal Character Properties

<TBD>

2.3 Normative Properties

What does it mean for a property to be normative? By making a property normative, the Unicode Standard guarantees that conformant implementations can rely on the fact that other conformant implementations will interpret the character in the same way. This is most useful for those properties where the Unicode Standard provides precise rules for the interpretation of characters based on their properties. An example is the set of bidirectional properties and their use by the bidirectional algorithm. For some character properties, for example the general category, the implications of a character having a particular property value are less well defined, reducing the degree to which communicating implementations can rely on specific interpretations for the same character. In the absence of a well-defined interpretation it is meaningless to claim that a character property is normative.

Note: one trivial, but important instance of conformant implementation is runtime access to a character property database. For normative properties, conformant implementations guarantee that the returned values match the values defined by the Unicode Consortium.

Alternative: where only a few characters have normative properties, such as no-break characters for line breaking, explicitly duplicate that property in a separate normative list of Boolean properties.

2.4 Informative Properties

Properties may be informative for two main reasons.

  1. The nature of the property or the precise set of characters to which it applies is not well known, and it is therefore too early to assign a normative property. Even if there were a precise description of how to interpret such a property, the fact that it is subject to a (planned) revision makes it less attractive for communicating implementations to rely on the specified behavior.
  2. Existing implementations show a range of behaviors for the same character, many or all of which may be equally useful choices on the part of their designers. Assigning a normative property would imply an unwarranted restriction on existing and established practice.

2.5 Issues:

Issues with overloaded enumerations: If an enumerated property is cobbled together from non-overlapping Boolean properties, the result may be difficult to apply or extend. The same applies if one attempts to use enumerated properties to convey distinctions they were not designed for. One must either keep subdividing them into smaller subsets, using more values (something that cannot be done for a closed partition), or define alternative properties.

Issues with Boolean properties: If multiple Boolean properties are used to capture what are in effect mutually exclusive assignments of an enumerated value (for example the Boolean clones of the bidi properties in the PropList file), an essential fact, the mutual exclusiveness, can no longer be expressed in the property itself.

Issues with Normative Properties: There is a peculiar issue with partitions for which some values are normative and others are not. For the general category, the distinction becomes hard to follow. We are really not clear on this. There are many processes that we have not well defined (unlike bidi and collation). We assert the general category, but we don't make clear what model of processing we intend and what the required consequences are of a character being "Letter Other" as opposed to "Symbol Other".
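
The issue with Boolean properties can be illustrated with a minimal Python sketch. The three bidi values shown are only a subset chosen for illustration, and the names `BidiClass`, `enumerated`, `boolean_clones` and `overlapping_code_points` are hypothetical, not part of any Unicode data file. With an enumerated property each code point maps to exactly one value by construction; with Boolean clones the exclusiveness has to be checked from outside.

```python
from enum import Enum
from itertools import combinations


class BidiClass(Enum):
    # Illustrative subset of bidirectional classes, not the full set.
    L = "Left-to-Right"
    R = "Right-to-Left"
    EN = "European Number"


# Enumerated form: one value per code point, so values cannot overlap.
enumerated = {0x0041: BidiClass.L, 0x05D0: BidiClass.R, 0x0031: BidiClass.EN}

# Boolean-clone form: one set of code points per value (hypothetical names).
# U+0041 is deliberately listed twice to show what the format cannot prevent.
boolean_clones = {
    "bidi_L": {0x0041},
    "bidi_R": {0x05D0, 0x0041},
    "bidi_EN": {0x0031},
}


def overlapping_code_points(boolean_sets):
    """Return code points claimed by more than one Boolean property."""
    clashes = set()
    for (_, first), (_, second) in combinations(boolean_sets.items(), 2):
        clashes |= first & second
    return clashes
```

In this sketch `overlapping_code_points(boolean_clones)` reports U+0041, an inconsistency that the dictionary of `BidiClass` values cannot even represent.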

[ The UTC should focus on reviewing these definitions. ]

3. Definitions

PD1. Character Property
A character property defines a set of values and a mapping from each Unicode code point to one of the values of the set.
 
PD2. Property Value
One of the set of values associated with a character property.

For example, the [East Asian Width] property has the possible values "Narrow", "Neutral", "Wide", "Ambiguous" and "Unassigned".
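
As a rough sketch of PD1 and PD2, a property can be modeled in Python as a value set plus a mapping from code points to members of that set. The class `CharacterProperty`, its method `value_of`, the use of "Unassigned" as the fallback, and the two sample assignments are assumptions made for illustration; the table is deliberately tiny and is not data from the Unicode Character Database.

```python
from dataclasses import dataclass, field

MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode code space


@dataclass
class CharacterProperty:
    """A value set (PD2) and a mapping from code points to it (PD1)."""
    name: str
    values: frozenset
    default: str
    assignments: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every value handed out must belong to the declared value set.
        if self.default not in self.values:
            raise ValueError("default must be one of the property values")
        for value in self.assignments.values():
            if value not in self.values:
                raise ValueError(f"{value!r} is not a value of {self.name}")

    def value_of(self, code_point: int) -> str:
        """Map any code point to exactly one value of the set."""
        if not 0 <= code_point <= MAX_CODE_POINT:
            raise ValueError("not a Unicode code point")
        return self.assignments.get(code_point, self.default)


# Hypothetical, heavily abbreviated table using the value names given above.
east_asian_width = CharacterProperty(
    name="East Asian Width",
    values=frozenset({"Narrow", "Neutral", "Wide", "Ambiguous", "Unassigned"}),
    default="Unassigned",
    assignments={0x0041: "Narrow", 0x4E00: "Wide"},
)
```

A call such as `east_asian_width.value_of(0x4E00)` returns one member of the value set for every code point, which is the essential point of the definition.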
 
PD4. Limited Property
Sometimes, a given property does not make sense for a large number of characters. For example, case and case mapping information is not needed for unicameral scripts. Such information can in principle be left 'undefined' for the characters to which it does not apply. Such a property is called a limited property and applies to only a subset of Unicode characters, usually one script or a related family of scripts. Limited properties are implemented by giving a special 'does-not-apply' value to all characters to which the limited property does not apply.
 
PD3. Universal (Required) Property
A universal property applies to all Unicode characters. A universal property does not have a 'does-not-apply' value.
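
The difference between a limited and a universal property can be sketched as follows. This is an illustration only: it assumes Python's standard `unicodedata` module (which reflects whatever version of the Unicode Character Database ships with the interpreter, not this draft), derives case from the General Category values Lu, Ll and Lt as in the table in section 5, and uses the hypothetical names `DOES_NOT_APPLY` and `case_of`.

```python
import unicodedata

# Special value given to every character the limited property does not cover.
DOES_NOT_APPLY = "does-not-apply"

# Case derived from General Category, following the table in section 5.
CASED_CATEGORIES = {"Lu": "upper", "Ll": "lower", "Lt": "title"}


def case_of(ch: str) -> str:
    """Limited property: a case value, or DOES_NOT_APPLY for uncased characters."""
    return CASED_CATEGORIES.get(unicodedata.category(ch), DOES_NOT_APPLY)


def general_category_of(ch: str) -> str:
    """Universal property: every character has a value, so no special value is needed."""
    return unicodedata.category(ch)
```

For a letter of a unicameral script such as U+05D0 HEBREW LETTER ALEF, `case_of` yields the special value while `general_category_of` still yields an ordinary value.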
 
PD4. Enumerated Property
A property with a fixed set of values. This is sometimes also known as a partition.
 
As characters are added to the Unicode Standard, the set of values may need to be extended in the future, but it is advantageous to think of enumerated properties as having a fixed set of possible values.
 
PD5. Closed Enumeration
An enumerated property for which the set of values is closed (i.e. it may not be extended for future versions of the Unicode Standard).
 
Note: Currently, the General Category is the only closed enumeration, other than Boolean properties.
 
PD6. Single Valued (Boolean) Property
A closed enumerated property whose set of values is limited to 'true', 'false', [and possibly 'undetermined'. (See draft of section 8.)]
Essentially the presence or absence of the property is the important information.
 
PD7. Integral Property
An integral property can take on any integer value. An example is the decimal digit value property. There is no implied limit to the number of possible distinct values for the property, short of the limitations of representing integers in computers.
 
PD8. General Numeric Property
A general numeric property can take on any integer, real, or complex value. An example is the numeric value property. There is no implied limit to the number of possible distinct values for the property, short of the limitations of representing integers, real numbers, or complex numbers in computers.
 
PD9. Normative Property
A normative property has conformance implications, see 4. Conformance. A normative character property must be paired with a precise description of how to interpret characters with each normative property value in a conformant way.
Note: A normative process that depends in a normative and testable way on a property causes the property to be normative. For example, the interpretation of the bidirectional class is precisely defined in [Bidirectional Algorithm].
If a process does not interpret a given character, it may remain unaware of its properties, but it is recommended that processes use carefully chosen default values for characters that they don't handle.
PD10. Informative Property
An informative property is provided as helpful information to implementers. It places no requirements on implementations of the Unicode Standard.
 
Note: Informative properties capture expert implementation experience and their use is strongly recommended by the Consortium.
 
PD11. Simple Property
A property that applies to a character in isolation.
 
PD12. Character Behavior
A property that applies to a character in the context of a longer character sequence.
 
PD13. Stable Property
A property is stable with respect to a particular algorithm or process, if changes in the assignment of property values produce no changes in the outcome of the process or algorithm.
 
For example, while the absolute values of the canonical combining classes are not guaranteed to be the same between versions of the Unicode Standard, their relative values will be maintained. As a result, they are stable with respect to the Normalization Forms as defined in [Normalization].
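
The combining class example can be made concrete with a small sketch. It assumes a simplified canonical reordering step that stably sorts each run of characters with non-zero combining class; `canonical_reorder` and `rescaled` are hypothetical helper names, and `unicodedata.combining` returns the combining class from the database bundled with Python. Rescaling the class values in an order-preserving way leaves the result unchanged, which is what stability with respect to the process means here.

```python
import unicodedata
from typing import Callable


def canonical_reorder(text: str, ccc: Callable[[str], int]) -> str:
    """Stably sort each run of non-zero combining classes by class value."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is a stable sort
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


def rescaled(ch: str) -> int:
    """Different absolute values, same relative order (zero stays zero)."""
    return 2 * unicodedata.combining(ch)


# Acute accent (class 230) typed before dot below (class 220).
sample = "a\u0301\u0323"
original_order = canonical_reorder(sample, unicodedata.combining)
rescaled_order = canonical_reorder(sample, rescaled)
```

Both calls move the dot below in front of the acute accent, so `original_order` and `rescaled_order` are equal even though the absolute class values differ.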
 
PD14. Immutable Property
A property whose values, once assigned to a character, are fixed and will not be changed.   

Examples of immutable, or fixed, properties are the code position and name of each Unicode character.
 
PD15. Overridable Property
A property whose values may be overridden by a higher-level protocol.
PD16. Default Value
Value of a property to be used when encountering unassigned or unsupported characters. There may be more than one default value per property.
 
 

4. Conformance related considerations

[To be rewritten based on the final wording of the definitions.]

Character properties may be either normative or informative. Normative means that implementations that claim conformance to the Unicode Standard (at a particular version) and that make use of a particular property or field must follow the specifications of the standard for that property or field in order to be conformant. The term normative, when applied to a property or field of the Unicode Character Database, does not mean that the value of that field will never change. Corrections and extensions to the standard in the future may require minor changes to normative values, even though the Unicode Technical Committee strives to minimize such changes. An informative property or field is strongly recommended, but a conformant implementation is free to use or change such values as it may require while still being conformant to the standard. Particular implementations may choose to override the properties and mappings that are not normative. In that case, it is up to the implementer to establish a protocol to convey that information.

5. Table of Character Properties

The following table attempts to list all character properties defined by the Unicode Standard and associated Unicode Standard Annexes, Unicode Technical Standards, or Unicode Technical Reports.

[Largely unchanged. The plan is to add the information specified in the PropList file in this table as well.]

Name                             Where Specified                                       N/I  Data Type               Notes
Alias                            [NamesList]                                           I    Text                    see [NamesList-Format]
Alphabetic                       [Unicode], Section 4.10                               I    Boolean
Arabic and Syriac Shaping Class  [ArabicShaping]                                       N    Enum
Bidirectional Class              [UnicodeData], Field 4                                N    Enum                    see [Bidirectional Algorithm]
Canonical Decomposition          [UnicodeData], Field 5                                N    Code Sequence           no prefix, see also Compatibility Mapping
Case                             [UnicodeData], Field 2 = Lu | Ll | Lt                 N    Enum                    Derived: General Category = Lu | Ll | Lt
Case Folding                     [CaseFolding]                                         I    Code point              see [Case Mapping]
Case Mapping (Lower)             [UnicodeData], Field 13                               N    Code point              only 1:1 mappings that are locale independent
Case Mapping (Upper)             [UnicodeData], Field 12                               N    Code point              only 1:1 mappings that are locale independent
Case Mapping (Title)             [UnicodeData], Field 14                               N    Code point              only 1:1 mappings that are locale independent
Case Mapping (Special)           [SpecialCasing]                                       I    Code sequence           see [Case Mapping]
Code point                       [UnicodeData], Field 0                                N    Code point
Combining Class                  [UnicodeData], Field 3                                N    0..255
Comments                         [NamesList]                                           I    Text                    see [NamesList-Format]
Compatibility Mapping            [UnicodeData], Field 5                                I    Prefix + Code Sequence  prefix indicating mapping type
Control                          [UnicodeData], Field 2 = Cc                           N    Boolean                 Derived: General Category = Cc
Cross Mappings                   Various                                               I                            see [Character Mapping Tables]
Dashes                           [Unicode], Table 6-2                                  N    Boolean                 207B, 208B, 2212 | General Category = Pd
Default Sort Weight              Various                                                                            see [Collation]
Digit, Decimal                   [UnicodeData], Field 2 = Nd                           N    Boolean                 Derived: General Category = Nd
Digit, Decimal Value             [UnicodeData], Field 6                                N    0..9
Digit, Value                     [UnicodeData], Field 7                                N    Integer
East Asian Width                 [EastAsianWidth]                                      I    Enum                    see [East Asian Width]
General Category                 [UnicodeData], Field 2                                N/I  Enum                    see [UnicodeData-Format]
Identifier Extend                [UnicodeData], Field 2 = Mn | Mc | Nd | Pc | Cf       I    Boolean                 Derived: General Category = Mn | Mc | Nd | Pc | Cf
Identifier Start                 [UnicodeData], Field 2 = Lu | Ll | Lt | Lm | Lo | Nl  I    Boolean                 Derived: General Category = Lu | Ll | Lt | Lm | Lo | Nl
Ideographic                      Derived                                               I    Boolean
ISO comments                     [UnicodeData], Field 11                               N    Text
Jamo Short Name                  [Jamo]                                                N    Text
Letter                           Derived                                               I    Boolean                 see [Unicode], Section 4.10
Line Breaking Property           [LineBreak]                                           I    Enum
Mathematical Property            [Unicode], Section 4.9                                I    Boolean
Mirrored                         [UnicodeData], Field 9                                N    Boolean
Name                             [UnicodeData], Field 1                                N    Text
Numeric Value                    [UnicodeData], Field 8                                I    Real [Complex]
Ideographs, Primary Numeric      [Unicode], Table 4-7                                  I    Integer
Ideographs, Accounting Numbers   [Unicode], Table 4-8                                  I    Integer
Private Use                      [UnicodeData], Field 2 = Co                           N    Boolean                 Derived: General Category = Co
Related characters               [NamesList]                                           I    Text
Script                           [ScriptNames]                                         I    [Text / Enum?]          see [Script Names]
Space                            [UnicodeData], Field 2 = Zs                           N    Boolean                 Derived: General Category = Zs
Surrogates                       [UnicodeData], Field 2 = Cs                           N    Text                    Derived: General Category = Cs
Unicode 1.0 Name                 [UnicodeData], Field 10                               N    Text

Notes on the table

Code points and code sequences

Code points can range from 0000 to 10FFFF and code sequences are sequences of code points. There is no pre-determined maximum length for a code sequence.
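
A minimal sketch of reading these data types from text follows. It assumes the hexadecimal notation used in the data files, with code points in a sequence separated by spaces and the mapping-type prefix written as a tag in angle brackets (for example `<super> 0032`). The function names are hypothetical, and the upper bound of 10FFFF is the limit of the Unicode code space.

```python
from typing import List, Optional, Tuple

MAX_CODE_POINT = 0x10FFFF


def parse_code_point(text: str) -> int:
    """Parse one hexadecimal code point and check that it is in range."""
    value = int(text, 16)
    if not 0 <= value <= MAX_CODE_POINT:
        raise ValueError(f"{text!r} is outside the Unicode code space")
    return value


def parse_code_sequence(text: str) -> List[int]:
    """Parse a space-separated code sequence of any length."""
    return [parse_code_point(token) for token in text.split()]


def parse_compatibility_mapping(text: str) -> Tuple[Optional[str], List[int]]:
    """Split an optional <prefix> tag from the code sequence that follows it."""
    prefix = None
    if text.startswith("<"):
        tag, _, text = text.partition(">")
        prefix = tag[1:]  # drop the leading "<"
    return prefix, parse_code_sequence(text)
```

A canonical decomposition, which carries no prefix, comes back with `None` in the first position, while a compatibility mapping carries its mapping type there.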

Derived Properties

Character properties that are noted as derived can be implied from other character properties. For practical and historical reasons, the [UnicodeData] file is considered the primary source of information in determining the direction of such 'derivation'. The notation Field = X | Y means that the value of the field is either X or Y for the property to be true.
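
The derivation can be sketched in a few lines of Python. The sketch assumes that a [UnicodeData] record is a single line of semicolon-separated fields numbered from 0, with the General Category in field 2 as listed in the General Category row of the table. The two sample records, the dictionary `DERIVED` and the function `derived_properties` are illustrative, and only a subset of the derived properties is shown.

```python
# Two sample records in the semicolon-separated layout of UnicodeData.txt.
LATIN_CAPITAL_A = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
DIGIT_ZERO = "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;"

GENERAL_CATEGORY_FIELD = 2  # assumption: fields are counted from 0

# "Field = X | Y" from the table, written as sets of allowed values.
DERIVED = {
    "Case": {"Lu", "Ll", "Lt"},
    "Control": {"Cc"},
    "Digit, Decimal": {"Nd"},
    "Identifier Start": {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"},
    "Identifier Extend": {"Mn", "Mc", "Nd", "Pc", "Cf"},
    "Space": {"Zs"},
}


def derived_properties(record: str) -> dict:
    """Evaluate each derived Boolean property for one UnicodeData record."""
    category = record.split(";")[GENERAL_CATEGORY_FIELD]
    return {name: category in allowed for name, allowed in DERIVED.items()}
```

For the first record this marks Case and Identifier Start as true; for the second it marks Digit, Decimal and Identifier Extend as true.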

Cross correlations between properties

Some properties are maintained in or may be inferred from more than one location. As a result, several correlations between properties are true by design:

Decimal Digit Value present :=: General Category = Nd
Decimal Digit Value present :=: Decimal Digit Value = Numeric Value
Digit Value present :=: Digit Value = Numeric Value
[More to be added]
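
These correlations can be spot-checked with a short sketch. It reads `:=:` as "implies", which is an assumption about the notation, and it relies on Python's standard `unicodedata` module, whose data reflects the database version bundled with the interpreter and not this draft. The function name `correlations_hold` is hypothetical.

```python
import unicodedata


def correlations_hold(ch: str) -> bool:
    """Check the three listed correlations for a single character."""
    decimal = unicodedata.decimal(ch, None)  # Decimal Digit Value, if present
    digit = unicodedata.digit(ch, None)      # Digit Value, if present
    numeric = unicodedata.numeric(ch, None)  # Numeric Value, if present

    if decimal is not None:
        # Decimal Digit Value present implies General Category = Nd ...
        if unicodedata.category(ch) != "Nd":
            return False
        # ... and Decimal Digit Value = Numeric Value.
        if decimal != numeric:
            return False
    # Digit Value present implies Digit Value = Numeric Value.
    if digit is not None and digit != numeric:
        return False
    return True
```

For example, U+00B2 SUPERSCRIPT TWO has a Digit Value but no Decimal Digit Value, so only the last correlation is exercised for it, while U+00BD VULGAR FRACTION ONE HALF has a Numeric Value only.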

6. Notes on Particular Properties

[The purpose of this section is not quite clear - it started as a critique of the 'general  category' and is changing into a description of what that is. Should this section eventually contain all the descriptions of properties not defined in the book or a TR? ]

6.1 General Category

[NEW TEXT:]

The General Category is a broad categorization of all characters according to their principal use. It is specifically designed to support a wide variety of common parsing tasks, including, but not limited to, identifier syntax, regular expression processing, and word boundary detection. Many tasks will require specific overrides or specializations for some characters, in some cases different overrides depending on locale. For some tasks, the specializations needed were extensive enough to warrant a separate property; for an example see [Line Breaking]. In other cases, alternative categorizations are needed that overlap some of the General Category values, but not others. For example, the Mathematical Property is true for all characters with General Category = Sm, but the reverse is not true.

[OLD TEXT:] <see if there is anything that needs to be retained, otherwise delete>

Enumerated properties can be very procrustean if one attempts to make them convey distinctions they were not designed for. Therefore, each of the following enumerated properties that partition the Unicode character space should be limited to a single, well-defined purpose.

The values of the bidi-class are frozen by the bidi algorithm. One could reassign a character to a new value from the existing set, if a truly mistaken assignment was somehow found, but adding a new value to the set would invalidate the existing algorithm.

In a similar manner, a new Line Breaking Property would require additions to the set of Line Breaking Rules contained in [Line Breaking]. However, since the line breaking rules in [Line Breaking] are not normative in the way the bidi-algorithm is, and because line breaking rules do get customized in implementations, it is quite conceivable that, as more scripts are added, or as more is learned about handling certain scripts, additional line breaking properties are found to be required. In this case, subdividing the partition would be the right answer.

The general category is special in the sense that some of its values are better defined than others. Letters, digits, combining marks, spaces, and formatting characters are some of the well-defined values. The subdivision of the remainder into subclasses is more problematic. The problem with subdividing punctuation into opening and closing is not only that this cannot be done uniquely (see the quotation marks) but also that this distinction does not belong in the "general" category; the line breaking partition is a better place for this information.

Some of the existing sub-categories of the general category should be deprecated with pointers to other partitions or property files where these classifications belong. In other words, the general category could be better off if it contained only 'P' for punctuation, but not the overly detailed sub-types, since they tend to give only a partial answer. All in all, this is really no different from the use of 1:1 case transforms in [UnicodeData] and the existence of a separate SpecialCasing.txt to give the complete answer.

Letter versus Symbol are not "just varieties of the same specific property." The term "Symbol" is defective in the General Category partition. It is the left-over junk category, and the way the General Category values are assigned, with an implicit, unstated hierarchy of properties, means that "Symbol" is not coherent in any obvious way. Furthermore, there are instances of letters functioning as symbols (e.g., the letterlike symbols), and instances of symbols functioning as letters (e.g., some of the modifier letters).

7. Updating Properties and Extending the Standard 

[The material in this section is under active discussion in the Unicode Consortium. The text below represents an initial suggestion by the authors and does not reflect a consensus by the Consortium.  These suggestions may be accepted or rejected by the Unicode Technical Committee.]

Updates to the Unicode Character Database can be required for three reasons:

  1. To cover new characters added to the Unicode Standard
  2. To add new properties
  3. To change the assigned values for a property for some characters

Changing a character's property assignment invalidates existing implementations and is therefore something that is done judiciously and with great care, and only when there is no better alternative.

When updates are made, certain guarantees apply.

7.1 Guarantees

Some aspects of properties are guaranteed to be invariant for all or some properties.

Where stability of assignment is not guaranteed, the stability of an assignment of a property value will depend on the character and how well understood it is and will be less a function of normative vs. informative status.

[Capture property related guarantees here from 'policies' page on the website - in the view of the authors, the formal specification of these guarantees belongs in this report, and the web-page will serve merely as a summary. This matter is under discussion. In any event, the material needs to be reworded in light of the definitions that are available here (i.e. 'stable property').]

 

8. Special Property Values

[The material in this section is under active discussion in the Unicode Consortium. The text below represents an initial suggestion by the authors and does not reflect a consensus by the Consortium. These suggestions may be accepted or rejected by the Unicode Technical Committee.]

8.1 N/A Value

Limited properties apply to only a subset of characters. Where these properties are implemented as a partition (required property), the characters to which the property does not apply are given a special value denoting that the property does not apply.

8.1 Default Value

All implementations of the Unicode Standard should endeavor to handle additions to the character repertoire gracefully. In some cases this may require that an implementation attempt to 'anticipate' likely property values for code points for which characters have not yet been defined, but where surrounding characters exist that make it probable that similar characters will be assigned to the code point in question.

There are three strategies:

  1. Rely on a recommendation from the Unicode Consortium. For example, for the Bidirectional Class, the Unicode Consortium has published recommended default values for all code points.
  2. Treat the unassigned areas of a block as if they had a property value common to other characters of the block. A variation of this scheme bridges 'holes' in the allocation by looking at the property value for characters bracketing the hole.
  3. Give unassigned code locations a default property value that will result in graceful, if not completely correct, behavior if encoded characters are later encountered at those locations.

Each of these strategies has advantages and drawbacks, and none can guarantee that the behavior of an implementation that is conformant to a prior version of the Unicode Standard will support characters added in a later version of the Unicode Standard in precisely the same way as an implementation that is conformant to the later version. The most that can be hoped for is that the earlier implementation will behave gracefully in such circumstances.
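
The hole-bridging variation of the second strategy, combined with the third strategy as a fallback, might be sketched as follows. Everything in the sketch is hypothetical: the block boundaries, the property values "letter" and "mark", the fallback value "unknown", and the function names. It is one possible reading of the strategies, not a recommended algorithm.

```python
from typing import Dict, Optional


def bridged_value(assigned: Dict[int, str], code_point: int,
                  block_start: int, block_end: int) -> Optional[str]:
    """Return the value shared by the nearest assigned neighbors on both
    sides of a hole within the block, or None if they are missing or differ."""
    below = next((assigned[cp]
                  for cp in range(code_point - 1, block_start - 1, -1)
                  if cp in assigned), None)
    above = next((assigned[cp]
                  for cp in range(code_point + 1, block_end + 1)
                  if cp in assigned), None)
    if below is not None and below == above:
        return below
    return None


def property_value(assigned: Dict[int, str], code_point: int,
                   block_start: int, block_end: int, fallback: str) -> str:
    """Assigned value if known, else a bridged guess, else the fallback."""
    if code_point in assigned:
        return assigned[code_point]
    guess = bridged_value(assigned, code_point, block_start, block_end)
    return guess if guess is not None else fallback


# Toy block 1000..100F with made-up assignments (not real character data).
toy_block = {0x1000: "letter", 0x1001: "letter", 0x1003: "letter", 0x1005: "mark"}
```

Here the hole at 1002 is bridged because both neighbors agree, while 1004 sits between disagreeing neighbors and 100A has no assigned neighbor above it, so both receive the fallback value.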

8.2 Undetermined Property Values

For many archaic scripts (as well as for not yet fully implemented modern ones) essential characteristics of many characters may not be knowable at the time of their publication. In these cases the proper assignments of property values for newly encoded characters cannot be reliably determined at the time the characters are first added to the Unicode Standard, or for a new property, when the property is first added to the Unicode Character Database. In these cases, and where the property is a required property, it will be given a value of 'undetermined', or 'unknown at time of publication'.

8.3 Preliminary Property Assignments

Sometimes, a determination and assignment of property values can be made, but the information on which it was based may be incomplete or preliminary. In such cases, the property value may be changed when better information becomes available. Currently, there is no machine readable way to provide information about the confidence of a property assignment; however, the text of the Standard or a Technical Report defining the property may provide general indications of preliminary status of property assignments where they are known.

9. Data Management and Distribution

[This needs more discussion. The plan for this section is to include a description of the process used to maintain the Unicode Character Database. An initial draft follows:]

The Unicode Character Database is provided as a collection of flat text files, as described in [UnicodeCharacterDatabase]. Each version of the database has its own numbered directory on the Unicode ftp site [ftp-http]. In each versioned directory, the names of all files carry their version number. The latest version of the database is replicated in the directory named UNIDATA [Unidata], which uses constant file names (i.e. without version number). It is thus possible to reference either a specific version or the latest version of a database file (or the database itself) without the need to update the links.

Whenever an update of the database is being developed, a beta version may be released in a directory whose name, like the names of all files it contains, includes the word 'beta'. This directory is provided solely for review purposes and will not be maintained after the end of the beta period.

<TBD: Examples of file names>.

For more information about the versions of the Unicode Standard see [UnicodeVersions].

9.1 File syntax

All files need a (formal) syntax description at least as detailed as the one namelist.txt has. It is probably useful to set up three or four templates for this now, so that we don't get new formats for each proposed property.

(In all but UnicodeData, names should be strictly optional, and we should have utilities that can strip these fields.)

For UnicodeData we could either do the same, or pop a new file for the new planes with either 5 digits or an understood plane offset. Splitting that file would be nice since it will get truly big in no time.

9.2 Types of files:

[There may be a need to formally discuss 'descriptive files' where small sets of complex exceptional cases are discussed in what amounts to non-machine readable text.]

10. References

[ArabicShaping]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/ArabicShaping.txt>
[Bidirectional Algorithm]
Mark Davis, Unicode Standard Annex #9: The Bidirectional Algorithm, <http://www.unicode.org/unicode/reports/tr9>
[CaseFolding]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt>
[Case Mapping] 
Mark Davis, Unicode Technical Report #21: Case Mapping, <http://www.unicode.org/unicode/reports/tr21>
[Character Mapping Tables]
Mark Davis, Unicode Technical Report #22: Character Mapping Tables, <http://www.unicode.org/unicode/reports/tr22>
[Collation]
Mark Davis, Unicode Technical Report #10: Collation, <http://www.unicode.org/unicode/reports/tr10>
[EastAsianWidth]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt>
[East Asian Width]
Asmus Freytag, Unicode Standard Annex #11: East Asian Width, <http://www.unicode.org/unicode/reports/tr11>
[Jamo]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/Jamo.txt>
[LineBreak]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/LineBreak.txt>
[Line Breaking]
Asmus Freytag, Unicode Standard Annex #14: Line Breaking Properties, <http://www.unicode.org/unicode/reports/tr14>
[NamesList]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/NamesList.txt>
[NamesList-Format]
Readme file <ftp://ftp.unicode.org/Public/UNIDATA/NamesList.html>
[SpecialCasing]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt>
[Unicode]
The Unicode Standard, Version 3.0, Addison Wesley Longman, 2000.
[UnicodeCharacterDatabase]
Readme file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html>
[UnicodeData]
Data file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>
[UnicodeData-Format]
Readme file <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html>
[UnicodeVersions]
Versions of the Unicode Standard <http://www.unicode.org/unicode/standard/versions>

Acknowledgements

The authors wish to thank Mark Davis for his insightful comments.

Changes from previous drafts

Changes from second working draft: None; this is the first draft submitted to UTC.

Changes from third working draft: Reordered according to UTC feedback. Added information about maintenance, extended discussion on updates.

Changes from fourth working draft: Reworded the definitions based on feedback from Mark Davis. Some other minor changes.

Changes from fifth working draft: Added a few definitions. Cleaned up some of the text.


Copyright © 2000-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.