
Unicode Properties in Character Proposals

Introduction

Characters in the Unicode Standard have a number of properties, which software uses to determine character behavior. For example, the case property (whether a character is uppercase or lowercase) affects how the character is handled by software that capitalizes words in English. A character's properties may identify whether it is a letter, a number, or a mark of punctuation, whether it belongs to a script that runs right to left or left to right, and so forth. These properties drive computer processes such as capitalization, searching, and spell-checking. If properties are incorrectly identified, text that is pasted into a document may get reversed, the cursor may not work as expected, or text may not lay out correctly on a page.

Some of these properties are obvious and easily discovered, and some are not. Some properties are assigned automatically (such as Derived Age, which tells when a character was added to the standard); others are easily assigned because they are implicit in the character name or in other information readily supplied by a proposal author. For general information on character properties, see Chapter 4, Character Properties, in the Unicode Standard.

For reference, a more-or-less complete list of properties can be found online here:

https://www.unicode.org/reports/tr44/#Properties

Property information must be supplied at the time new characters are published in the standard. The questions and discussion below have been developed to get proposal authors and committee members thinking about this issue. For each character in a proposal, the proposal author should consider the character in context, and answer questions about how the character interacts with other characters.

Basic Information

The most basic information required about characters includes name, code point, and other identity information, such as whether a character goes by more than one name, or can be cross-referenced to another character. This information is included in the names list of a proposal, accompanying glyphs of the proposed characters.

Code points: The code points are typically assigned by the standards committees (WG2 and UTC), but if the proposal is for an entire script, it is probably already on the roadmap, and therefore a particular range of code points may already have been pre-selected. In other cases, those proposing characters can make recommendations about where the characters should be encoded, but it isn't necessary to do so.

Names: If there are alternative names for a character or characters in the proposal, these should also be discussed, along with any other information about the meanings of the names and any similarities in behavior to characters that are already encoded in the standard.

A sample listing of code points and names is the following:

	0391 GREEK CAPITAL LETTER ALPHA
	0392 GREEK CAPITAL LETTER BETA
	0393 GREEK CAPITAL LETTER GAMMA
	0394 GREEK CAPITAL LETTER DELTA

General Category and Other Properties

Each character is assigned a "General Category". The general category should be specified in a separate "Character Properties" section of a proposal.

The general category properties are documented in the Unicode Character Database (UCD). Typical categories include things such as "letter", "combining mark", "symbol" and so forth. The preferred format for listing the character properties is that found in the file https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt. The following line is the UnicodeData.txt entry for Greek uppercase gamma:

0393;GREEK CAPITAL LETTER GAMMA;Lu;0;L;;;;;N;;;;03B3;

One of the easiest ways to provide character properties is to find a similar character that is already encoded, and copy its properties, inserting the appropriate code point and name, and other changes as applicable. The listing of all the characters and their properties is located in the file https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt.
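The copy-and-adapt approach described above can be sketched in a few lines of Python. This is only an illustration: the function name `derive_entry`, the code point E000, and the character name are hypothetical placeholders, not part of any real proposal. The sketch clears the three case mapping fields, since those point at specific code points and must be filled in by hand.

```python
def derive_entry(template_line, code_point, name):
    """Draft a UnicodeData.txt-style line from an existing entry.

    Copies every field of the template, then replaces the code point and
    name. The last three fields (simple upper/lower/titlecase mappings)
    are cleared, because they refer to other specific characters.
    """
    fields = template_line.strip().split(";")
    fields[0] = f"{code_point:04X}"
    fields[1] = name
    for index in (12, 13, 14):
        fields[index] = ""
    return ";".join(fields)


# Template: the entry for Greek uppercase gamma quoted above.
template = "0393;GREEK CAPITAL LETTER GAMMA;Lu;0;L;;;;;N;;;;03B3;"

# Hypothetical code point and name, used purely as placeholders.
draft = derive_entry(template, 0xE000, "HYPOTHETICAL CAPITAL LETTER EXAMPLE")
print(draft)
```

Any field that differs from the template character, such as the case mappings, still has to be reviewed individually.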

Note: If the character property information is still puzzling, then describe the character's use, answering the questions in the Appendix for each character.

The fields in UnicodeData.txt, separated by semicolons, are the following (given below with the values from the example above):

Code point:  0393

Name: GREEK CAPITAL LETTER GAMMA

General Category: Lu (for Letter uppercase)

Canonical Combining Class: 0 (this category provides information as to where a given character is placed in relation to another character, for example where a diacritic is placed; gamma is a spacing character, and as such it receives combining class 0)

Bidirectional Class: L (for strong left-to-right directionality of the script)

Decomposition Type/Decomposition Mapping: (left blank as there is no decomposition into other characters)

Numeric Type/Numeric Value: (three consecutive fields, giving the decimal digit value, the digit value, and the numeric value; all three are left blank as this is not a number)

Bidi Mirrored: N (this character is not mirrored in bidirectional text)

Unicode 1 Name: (left blank as there was no Unicode 1.0 name)

ISO Comment: (left blank as there is no ISO comment)

Simple Uppercase Mapping: (left blank since this is already uppercase)

Simple Lowercase Mapping: 03B3 (the code point for GREEK SMALL LETTER GAMMA, the lowercase form to which this character maps)

Simple Titlecase Mapping: (left blank as no Unicode titlecase character for uppercase gamma is encoded)

The above properties in UnicodeData.txt are documented in UAX #44, "Unicode Character Database": https://www.unicode.org/reports/tr44/#Properties
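As a rough aid for checking a drafted entry, the semicolon-separated line can be split into named fields. The sketch below assumes the 15-field layout of UnicodeData.txt, in which three consecutive numeric fields follow the decomposition field. The Python names in `FIELDS` are illustrative labels chosen for this example, not official property names.

```python
# Illustrative labels for the 15 semicolon-separated fields, in file order.
FIELDS = [
    "code_point", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal_digit_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
]


def parse_unicode_data_line(line):
    """Split one UnicodeData.txt-style line into a dict of named fields."""
    values = line.strip().split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


entry = parse_unicode_data_line(
    "0393;GREEK CAPITAL LETTER GAMMA;Lu;0;L;;;;;N;;;;03B3;"
)
for field, value in entry.items():
    print(f"{field:28} {value!r}")
```

A field count other than 15 usually means a semicolon was dropped or added while editing the line.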

For a useful Excel spreadsheet that shows the Unicode character properties with informational notes, see SIL's "Unicode Character Properties Excel Workbook" at http://scripts.sil.org/ExcelUnicodeData.

Line Breaking

Line breaking behavior affects how lines of text fit into a graphic window. As a window is re-sized to be narrower, the words are made to wrap automatically to the next line. Specific line breaking properties affect how characters behave at the ends and beginnings of lines, as the line ends change. For example, in the expression "$ .01" the dollar sign should stay with the following number when it occurs at the end of a line, even though a space intervenes; $ on one line and .01 on the next wouldn't typically be allowed. Closing punctuation marks such as ")" would typically not be allowed as the first character on a line. Defining "line breaking" for characters used in historic scripts may seem anachronistic, but you will need to consider how a modern edition may lay out an ancient text on a page or in a text window on a computer.

As with the character properties, information on line breaking should be included in a separate section of a proposal for new characters.

Determining line breaking can be tricky, but many characters simply behave "just like" some other characters. One way to determine the line breaking property is to determine if there is a character already in the standard that behaves similarly, or identically, to the given character in terms of line breaking, and to use the line breaking properties of the already encoded character, as given in UAX #14, Unicode Line Breaking Algorithm.
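To look up the line breaking class of an already encoded character, one option is to read a data file in the style of LineBreak.txt, the data file that accompanies UAX #14. The sketch below is an assumption-laden illustration: the sample lines are hand-written in a simplified "code point or range; class" layout rather than copied verbatim from the file, and the function names are invented for this example. Characters missing from the sample fall back to XX (Unknown).

```python
# Hand-written sample in a simplified LineBreak.txt-like layout.
SAMPLE = """\
0020;SP # SPACE
0024;PR # DOLLAR SIGN
0028;OP # LEFT PARENTHESIS
0029;CP # RIGHT PARENTHESIS
0041..005A;AL # LATIN CAPITAL LETTER A..Z
"""


def parse_line_break_data(text):
    """Return a list of (first, last, line_break_class) tuples."""
    table = []
    for raw in text.splitlines():
        data = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        span, cls = [part.strip() for part in data.split(";")]
        first, _, last = span.partition("..")
        low = int(first, 16)
        high = int(last, 16) if last else low
        table.append((low, high, cls))
    return table


def line_break_class(ch, table):
    """Look up the class of one character; XX if it is not in the table."""
    cp = ord(ch)
    for low, high, cls in table:
        if low <= cp <= high:
            return cls
    return "XX"


table = parse_line_break_data(SAMPLE)
for ch in "$()M":
    print(repr(ch), line_break_class(ch, table))
```

Here the dollar sign has class PR (prefix numeric), which is what keeps it with a following number, and the closing parenthesis has class CP, which keeps it off the start of a line.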

Another way to determine line breaking is to describe the line breaking properties of the characters, based on responses to the following questions:

  • Can it appear at the end of a line? Beginning of a line?
  • Does it have special or unusual behavior near the ends of lines? If so, describe the special behavior.
  • Does it come between letters and cause them to not be breakable at the end of a line? Or can surrounding characters be broken across the line even when this character is before/after?
  • Is the character a math or technical operator? A "technical operator" would be a character that acts like a math operator but in non-math contexts, for example in a programming language, a grammar, or other semantic notation.
    • If it is a math operator, is it binary or unary, or other?
    • Does it have the "math" property, or not?
    • Does it stretch or change in appearance depending on context (e.g., like summation or integrals)?
  • Does the character belong to any of the special categories, such as hyphen, dash, or diacritic? These categories are special because they are used for determining other kinds of character behavior. A verbal description of how a given character behaves is advised for such special categories.

Collation and Ordering Issues

Characters are often ordered in relation to other characters. For symbols, the default order in which they happen to appear in the standard often doesn't matter very much. However, for characters that are part of an alphabet or syllabary, the default order is often quite important. If you are proposing a whole script, the binary order (the order in which the characters are listed in the standard) of the proposal is often taken as the first approximation of an expected ordering. If there are reasons why the binary order differs from the expected "native" ordering, these should be justified and spelled out in a separate section of the character proposal. Otherwise, the characters in the proposal should simply be laid out in a logical, expected ordering. A simple listing of the characters in the expected order is recommended, such as the following for Kaithi consonants: ka, kha, ga, gha, etc.

If two or more orders occur with some frequency (for example, there might be differences in how characters are ordered depending on the language being sorted), it is helpful to discuss such differences, and include both orders in the text of the proposal.

If you are proposing additional characters in a script that is already encoded, show where the characters should be sorted in relation to the other characters already encoded. For example, if new syllables in the Vai or Yi scripts are to be added, their binary order (where they are encoded) may be very different from where they should occur in the expected native syllabic order. The proposal needs to specify exactly where the characters should be interpolated in sorting.
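The difference between binary order and an expected native order can be illustrated with a small sort. The alphabet below is a made-up fragment, assumed only for this example, in which U+014B (ŋ) sorts directly after "n" even though its code point lies far beyond "z". This is a simple rank-based key, not an implementation of the Unicode Collation Algorithm.

```python
# Hypothetical native alphabet fragment: U+014B sorts right after "n".
NATIVE_ORDER = ["a", "b", "n", "\u014b", "o", "z"]
RANK = {ch: position for position, ch in enumerate(NATIVE_ORDER)}


def native_key(word):
    """Sort key that ranks each letter by its place in NATIVE_ORDER."""
    return [RANK[ch] for ch in word]


words = ["zo", "\u014bo", "oa", "no"]

binary_sorted = sorted(words)                  # plain code point order
native_sorted = sorted(words, key=native_key)  # expected native order

print("binary:", binary_sorted)
print("native:", native_sorted)
```

In binary order the word beginning with U+014B comes last; with the native key it is interpolated between the words beginning with "n" and "o", which is the kind of placement a proposal needs to state explicitly.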

For historic scripts, particularly those that are still not fully understood, it may be difficult to specify the ordering. In this case, provide your best guess, but it is advisable to rely, if possible, on the order given in available standard handbooks or dictionaries.

Besides the primary and secondary order issues for the letters and digits, the proposal author also needs to provide some information about how "special" characters behave: whether they are simply ignored for collation, or have some special order. Special characters might include symbols, punctuation, and so on. Deciding how they sort is often the hardest part of coming up with collation table assignments. It may help to think in terms of whether such symbols might behave like other characters already encoded.

For a technical overview of sorting behavior see the introductory portions of UTS #10, The Unicode Collation Algorithm, especially sections 1.0, 1.1, 1.8, and 1.9.

Use in Identifiers

In a section of the proposal, include a comment on the potential use of a character in identifiers. Identifiers are letters, numbers, or symbols used in domain names (such as "paypal.com") or as variables in programming languages. The questions below can assist you in providing the necessary information for the Unicode Technical Committee.

  • Can the character be used in identifiers? Normally only modern-use letters, marks, and numbers are permitted in identifiers (used, for example, in programming languages, user names, international domain names, etc.).
  • Is it a character in customary modern use, e.g. commonly used in newspapers, magazines, and so on in one or more living languages?
  • If the character is not a letter, mark, or number, but is deemed necessary to be in identifiers, provide justifications.
  • If allowable in identifiers, can the character start an identifier, or would it only be used as a non-first character? (Most characters that are allowed in identifiers can be the first character.) Any special handling or considerations should be spelled out. For example, a part number like "X2b-31c" or a model number like "325i" is an identifier.
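The questions above can be explored with a rough check based on General Category, using Python's standard `unicodedata` module. This is a simplified approximation, loosely modeled on the default identifier syntax of UAX #31; the actual definition relies on derived properties such as ID_Start and ID_Continue, and the category sets and function names below are assumptions made for this sketch.

```python
import unicodedata

# Approximation: letters and letter numbers may start an identifier.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Marks, decimal digits, and connector punctuation may continue one.
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def can_start_identifier(ch):
    return unicodedata.category(ch) in START_CATEGORIES


def can_continue_identifier(ch):
    return unicodedata.category(ch) in CONTINUE_CATEGORIES


def looks_like_identifier(text):
    """True if text fits the approximate start/continue pattern."""
    return (
        bool(text)
        and can_start_identifier(text[0])
        and all(can_continue_identifier(ch) for ch in text[1:])
    )


for sample in ["X2b", "X2b-31c", "325i"]:
    print(sample, looks_like_identifier(sample))
```

Under this approximation "X2b-31c" and "325i" are rejected, because of the hyphen and the leading digit respectively. Environments that accept such part or model numbers use their own identifier rules, which is why special handling should be spelled out in the proposal.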

Special Considerations for Bidirectional Text

Bidirectional refers to text such as mixed Hebrew or Arabic and English with parts of the text running in left-to-right and right-to-left directions. In the context of bidirectional text, how do the characters behave? Characters need to have their directionality specified as either L ("Left-to-Right"), R ("Right-to-Left"), AL ("Right-to-Left Arabic"), or neutral; and some discussion of this may be required in the proposal, for each such character.

Note that the directionality of "R" applies to strong directional characters for most Right-to-Left scripts, such as the Hebrew alphabet and related punctuation. The directionality "AL" is a special strong Right-to-Left direction, used only for Arabic, Thaana, and Syriac alphabets and most punctuation specific to those scripts.

Also, special symbols need to be compared to the behavior of other special symbols in bidi, and the directional class of numbers needs to be specified.
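When comparing a proposed character with existing ones, the directional class of already encoded characters can be inspected with Python's standard `unicodedata` module. The sketch below is only a lookup aid; the wrapper name `bidi_class` and the choice of sample characters are assumptions for illustration.

```python
import unicodedata


def bidi_class(ch):
    """Return the Bidi_Class abbreviation recorded for one character."""
    return unicodedata.bidirectional(ch)


samples = [
    ("A", "Latin letter"),
    ("\u05D0", "Hebrew letter alef"),
    ("\u0627", "Arabic letter alef"),
    ("1", "European digit"),
    ("\u0661", "Arabic-Indic digit one"),
    ("(", "left parenthesis"),
]

for ch, description in samples:
    print(f"U+{ord(ch):04X} {description:24} {bidi_class(ch)}")
```

The output shows strong classes (L, R, AL) for the letters, separate number classes (EN, AN) for the two kinds of digits, and a neutral class (ON) for the parenthesis.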

Shaping behavior refers to changes in a character's shape based on context, such as whether it appears at the beginning, middle, or end of a word. In Arabic, almost all letters have special requirements for how they appear depending on positional context, and are divided into various shaping classes. If you are working on Arabic or scripts with similarly complex shaping behavior, see UAX #9, The Unicode Bidirectional Algorithm, as well as Chapter 8, Middle Eastern Scripts in the Unicode Standard.

In scripts such as Arabic, the shaping classes and behavior need to be explicitly determined for each such letter:

  • Is there an Arabic letter with similar or identical shaping behavior?
  • Does it belong to an existing shaping class?
  • Would the character normally be mirrored if used in right-to-left text?
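For the mirroring question, it can help to check how comparable characters are already treated. The following sketch uses Python's standard `unicodedata.mirrored`, which returns 1 for characters recorded as mirrored in bidirectional text and 0 otherwise; the helper name `is_mirrored` is an illustrative assumption.

```python
import unicodedata


def is_mirrored(ch):
    """True if the character is recorded as mirrored in bidi text."""
    return unicodedata.mirrored(ch) == 1


for ch in ["(", "[", "<", "A", "\u0628"]:
    print(f"U+{ord(ch):04X} mirrored={is_mirrored(ch)}")
```

Paired punctuation and comparison operators are typically mirrored, while letters (including the Arabic letter beh, U+0628, in the sample) are not.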

Special Considerations for CJK Additions

The addition of CJK ideographs is usually handled by the Ideographic Rapporteur Group (IRG), but in rare cases, a proposal for CJK characters may be presented to the UTC. If the character is a CJK ideograph, most of its properties should be identical to those of all other CJK ideographs, so a whole set of questions is already answered. However, there are some other questions:

  • Does it have special numerical significance?
  • Is it some kind of variant of an existing CJK character?

CJK characters also need a substantial amount of associated data, as specified in the Unihan documentation. See UAX #38, Unicode Han Database for details.

Appendix

Answering the questions below will provide basic information to allow the Unicode Technical Committee members to determine a character's properties. Provide a description of each character's use, with examples if possible.

A. Some scripts have case, and if so, it will be necessary to know:

  • Is it uppercase, lowercase, or uncased? If uppercase or lowercase, what are the case mappings? (These mappings refer to a property that identifies the other element of a case pair, for example the uppercase mapping of "m" is "M".) Uppercase and titlecase characters must have lowercase mappings.
  • Is it a titlecase digraph, e.g., the Unicode character U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z (which looks like "Dz")?
  • Does it have complex or non-standard case mapping behavior? (e.g., Turkish dotless i)
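The case questions above can be explored for existing characters with Python's built-in string methods. Note the assumption: `str.upper()` and `str.lower()` apply Python's own default case conversions, which are not the same thing as the Simple Uppercase and Simple Lowercase Mapping fields, so this sketch is a sanity check only. The helper name `case_report` is invented for the example.

```python
import unicodedata


def case_report(ch):
    """Summarize the category and default case conversions of a character."""
    return {
        "category": unicodedata.category(ch),
        "upper": ch.upper(),
        "lower": ch.lower(),
    }


# An ordinary case pair.
print(case_report("m"))

# U+01F2, the titlecase digraph mentioned above (category Lt).
print(case_report("\u01F2"))

# Turkish dotless i (U+0131): uppercasing and then lowercasing with the
# default conversions does not return the original character.
dotless = "\u0131"
round_trip = dotless.upper().lower()
print(repr(dotless), "->", repr(dotless.upper()), "->", repr(round_trip))
```

The failed round trip for dotless i is an example of the complex case behavior that a proposal should describe in words rather than leave to the default mappings.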

B. Is the character an ordinary letter of an alphabet or syllabary (non-CJK ideograph)? Or is it a stand-alone symbol? (For CJK ideographs, see the special section above.)

C. Is the character a white-space character, or does it cause visible separation between other characters?

D. Does the character have a numeric value? If so, is it a decimal digit, or is it a "digit" of some other non-decimal numbering system?

  • If the character is a true decimal digit (i.e., it forms decimal radix numbers like European numbers), then the General_Category value is Nd and all three numeric fields should have a numeric value filled in (for example, for CHAKMA DIGIT NINE, the General_Category is Nd and 9 is inserted in the three numeric fields: 1113F;CHAKMA DIGIT NINE;Nd;0;L;;9;9;9;N;;;;;).
  • If the character is any other kind of number, even if it has a numeric value from 1 through 9, then the General_Category value is No (or Nl), and only the third numeric field should be filled in (for example, AEGEAN NUMBER NINE is not a decimal radix number, so it is No and 9 appears only in the third numeric field: 1010F;AEGEAN NUMBER NINE;No;0;L;;;;9;N;;;;;).
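The two rules above can be turned into a small consistency check on a drafted entry. The sketch below works directly on the semicolon-separated line and encodes only the guidance given in this section; the function names are hypothetical, and other General_Category values are simply not checked.

```python
def numeric_fields(line):
    """Return the three numeric fields (positions 6, 7, 8) as a tuple."""
    fields = line.strip().split(";")
    return tuple(fields[6:9])


def numeric_fields_consistent(line):
    """Check the numeric fields against the General_Category guidance."""
    fields = line.strip().split(";")
    category = fields[2]
    decimal, digit, numeric = fields[6:9]
    if category == "Nd":
        # Decimal digits: all three fields filled in with the same value.
        return bool(decimal) and decimal == digit == numeric
    if category in ("No", "Nl"):
        # Other numbers: only the third numeric field is filled in.
        return not decimal and not digit and bool(numeric)
    return True  # no rule from this section applies


chakma = "1113F;CHAKMA DIGIT NINE;Nd;0;L;;9;9;9;N;;;;;"
aegean = "1010F;AEGEAN NUMBER NINE;No;0;L;;;;9;N;;;;;"

print(numeric_fields(chakma), numeric_fields_consistent(chakma))
print(numeric_fields(aegean), numeric_fields_consistent(aegean))
```

An entry marked Nd with only the third field filled in would be reported as inconsistent.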

E. Is it a "base letter" or does it combine with letters or symbols?

F. If it is a combining character:

  • How does it combine? Above? Below? After? Are there particularly strong restrictions on how it is displayed, such as being centered or to the left/right of a base character?
  • Does it bind very tightly to letters, such as some vowel signs do?
  • Is it completely non-spacing, or does it combine but also have spacing characteristics?
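Existing combining characters offer useful points of comparison for these questions. The sketch below uses Python's standard `unicodedata` module to show the General Category and Canonical Combining Class of a few encoded characters; the helper name `combining_report` and the sample selection are assumptions for illustration.

```python
import unicodedata


def combining_report(ch):
    """Return (general category, canonical combining class) for a character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


samples = [
    ("a", "base letter"),
    ("\u0301", "combining acute accent (above)"),
    ("\u0323", "combining dot below"),
    ("\u093E", "Devanagari vowel sign AA (spacing)"),
]

for ch, description in samples:
    category, ccc = combining_report(ch)
    print(f"U+{ord(ch):04X} {description:36} {category} ccc={ccc}")
```

The non-spacing marks have category Mn with a non-zero combining class (230 for above, 220 for below), while the spacing vowel sign has category Mc and combining class 0.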

G. If this is a punctuation character:

  • Is it terminal punctuation (i.e., ending a clause or a sentence)?
  • Is it paired with anything else? For example, "(" is paired with ")",  "[" is paired with "]".
  • Does it separate words? If so, does it occur exclusively before or after words?
  • Does it occur within words?
  • Does it occur within (as opposed to at the end of) sentences?
  • Can it appear at the end of a line? Beginning of a line?
  • Does it come between letters and cause them to not be breakable at the end of a line?
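Several of the punctuation questions correspond to distinctions already made by General Category. As a point of comparison, the sketch below lists the category of some encoded punctuation marks using Python's standard `unicodedata` module; the helper name `punctuation_category` is an assumption for this example, and the category alone does not answer questions such as pairing or line breaking.

```python
import unicodedata


def punctuation_category(ch):
    """Return the two-letter General Category of a character."""
    return unicodedata.category(ch)


samples = [
    ("(", "opening bracket"),
    (")", "closing bracket"),
    (".", "full stop"),
    ("-", "hyphen-minus"),
    ("\u00AB", "left-pointing double angle quotation mark"),
    ("\u00BB", "right-pointing double angle quotation mark"),
]

for ch, description in samples:
    print(f"U+{ord(ch):04X} {description:44} {punctuation_category(ch)}")
```

Opening and closing punctuation (Ps, Pe), initial and final quotes (Pi, Pf), dashes (Pd), and other punctuation (Po) are kept apart, so identifying the closest existing analogue helps settle the category of a proposed mark.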