Unicode Standard Annex #28

Unicode 3.2

Version Unicode 3.2.0
Authors Members of the Editorial Committee
Date 2002-03-27
This Version http://www.unicode.org/unicode/reports/tr28/tr28-3
Previous Version N/A
Latest Version http://www.unicode.org/unicode/reports/tr28
Tracking Number 3

Summary

This document defines Version 3.2 of the Unicode Standard. 

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex. It is a stable document and may be used as reference material or cited as a normative reference from another document.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. Note that conformance to a version of the Unicode Standard includes conformance to its Unicode Standard Annexes. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.

The References provide related information that is useful in understanding this document. Please mail corrigenda and other comments to the author(s).

I Description

Unicode 3.2 is a minor version of the Unicode Standard. It overrides certain features of Unicode 3.1, and adds a significant number of coded characters. 

Recommended Citation Format for Unicode 3.2

The Unicode Consortium. The Unicode Standard, Version 3.2.0 is defined by The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27/) and by the Unicode Standard Annex #28: Unicode 3.2 (http://www.unicode.org/reports/tr28/).

Formal Definition of Unicode 3.2

The Unicode Standard, Version 3.2.0 is defined by the following list.  The version numbering and the role of each component are explained in Versions of The Unicode Standard. The symbols in the change status column are explained in the key below. A summary of modifications in the Unicode Character Database for this version can be found in UnicodeCharacterDatabase-3.2.0.html, together with a list of which data files contain normative vs. informative data. 

Major Reference
The Unicode Consortium. The Unicode Standard, Version 3.0
Reading, MA, Addison-Wesley Developers Press, 2000. ISBN 0-201-61633-5.
Minor References
UAX #27: Unicode 3.1
UAX #28: Unicode 3.2
Update Reference
n/a
Unicode Standard Annexes
UAX #9: The Bidirectional Algorithm, V3.2.0
UAX #11: East Asian Width, V3.2.0
UAX #13: Unicode Newline Guidelines, V3.2.0
UAX #14: Line Breaking Properties, V3.2.0
UAX #15: Unicode Normalization Forms, V3.2.0
UAX #19: UTF-32, V3.2.0
UAX #21: Case Mappings, V3.2.0
Unicode Character Database
http://www.unicode.org/Public/3.2-Update, or
ftp://www.unicode.org/Public/3.2-Update/
Documentation
T DerivedProperties-3.2.0.html
T Index-3.2.0.txt
T NamesList-3.2.0.html
T PropList-3.2.0.html
T ReadMe-3.2.0.txt
T UnicodeCharacterDatabase-3.2.0.html
T UnicodeData-3.2.0.html
Core Data
D ArabicShaping-3.2.0.txt
D BidiMirroring-3.2.0.txt
D Blocks-3.2.0.txt
D CompositionExclusions-3.2.0.txt
D EastAsianWidth-3.2.0.txt
T Jamo-3.2.0.txt
D LineBreak-3.2.0.txt
D NamesList-3.2.0.txt
N NormalizationCorrections-3.2.0.txt
N PropertyAliases-3.2.0.txt
N PropertyValueAliases-3.2.0.txt
D PropList-3.2.0.txt
D Scripts-3.2.0.txt
D SpecialCasing-3.2.0.txt
N StandardizedVariants-3.2.0.html
D UnicodeData-3.2.0.txt
D Unihan-3.2.0.txt (very large file, see Unihan-3.2.0.zip)
Derived Data
D CaseFolding-3.2.0.txt
N DerivedAge-3.2.0.txt
D DerivedCoreProperties-3.2.0.txt
D DerivedNormalizationProps-3.2.0.txt
Extracted Data
D DerivedBidiClass-3.2.0.txt
D DerivedBinaryProperties-3.2.0.txt
D DerivedCombiningClass-3.2.0.txt
D DerivedDecompositionType-3.2.0.txt
D DerivedEastAsianWidth-3.2.0.txt
D DerivedGeneralCategory-3.2.0.txt
D DerivedJoiningGroup-3.2.0.txt
D DerivedJoiningType-3.2.0.txt
D DerivedLineBreak-3.2.0.txt
D DerivedNumericType-3.2.0.txt
D DerivedNumericValues-3.2.0.txt
Conformance Test Data
D NormalizationTest-3.2.0.txt

Key:

N New in this release
D Data change (possibly also format/text change)
F Data format change (possibly also text change)
T Text annotation change
- Unchanged

The list of contributory data files constituting the Unicode Standard, Version 3.2 can also be found online at Enumerated Versions.

New Character Allocations

The primary feature of Unicode 3.2 is the addition of 1016 new encoded characters. These additions consist of several Philippine scripts, a large collection of mathematical symbols, and small sets of other letters and symbols. 

All of the newly encoded characters in Unicode 3.2 are additions to the Basic Multilingual Plane (BMP). 

Complete introductions to the newly encoded scripts and symbols can be found in Article IV, Block Descriptions, below. 

Additional Features of Unicode 3.2

Unicode 3.2 also features amended contributory data files, to bring the data files up to date against the expanded repertoire of characters. A summary of the revisions to the data files can be found in Article VII, Unicode Character Database Changes.

All outstanding errata and corrigenda to the Unicode Standard are included in this specification. Major corrigenda having a bearing on conformance to the standard are listed in Article II, Conformance. Other minor errata are listed in Article VI, Errata.

Most notable among the corrigenda to the Standard is a further tightening of the definition of UTF-8, to eliminate irregular UTF-8 and to bring the Unicode specification of UTF-8 more completely into line with other specifications of UTF-8. 

The former UTR #21, Case Mappings has been upgraded in status to a Unicode Standard Annex in Unicode 3.2. This means that UAX #21, Case Mappings is now formally a part of the Unicode Standard.

Conventions Used in this Document

The sections of this document are referred to as “articles” to prevent confusion with references to sections of The Unicode Standard, Version 3.0. In addition, the articles in this document are numbered with Roman numerals, to highlight the distinction. The word “section” always refers to sections of The Unicode Standard, Version 3.0 or to a new section of the standard which is added by this document. Page numbers also refer to The Unicode Standard, Version 3.0.

New or replacement text for the standard is indicated with underlined text, when this new text is a corrigendum of an existing section of the standard.

Deleted text from the standard is indicated with struck-through text.

In instances where entire new sections or subsections are to be added to the standard, as for the block descriptions for newly encoded scripts or symbol sets, new section numbers are provided that interleave reasonably with the existing sections of the published Unicode 3.0 book. And for these added sections, the text is not underlined, since the entire sections are new.

In this document, unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of ambiguity, CE is used. Dates before the common era are labeled with BCE.

II Conformance

3.1 Conformance Requirements (revision)

Elimination of Irregular Sequences 

The definition of transformation formats such as UTF-8 allowed conformant processes to interpret certain sequences called irregular sequences. These irregular sequences are those that would be produced by transforming supplementary code points as if they were a sequence of two surrogate code points.

To tighten the definitions, in Unicode 3.2 such irregular sequences are now illegal.

Note: Some implementations of UTF-8 might still interpret irregular sequences; for those, a separate compatibility encoding scheme, to be distinguished from UTF-8, may be used. See Unicode Technical Report #26, “Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8).” However, CESU-8 is not intended nor recommended as an encoding used for open information exchange.

Terminology to distinguish ill-formed, illegal, and irregular code unit sequences is no longer needed. There are no irregular code unit sequences, and thus all ill-formed code unit sequences are illegal. It is illegal to emit or interpret any ill-formed code unit sequence. Unicode 4.0 will revise the terminology and conformance clauses in light of this. For Unicode 3.2, only the minimal changes required of the text are noted here.

Change C12 in Unicode 3.1 to:

C12 (a) When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed code unit sequences.
(b) When a process interprets data in a Unicode Transformation Format, it shall treat ill-formed code unit sequences as an error condition.
(c) A conformant process shall not interpret ill-formed UTF code unit sequences as characters.
(d) Irregular UTF code unit sequences shall not be used for encoding any other information.

Change the fifth note after C12 in Unicode 3.1 to:

Change Table 3.1B after C12 in Unicode 3.1 by splitting the row U+1000..U+FFFF to exclude the surrogate code points:

Table 3.1B. Legal UTF-8 Byte Sequences

Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+D800..U+DFFF       ill-formed
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
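
As an illustration only, the rows of Table 3.1B can be transcribed into a small table-driven check. The following Python sketch is not part of the standard; the names `_ROWS` and `is_well_formed_utf8` are hypothetical, and the sketch reports only whether a byte string is well-formed, not where an error occurs.

```python
# Each row: (first-byte range, second-byte range or None, count of further 80..BF bytes).
# Transcribed from Table 3.1B; the surrogate row U+D800..U+DFFF has no entry because
# it is ill-formed. All names here are illustrative, not defined by the standard.
_ROWS = [
    ((0x00, 0x7F), None, 0),
    ((0xC2, 0xDF), (0x80, 0xBF), 0),
    ((0xE0, 0xE0), (0xA0, 0xBF), 1),
    ((0xE1, 0xEC), (0x80, 0xBF), 1),
    ((0xED, 0xED), (0x80, 0x9F), 1),
    ((0xEE, 0xEF), (0x80, 0xBF), 1),
    ((0xF0, 0xF0), (0x90, 0xBF), 2),
    ((0xF1, 0xF3), (0x80, 0xBF), 2),
    ((0xF4, 0xF4), (0x80, 0x8F), 2),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data matches a row of Table 3.1B."""
    i, n = 0, len(data)
    while i < n:
        first = data[i]
        for (lo, hi), second, extra in _ROWS:
            if lo <= first <= hi:
                break
        else:
            return False  # first byte matches no row (C0, C1, F5..FF, or a stray 80..BF)
        i += 1
        if second is None:
            continue  # single-byte sequence
        if i >= n or not (second[0] <= data[i] <= second[1]):
            return False  # second byte missing or outside the range for this row
        i += 1
        for _ in range(extra):
            if i >= n or not (0x80 <= data[i] <= 0xBF):
                return False  # trailing byte missing or outside 80..BF
            i += 1
    return True
```

Under this check, the six-byte form ED A0 80 ED B0 80 (a supplementary code point transformed as two surrogate code points) is rejected, because a first byte of ED admits only 80..9F as the second byte, while the four-byte form F0 90 80 80 for U+10000 is accepted.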

3.6 Decomposition (revision)

The text of D21 is replaced by the following text:

D21 Compatibility decomposable character: a character whose compatibility decomposition is not identical to its canonical decomposition. It may also be known as a compatibility precomposed character or a compatibility composite character.

Add the following new text after D23:

D23a Canonical decomposable character: a character which is not identical to its canonical decomposition. It may also be known as a canonical precomposed character or a canonical composite character.
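
The two definitions can be tested mechanically. The following Python sketch is illustrative only and assumes that the full canonical and compatibility decompositions are obtained from the standard library function `unicodedata.normalize` with the forms NFD and NFKD. Note that this function uses the Unicode version bundled with the Python interpreter, which is generally later than Unicode 3.2; the function names are hypothetical.

```python
import unicodedata


def is_canonical_decomposable(ch: str) -> bool:
    """D23a: the character is not identical to its canonical decomposition."""
    return unicodedata.normalize("NFD", ch) != ch


def is_compatibility_decomposable(ch: str) -> bool:
    """D21: the compatibility decomposition differs from the canonical decomposition."""
    return unicodedata.normalize("NFKD", ch) != unicodedata.normalize("NFD", ch)
```

For example, U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE satisfies only the first test, U+FB01 LATIN SMALL LIGATURE FI satisfies only the second, and U+0061 LATIN SMALL LETTER A satisfies neither.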

3.9 Special Character Properties (revision)

Replacing ZWNBSP with Word Joiner

The character U+2060 has been added to the standard to allow unambiguous expression of the word-joining semantics. U+2060 WORD JOINER is now the preferred character to express the word-joining semantics implied by the ZWNBSP. The availability of U+2060 makes it unnecessary to use U+FEFF as a zero-width non-breaking space, allowing U+FEFF to be used solely with the semantic of BOM. For more information, see the subsection on “Word Joiner” in Section 13.2, Layout Controls in this document.

Note: Implementers are strongly encouraged to use the word joiner whenever word-joining semantics is intended.
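
One possible migration step is sketched below in Python. It rests on an assumption that is not made by this document: that a U+FEFF at the very start of the text is a BOM, and that every other U+FEFF was intended as a zero-width non-breaking space. The function name `prefer_word_joiner` is hypothetical.

```python
BOM_OR_ZWNBSP = "\uFEFF"
WORD_JOINER = "\u2060"


def prefer_word_joiner(text: str) -> str:
    """Replace interior U+FEFF with U+2060, leaving a leading U+FEFF (assumed BOM) alone."""
    if not text:
        return text
    head, rest = text[0], text[1:]
    return head + rest.replace(BOM_OR_ZWNBSP, WORD_JOINER)
```

A process that cannot tell whether an interior U+FEFF carries BOM semantics (for example, after concatenating files) would need a different policy.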

Additions to Properties

A number of characters which have special character properties have been added in the Unicode Standard, Version 3.2. To reflect this, the following changes are made to the special character properties listing, on pages 48-50 of The Unicode Standard, Version 3.0:

In the entry for “Line boundary control”, add:

205F MEDIUM MATHEMATICAL SPACE
2060 WORD JOINER

Change the name of the “Joining” entry to “Cursive joining and ligation control”.

Add a new entry called “Grapheme joining” after the renamed entry for “Cursive joining and ligation control” and add to that new entry:

034F COMBINING GRAPHEME JOINER

Add a new entry called “Mathematical expression formatting” after the entry “Bidirectional ordering” and add to that new entry:

2061 FUNCTION APPLICATION
2062 INVISIBLE TIMES
2063 INVISIBLE SEPARATOR

Change the name of the “Alternate formatting” entry to “Deprecated alternate formatting”.

Change the name of the “Syriac abbreviation” entry to “Prefixed format control” and add to that entry:

06DD ARABIC END OF AYAH

Change the name of the “Indic dead-character formation” entry to “Brahmi-derived script dead-character formation” and add to that entry:

1714 TAGALOG SIGN VIRAMA
1734 HANUNOO SIGN PAMUDPOD

Change the name of the “Mongolian variant selectors” entry to “Mongolian variation selectors”.

After the “Mongolian variation selectors” entry add a new entry “Generic variation selectors” and add to that new entry:

FE00 VARIATION SELECTOR-1
FE01 VARIATION SELECTOR-2
FE02 VARIATION SELECTOR-3
FE03 VARIATION SELECTOR-4
FE04 VARIATION SELECTOR-5
FE05 VARIATION SELECTOR-6
FE06 VARIATION SELECTOR-7
FE07 VARIATION SELECTOR-8
FE08 VARIATION SELECTOR-9
FE09 VARIATION SELECTOR-10
FE0A VARIATION SELECTOR-11
FE0B VARIATION SELECTOR-12
FE0C VARIATION SELECTOR-13
FE0D VARIATION SELECTOR-14
FE0E VARIATION SELECTOR-15
FE0F VARIATION SELECTOR-16

Application of Combining Marks

Formally speaking, combining marks apply to the preceding grapheme cluster. In most cases, this is the same as applying to the preceding base character. However, in two circumstances there is a difference:

Hangul Syllables. Where a grapheme cluster contains a Hangul syllable, the combining mark applies to the entire syllable. For example, in the following sequence the grave is applied to the entire Hangul syllable, not just the last jamo:

Enclosing Combining Marks. These marks enclose the entire preceding grapheme cluster. For example, in the following sequence the entire Hangul syllable is circled, not just part of it:

This is also true of grapheme clusters composed of elements linked by a Grapheme_Link or combining grapheme joiner. For example, the entire conjunct is circled in the following sequence:

On the other hand, where elements are linked by a Grapheme_Link or combining grapheme joiner, non-enclosing combining marks only apply to the last base character. For example, in the following sequence the nukta applies to the immediately preceding ddha, not to the entire cluster:

For more information, see the subsection on “Combining Grapheme Joiner” in Section 13.2, Layout Controls in this document.

3.11 Conjoining Jamo Behavior (revision)

The following text replaces the text and tables for this section on pages 52-53 of The Unicode Standard, Version 3.0:

The Unicode Standard contains both a large set of precomposed modern Hangul syllables and a set of conjoining Hangul jamo, which can be used to encode archaic syllable blocks as well as modern syllable blocks. This section describes how to:

For more information, see the “Hangul Syllables” and “Hangul Jamo” subsections in Section 10.4, Hangul in The Unicode Standard, Version 3.0. Hangul syllables are a special case of grapheme clusters.

The jamo characters can be classified into three sets of characters: choseong (leading consonants, or syllable-initial characters), jungseong (vowels, or syllable-peak characters), and jongseong (trailing consonants, or syllable-final characters). In the following discussion, these jamo are abbreviated as L (leading consonant), V (vowel), and T (trailing consonant); syllable breaks are shown by middle dots “·”; non-syllable breaks are shown by “×”, combining marks are shown by M, and non-jamo are shown by X.

In the following discussion, a syllable refers to a sequence of Korean characters that should be grouped into a single cell for display. This is different from a precomposed Hangul syllable, which consists of any of the characters in the range U+AC00..U+D7A3. Note that a syllable may contain a precomposed Hangul syllable plus other characters.

Syllable Boundaries

In rendering, a sequence of jamos is displayed as a series of syllable blocks. The following rules specify how to divide up an arbitrary sequence of jamos (including nonstandard sequences) into these syllable blocks. In these rules, a choseong filler (Lf) is treated as a choseong character, and a jungseong filler (Vf) is treated as a jungseong.

The precomposed Hangul syllables are of two types: LV or LVT. In determining the syllable boundaries, the LV behave as if they were a sequence of jamo L V, and the LVT behave as if they were a sequence of jamo L V T.

Within any sequence of characters, a syllable break never occurs between the pairs of characters shown in Table 3-5. In all other cases, there is a syllable break before and after any jamo or precomposed Hangul syllable. Note that like other characters, any combining mark between two conjoining jamos prevents the jamos from forming a syllable.

Table 3-5. Hangul Syllable No-Break Rules

Do Not Break Between                                                          Examples
L                                     L, V, or precomposed Hangul syllable    L × L, L × V, L × LV, L × LVT
V or LV                               V or T                                  V × V, V × T, LV × V, LV × T
T or LVT                              T                                       T × T, LVT × T
Jamo or precomposed Hangul syllable   Combining marks                         L × M, V × M, T × M, LV × M, LVT × M

Note that even in normalization form NFC, a syllable may contain a precomposed Hangul syllable in the middle. An example is “L LVT T”. Each well-formed modern Hangul syllable, however, can be represented in the form L V T? (that is one L, one V and optionally one T), and is a single character in NFC.
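
The no-break rules of Table 3-5 can be sketched as a pairwise test. The following Python fragment is illustrative only. It assumes the text has already been classified into the symbols used in this section ("L", "V", "T", "LV", "LVT", "M", "X", plus "Lf" and "Vf" for the fillers), and the function names are hypothetical. Because the rules only place breaks before and after jamo or precomposed Hangul syllables, the sketch reports no break between two symbols that are both M or X.

```python
HANGUL = {"L", "V", "T", "LV", "LVT"}
_FILLER_KIND = {"Lf": "L", "Vf": "V"}  # fillers behave like L and V


def no_break(a: str, b: str) -> bool:
    """True if Table 3-5 forbids a syllable break between symbols a and b."""
    a, b = _FILLER_KIND.get(a, a), _FILLER_KIND.get(b, b)
    if a == "L" and b in {"L", "V", "LV", "LVT"}:
        return True
    if a in {"V", "LV"} and b in {"V", "T"}:
        return True
    if a in {"T", "LVT"} and b == "T":
        return True
    return a in HANGUL and b == "M"


def split_syllables(symbols):
    """Group a list of symbols into syllables according to Table 3-5."""
    groups = []
    for sym in symbols:
        if groups:
            prev = groups[-1][-1]
            prev_kind = _FILLER_KIND.get(prev, prev)
            kind = _FILLER_KIND.get(sym, sym)
            neither_hangul = prev_kind not in HANGUL and kind not in HANGUL
            if no_break(prev, sym) or neither_hangul:
                groups[-1].append(sym)
                continue
        groups.append([sym])
    return groups
```

With this sketch, the sequence L LVT T stays together as one syllable, while L M V is split after the combining mark, since a combining mark between two jamos prevents them from forming a syllable.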

For information on the behavior of Hangul compatibility jamo in syllables, see Section 10.4, Hangul in The Unicode Standard, Version 3.0.

Standard Korean Syllables

A standard Korean syllable block is composed of a sequence of one or more L followed by a sequence of one or more V and a sequence of zero or more T. A sequence of nonstandard syllable blocks can be transformed into a sequence of standard Korean syllable blocks by inserting choseong fillers (Lf) and jungseong fillers (Vf).

Using regular expression notation, a standard Korean syllable is thus of the form:

L+ V+ T*

The transformation of a string of text into standard Korean syllables is performed by determining the syllable breaks as explained in the subsection on “Syllable Boundaries” earlier in this section, then inserting one or two fillers as necessary to transform each syllable into a standard Korean syllable. Thus:

L ^V → L Vf ^V
^L V → ^L Lf V
^V T → ^V Lf Vf T

where ^X indicates a character that is not X, or the absence of a character.
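
The three filler rules can be sketched as follows. This Python fragment is illustrative only and makes two assumptions: the input is a single syllable already delimited by the syllable boundary rules, and it consists only of the symbols "L", "V", and "T" (with "Lf" and "Vf" accepted as fillers already present). The function name `add_fillers` is hypothetical.

```python
def add_fillers(syllable):
    """Insert Lf and Vf so that one syllable takes the form L+ V+ T*."""
    kinds = {"Lf": "L", "Vf": "V"}
    out = []
    prev = None  # kind of the preceding symbol; None at the start of the syllable
    for sym in syllable:
        kind = kinds.get(sym, sym)
        if kind == "V" and prev is None:
            out.append("Lf")              # ^L V  ->  ^L Lf V
        elif kind == "T" and prev is None:
            out.extend(["Lf", "Vf"])      # ^V T  ->  ^V Lf Vf T
        elif kind == "T" and prev == "L":
            out.append("Vf")              # L ^V  ->  L Vf ^V
        out.append(sym)
        prev = kind
    if prev == "L":
        out.append("Vf")                  # L followed by the absence of a character
    return out
```

For example, the syllable T T becomes Lf Vf T T, and the syllable L L becomes L L Vf.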

Examples. In Table 3-6, the first row shows syllable breaks in a standard sequence, the second row shows syllable breaks in a nonstandard sequence, and the third row shows how the sequence in the second row could be transformed into standard form by inserting fillers into each syllable.

Table 3-6. Syllable Break Examples

No.  Sequence                Sequence with Syllable Breaks Marked
1    LVTLVLVLVfLfVLfVfT  →   LVT · LV · LV · LVf · LfV · LfVfT
2    LLTTVVTTVVLLVV      →   LL · TT · VVTT · VV · LLVV
3    LLTTVVTTVVLLVV      →   LLVf · LfVfTT · LfVVTT · LfVV · LLVV

4.2 Combining Classes—Normative (revision)

Remove the entry for U+06DD ARABIC END OF AYAH from Table 4-3, Combining Classes on page 80 of The Unicode Standard, Version 3.0.

Unicode Standard Annex #15, “Unicode Normalization Forms” (revision)

In Corrigendum #3 the canonical mapping for U+F951 has been corrected. For more information, see Unicode Standard Annex #15, “Unicode Normalization Forms”.

III General Structure and Guidelines

2.2 Unicode Design Principles (addition)

Add the following text to page 18 of The Unicode Standard, Version 3.0 just before the subsection on “Convertibility”:

Decompositions

Precomposed characters are formally known as decomposables, because they have decompositions to one or more other characters. There are two types of decompositions:

Thus there are three types of characters, based on their decomposition behavior:

The following figure illustrates these three types. The solid arrows indicate canonical decompositions, and the dotted arrows indicate compatibility decompositions. If an arrow loops back and points to the character itself, that indicates that there is no decomposition of that type (other than in the trivial sense of a character “decomposing” to itself).

The figure illustrates two important things to keep in mind:

For more precise definitions of some of these terms, see Chapter 3, Conformance in The Unicode Standard, Version 3.0.

[Figure: examples of Nondecomposables, Canonical Decomposables, and Compatibility Decomposables]

5.15 Locating Text Element Boundaries (revision)

Add the following text after bullet item 6 on page 125 of The Unicode Standard, Version 3.0:

The rules are applied in order. That is, there is an implicit “otherwise” at the front of each rule following the first. It is possible to construct alternate sets of such rules that are fully equivalent; that is, they have the same effect.

Note: The rules for default grapheme cluster boundaries, default word boundaries and default sentence boundaries are in the process of being superseded by a new Unicode Technical Report #29, Text Boundaries.

IV Block Descriptions

Note: The numbering used here for block descriptions and revised text follows The Unicode Standard, Version 3.0 for ease of cross-reference.

6.1 General Punctuation (addition)

Invisible Operators. In mathematics some operators or punctuation are often implied, but not displayed. U+2063 INVISIBLE SEPARATOR or invisible comma is intended for use in index expressions and other mathematical notation where two adjacent variables form a list and are not implicitly multiplied. In mathematical notation, commas are not always explicitly present, but need to be indicated for symbolic calculation software to help it disambiguate a sequence from a multiplication. For example, the double ij subscript in the variable a_ij means a_(i, j); that is, the i and j are separate indices and not a single variable with the name ij or even the product of i and j. Accordingly, to represent the implied list separation in the subscript ij one can insert a nondisplaying invisible separator between the i and the j. In addition, use of the invisible comma would hint to a math layout program to typeset a small space between the variables.

Similarly, an expression like mc² implies that the mass m multiplies the square of the speed c. To represent the implied multiplication in mc², one inserts a nondisplaying U+2062 INVISIBLE TIMES between the m and the c. A related case is the use of U+2061 FUNCTION APPLICATION for an implied function dependence as in f(x + y). To indicate that this is the function f of the quantity x + y and not the expression fx + fy, one can insert the nondisplaying function application symbol between the f and the left parenthesis.

Another example is the expression f ij(cos(ab)), which means the same as fij(cos(a×b)), where × represents multiplication, not the cross product. Note that the spacing between characters may also depend on whether the adjacent variables are part of a list or are to be concatenated, that is, multiplied.
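
The placement of the three invisible operators can be illustrated with plain strings. The following Python sketch is illustrative only. The code points follow the “Mathematical expression formatting” entry in Article II (U+2061 FUNCTION APPLICATION, U+2062 INVISIBLE TIMES, U+2063 INVISIBLE SEPARATOR); the variable names and the helper `visible_text` are hypothetical.

```python
import unicodedata

FUNCTION_APPLICATION = "\u2061"
INVISIBLE_TIMES = "\u2062"
INVISIBLE_SEPARATOR = "\u2063"

# m times c squared: the operator sits between the m and the c
mc2 = "m" + INVISIBLE_TIMES + "c\u00B2"
# f applied to (x + y): the operator sits between the f and the left parenthesis
f_of_sum = "f" + FUNCTION_APPLICATION + "(x + y)"
# the index list i, j: the separator sits between the i and the j
index_ij = "i" + INVISIBLE_SEPARATOR + "j"


def visible_text(s: str) -> str:
    """Drop the three invisible operators, leaving what a reader would see."""
    invisibles = {FUNCTION_APPLICATION, INVISIBLE_TIMES, INVISIBLE_SEPARATOR}
    return "".join(ch for ch in s if ch not in invisibles)


for ch in (FUNCTION_APPLICATION, INVISIBLE_TIMES, INVISIBLE_SEPARATOR):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The strings display the same as their unmarked counterparts, but software that parses them can distinguish a product from a list or a function application.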

A more complete discussion of mathematical notation can be found in Proposed Draft Unicode Technical Report #25, “Unicode Support for Mathematics.”

Commercial Minus. U+2052 COMMERCIAL MINUS SIGN is used in commercial or tax related forms or publications in several European countries, including Germany and Scandinavia. The string “./.” appears to be used as a fallback representation for this character.

The symbol may also appear as a marginal note in letters, denoting enclosures. One variation replaces the top dot with a digit indicating the number of enclosures.

An additional usage of the sign appears in the Finno-Ugric Phonetic Alphabet (FUPA), where it marks a structurally-related borrowed element of different pronunciation. In Finland and a number of other European countries, the dingbats U+2052 and U+2713 are used for “correct” and “incorrect” respectively in marking a student’s paper. This contrasts with American practice, for example, where U+2713 and U+2717 can be used for “correct” and “incorrect” respectively in the same context.

CJK Symbols and Punctuation: U+3000–U+303F (update and addition)

On page 155 of The Unicode Standard, Version 3.0 update the first full paragraph as follows:

This block encodes punctuation marks and symbols primarily used by writing systems that employ Han ideographs. Most of these characters are found in East Asian standards.

Add a new paragraph on page 155 of The Unicode Standard, Version 3.0 to follow the paragraph on U+3006:

U+3008, U+3009 angle brackets are unambiguously wide. The Unicode Standard encodes different characters for use in other contexts, such as mathematics. There are other characters in this block that have the same characteristics, including double angle brackets, tortoise shell brackets, and white square brackets.

7.2 Greek (revision)

Representative Glyphs for Greek Phi

With Unicode 3.0 and the concurrent second edition of ISO/IEC 10646-1, the representative glyphs for U+03C6 GREEK SMALL LETTER PHI and U+03D5 GREEK PHI SYMBOL were swapped. In ordinary Greek text, the character U+03C6 is used exclusively, although this character has considerable glyphic variation, sometimes represented with a glyph more like the representative glyph shown for U+03C6 (the “loopy” form) and less often with a glyph more like the representative glyph shown for U+03D5 (the “straight” form).

For mathematical and technical use, the straight form of the small phi is an important symbol and needs to be consistently distinguishable from the loopy form. The straight form phi glyph is used as the representative glyph for the symbol phi at U+03D5 to satisfy this distinction.

The reversed assignment of representative glyphs in versions of the Unicode Standard prior to Unicode 3.0 had the problem that the character explicitly identified as the mathematical symbol did not have the straight form of the character that is the preferred glyph for that use. Furthermore, it made it unnecessarily difficult for general purpose fonts supporting ordinary Greek text to also add support for Greek letters used as mathematical symbols. This resulted from the fact that many of those fonts already used the loopy form glyph for U+03C6, as preferred for Greek body text; to support the phi symbol as well, they would have had to disrupt glyph choices already optimized for Greek text.

When mapping symbol sets or SGML entities to the Unicode Standard, it is important to make sure that codes or entities that require the straight form of the phi symbol be mapped to U+03D5 and not to U+03C6. Mapping to the latter should be reserved for codes or entities that represent the small phi as used in ordinary Greek text.

Fonts used primarily for Greek text may use either glyph form for U+03C6, but fonts that also intend to support technical use of the Greek letters should use the loopy form to ensure appropriate contrast with the straight form used for U+03D5.

8.2 Arabic (addition)

End of Ayah. U+06DD ARABIC END OF AYAH graphically encloses a sequence of zero or more digits (of General Category Nd) that follow it in the data stream. The enclosure terminates with any non-digit. For behavior of a similar prefixed formatting control, see the discussion of the Syriac Abbreviation Mark in Section 8.3, Syriac in The Unicode Standard, Version 3.0.
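
The extent of the enclosure can be sketched as a scan over the characters that follow U+06DD. This Python fragment is illustrative only. It assumes the General Category is taken from the standard library function `unicodedata.category`, which reflects the Unicode version bundled with the interpreter; the function name `enclosed_digits` is hypothetical.

```python
import unicodedata

END_OF_AYAH = "\u06DD"


def enclosed_digits(text: str, pos: int) -> str:
    """Return the run of Nd digits enclosed by the U+06DD found at text[pos]."""
    if text[pos] != END_OF_AYAH:
        raise ValueError("no U+06DD ARABIC END OF AYAH at this position")
    end = pos + 1
    # The enclosure terminates with any non-digit (or the end of the text).
    while end < len(text) and unicodedata.category(text[end]) == "Nd":
        end += 1
    return text[pos + 1:end]
```

The result may be empty, since the enclosed sequence consists of zero or more digits.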

9.15 Khmer (addition)

Characters Whose Use is Discouraged. The use of the following characters is discouraged; they are being considered for possible deprecation in a future version of the Standard. These characters should be avoided in the normal representation of Khmer text:

17A3 KHMER INDEPENDENT VOWEL QAQ
17A4 KHMER INDEPENDENT VOWEL QAA
17B4 KHMER VOWEL INHERENT AQ
17B5 KHMER VOWEL INHERENT AA
17D3 KHMER SIGN BATHAMASAT
17D8 KHMER SIGN BEYYAL

For transliteration of Pali/Sanskrit, U+17A2 KHMER LETTER QA is recommended instead of U+17A3 KHMER INDEPENDENT VOWEL QAQ, and the sequence <U+17A2 KHMER LETTER QA, U+17B6 KHMER VOWEL SIGN AA> is recommended instead of U+17A4 KHMER INDEPENDENT VOWEL QAA.

The use of U+17D3 KHMER SIGN BATHAMASAT is not recommended for representation of Khmer lunar dates; a separate proposal for the full representation of Khmer lunar dates is under development.

U+17D8 KHMER SIGN BEYYAL is not recommended for representing the Khmer word meaning “etc.”; the word should be spelled out with a sequence of signs and letters instead.
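
A process that checks Khmer text for the discouraged characters might be sketched as follows. This Python fragment is illustrative only. The two replacements are those recommended above for transliteration of Pali/Sanskrit; no replacement is given in this document for the other four characters, so the sketch only reports them. All names are hypothetical.

```python
# Replacements recommended above for transliteration of Pali/Sanskrit.
REPLACEMENTS = {
    "\u17A3": "\u17A2",        # INDEPENDENT VOWEL QAQ -> LETTER QA
    "\u17A4": "\u17A2\u17B6",  # INDEPENDENT VOWEL QAA -> <LETTER QA, VOWEL SIGN AA>
}

# The six characters whose use is discouraged.
DISCOURAGED = set("\u17A3\u17A4\u17B4\u17B5\u17D3\u17D8")


def find_discouraged(text: str):
    """Return (index, character) pairs for every discouraged character in text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in DISCOURAGED]


def replace_discouraged_vowels(text: str) -> str:
    """Apply the two recommended replacements; other characters are left unchanged."""
    return "".join(REPLACEMENTS.get(ch, ch) for ch in text)
```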

Combined Vowels. The Khmer language uses two dependent vowel signs whose Unicode representation consists of a sequence of two code points. These are khmer vowel sign srak om, represented by the sequence <U+17BB KHMER VOWEL SIGN U, U+17C6 KHMER SIGN NIKAHIT> and khmer vowel sign srak aam, represented by the sequence <U+17B6 KHMER VOWEL SIGN AA, U+17C6 KHMER SIGN NIKAHIT>. The nikahit represents the final nasalization of the vowel, shown by the “m” in the transliteration. These dependent vowels are treated as units, for the purposes of enumeration of the “letters” of Khmer, and most importantly for collation. Having these vowels represented by a sequence of two Unicode code points may be unexpected for Khmer implementers. It is important, therefore, to ensure that these sequences are treated as units when implementing Khmer.

Subscript Letters. The Unicode encoding of the Khmer script uses an independent (and invisible) coeng sign to indicate that the following consonant is subscripted, by analogy with the virama model employed for representing conjuncts in Indian scripts. Subscripted independent vowels are encoded in the same manner. This approach uses an artificial coeng sign character which does not exist as a letter or sign in the Khmer script, and therefore departs from the ordinary way that Khmer is conceived of and taught to native Khmer speakers. Consequently, the encoding may not be intuitive to a native user of the Khmer writing system. Ordinarily, the units such as khmer consonant coeng ka are conceived of as independent and unitary subscript letters, rather than as a result of conjunct formation.

To aid Khmer script users, a full listing of all the Khmer subscript letters has been provided in the table, “Additional Khmer Character Names”, together with appropriate names for them which follow preferred Khmer practice. While the Unicode encoding represents both the subscripts and the combined vowel letters with a pair of code points, they must be treated as a unit for most processing purposes. In other words they must function as if they had been encoded as a single character. The combined vowel characters are also included in this list, and should also be treated as a unit in processing.
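
Treating the two-code-point sequences as units can be sketched as a simple scan. This Python fragment is illustrative only. It assumes that a subscript is formed by U+17D2 followed by any character in the range U+1780..U+17B3 (Khmer consonants and independent vowels), which is broader than the set of subscripts actually listed in the chart, and it handles only the two combined vowels named above. The function name `khmer_units` is hypothetical.

```python
COENG = "\u17D2"
NIKAHIT = "\u17C6"
COMBINED_VOWEL_FIRST = {"\u17BB", "\u17B6"}  # VOWEL SIGN U, VOWEL SIGN AA


def khmer_units(text: str):
    """Split text into units, keeping subscripts and combined vowels together."""
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == COENG and "\u1780" <= nxt <= "\u17B3":
            units.append(ch + nxt)  # subscript letter: <coeng, letter>
            i += 2
        elif ch in COMBINED_VOWEL_FIRST and nxt == NIKAHIT:
            units.append(ch + nxt)  # combined vowel: <vowel sign, nikahit>
            i += 2
        else:
            units.append(ch)
            i += 1
    return units
```

Operations such as enumeration or collation would then work on the returned units rather than on individual code points.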

A full Khmer script chart is also provided, showing all of the Khmer characters preferred for modern Khmer usage, including the subscripts and combined vowels. This chart is better for didactic purposes in representing the Khmer script and its Unicode encoding. By contrast, the main Unicode code chart does not reflect the modern reading rules for Khmer, and thereby can give a misleading picture of the structure of the script.

Khmer Script Chart

Consonants
1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 178A 178B
178C 178D 178E 178F 1790 1791 1792 1793 1794 1795 1796 1797
1798 1799 179A 179B 179C 179D 179E 179F 17A0 17A1 17A2

Independent Vowels
17A5 17A6 17A7 17A9 17AA 17AB 17AC 17AD 17AE 17AF 17B0 17B1 17B3

Dependent Vowel Signs
17B6 17B7 17B8 17B9 17BA 17BB 17BC 17BD 17BE 17BF 17C0 17C1
17C2 17C3 17C4 17C5 <17BB 17C6> 17C6 <17B6 17C6> 17C7

Subscript Characters
<17D2 1780> <17D2 1781> <17D2 1782> <17D2 1783> <17D2 1784> <17D2 1785>
<17D2 1786> <17D2 1787> <17D2 1788> <17D2 1789> <17D2 178A> <17D2 178B>
<17D2 178C> <17D2 178D> <17D2 178E> <17D2 178F> <17D2 1790> <17D2 1791>
<17D2 1792> <17D2 1793> <17D2 1794> <17D2 1795> <17D2 1796> <17D2 1797>
<17D2 1798> <17D2 1799> <17D2 179A> <17D2 179B> <17D2 179C> <17D2 179D>
<17D2 179E> <17D2 179F> <17D2 17A0> <17D2 17A2> <17D2 17A7> <17D2 17AB>
<17D2 17AF>

Various Signs
17C8 17CB 17CC 17CD 17CE 17CF 17D0 17D1 17D4 17D5 17D6 17D7
17D9 17DA 17DC 17DB 17C9 17CA

Digits
17E0 17E1 17E2 17E3 17E4 17E5 17E6 17E7 17E8 17E9

Additional Khmer Character Names
Code        Name
17BB 17C6   khmer vowel sign srak om
17B6 17C6   khmer vowel sign srak am
17D2 1780   khmer consonant sign coeng ka
17D2 1781   khmer consonant sign coeng kha
17D2 1782   khmer consonant sign coeng ko
17D2 1783   khmer consonant sign coeng kho
17D2 1784   khmer consonant sign coeng ngo
17D2 1785   khmer consonant sign coeng ca
17D2 1786   khmer consonant sign coeng cha
17D2 1787   khmer consonant sign coeng co
17D2 1788   khmer consonant sign coeng cho
17D2 1789   khmer consonant sign coeng nyo
17D2 178A   khmer consonant sign coeng da
17D2 178B   khmer consonant sign coeng ttha
17D2 178C   khmer consonant sign coeng do
17D2 178D   khmer consonant sign coeng ttho
17D2 178E   khmer consonant sign coeng na
17D2 178F   khmer consonant sign coeng ta
17D2 1790   khmer consonant sign coeng tha
17D2 1791   khmer consonant sign coeng to
17D2 1792   khmer consonant sign coeng tho
17D2 1793   khmer consonant sign coeng no
17D2 1794   khmer consonant sign coeng ba
17D2 1795   khmer consonant sign coeng pha
17D2 1796   khmer consonant sign coeng po
17D2 1797   khmer consonant sign coeng pho
17D2 1798   khmer consonant sign coeng mo
17D2 1799   khmer consonant sign coeng yo
17D2 179A   khmer consonant sign coeng ro
17D2 179B   khmer consonant sign coeng lo
17D2 179C   khmer consonant sign coeng vo
17D2 179D   khmer consonant sign coeng sha
17D2 179E   khmer consonant sign coeng ssa
17D2 179F   khmer consonant sign coeng sa
17D2 17A0   khmer consonant sign coeng ha
17D2 17A2   khmer consonant sign coeng qa
17D2 17A7   khmer vowel sign coeng qu
17D2 17AB   khmer vowel sign coeng ry
17D2 17AF   khmer vowel sign coeng qe

 

9.16 Philippine Scripts (new section) 

Tagalog: U+1700..U+171F
Hanunóo: U+1720..U+173F
Buhid: U+1740..U+175F
Tagbanwa: U+1760..U+177F
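
The four block ranges above can be captured in a small lookup. This is an illustrative sketch only; the names `PHILIPPINE_BLOCKS` and `philippine_block` are hypothetical and not part of any Unicode data file or library.

```python
# Block ranges as listed above (inclusive).
PHILIPPINE_BLOCKS = [
    ("Tagalog", 0x1700, 0x171F),
    ("Hanunóo", 0x1720, 0x173F),
    ("Buhid", 0x1740, 0x175F),
    ("Tagbanwa", 0x1760, 0x177F),
]


def philippine_block(ch):
    """Return the name of the Philippine script block containing ch, or None."""
    cp = ord(ch)
    for name, first, last in PHILIPPINE_BLOCKS:
        if first <= cp <= last:
            return name
    return None
```

Note that a block lookup of this kind identifies only where a character is encoded, not which scripts use it.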

The first of these four scripts, Tagalog, is no longer used, although the other three, Hanunóo, Buhid, and Tagbanwa, are living scripts of the Philippines. South Indian scripts of the Pallava dynasty made their way to the Philippines, although the exact route is uncertain. They may have been transported by way of the Kavi scripts of Western Java between the 10th and 14th centuries CE. 

There are written accounts of the Tagalog script by Spanish missionaries, and documents in Tagalog dating from the mid-1500s. The first book in this script was printed in Manila in 1593. While the Tagalog script was used to write Tagalog, Bisaya, Ilocano, and other languages, it fell out of normal use by the mid-1700s; modern Tagalog language is now written in the Latin script. 

The three living scripts, Hanunóo, Buhid, and Tagbanwa, are related to Tagalog, but may not be directly descended from it. The Hanunóo and the Buhid peoples live in Mindoro, while the Tagbanwa live in Palawan. Hanunóo enjoys the most use; it is widely used to write love poetry, a popular pastime among the Hanunóo. Tagbanwa is less used.

Principles of the Scripts

The Philippine scripts share features with the other Brahmi-derived scripts to which they are related.

Consonant Letters. Philippine scripts have consonants containing an inherent -a vowel, which may be modified by the addition of vowel signs or canceled (killed) by the use of a virama-type mark.

Independent Vowel Letters. Philippine scripts have null consonants which are used to write syllables that start with a vowel.

Dependent Vowel Signs. The vowel -i is written with a mark above the associated consonant, and the vowel -u with an identical mark below. The mark is known in Tagalog as kudlit “diacritic,” tuldik “accent,” or tuldok “dot,” and in Tagbanwa as ulitan “diacritic.” The Philippine scripts employ only the two vowel signs i and u, which are also used to stand for the vowels e and o respectively.

Virama. Though all languages normally written with the Philippine scripts have syllables ending in consonants, not all of the scripts have a mechanism for expressing the canceled -a. As a result, in those orthographies, the final consonants are unexpressed. Francisco Lopez introduced a cross-shaped virama in his 1620 catechism in the Ilocano language, but this innovation did not seem to find favor with native users, who seem to have considered the script adequate without it (they preferred kakapi to kakampi). A similar reform for the Hanunóo script seems to have been better received. The Hanunóo pamudpod was devised by Antoon Postma, who went to the Philippines from the Netherlands in the mid-1950s. In traditional orthography, si apu ba upada is, with the pamudpod, rendered more accurately as si aypud bay upadan; the Hanunóo pronunciation is si aypod bay upadan. The Tagalog virama and Hanunóo pamudpod cancel only the inherent -a. No conjunct consonants are employed in the Philippine scripts.

Directionality. The Philippine scripts are read from left to right in horizontal lines running from top to bottom. They may be written or carved either in that manner, or in vertical lines running from bottom to top, moving from left to right. In the latter case, the letters are written sideways so they may be read horizontally. This method of writing is probably due to the medium and writing implements used. Text is often scratched with a sharp instrument onto beaten strips of bamboo which are held pointing away from the body and worked from the proximal to distal ends, in columns from left to right.

Rendering. In Tagalog and Tagbanwa, the vowel signs simply rest over or under the consonants. In Hanunóo and Buhid, however, special ligatures are often formed as shown in the following tables.

[Tables: Hanunóo ligatures; Buhid ligatures]

Punctuation. Punctuation has been unified for the Philippine scripts. In the Hanunóo block, U+1735 PHILIPPINE SINGLE PUNCTUATION and U+1736 PHILIPPINE DOUBLE PUNCTUATION are encoded. Tagalog makes use only of the latter; Hanunóo, Buhid, and Tagbanwa make use of both of them.

10.1 Han (addition)

CJK Compatibility Ideographs (addition) 

Unicode 3.2 adds 59 new ideographs to the Compatibility Ideographs block. These new compatibility ideographs are found from U+FA30 to U+FA6A. They are included in the Unicode Standard to provide full round-trip compatibility with the ideographic repertoire of JIS X 0213:2000 and should not be used for any other purpose.
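
The count of 59 follows directly from the range endpoints. The sketch below checks that arithmetic and, as an additional observation not stated above, uses Python's standard `unicodedata` module to show that each of these code points carries a canonical decomposition, so normalization replaces it with another ideograph. The helper name `normalized_equivalent` is hypothetical, and the results depend on the Unicode data bundled with the Python build in use.

```python
import unicodedata

FIRST, LAST = 0xFA30, 0xFA6A  # range given in the text above

new_ideographs = [chr(cp) for cp in range(FIRST, LAST + 1)]
count = len(new_ideographs)  # 0xFA6A - 0xFA30 + 1


def normalized_equivalent(ch):
    """Return the NFC form of ch (for these characters, a different ideograph)."""
    return unicodedata.normalize("NFC", ch)
```

Because normalization does not preserve these characters, round-trip conversion to and from JIS X 0213:2000 is only reliable for text that has not been normalized.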

10.3 Katakana (addition)

Katakana Phonetic Extensions (addition) 

Katakana Phonetic Extensions: U+31F0..U+31FF

These extensions to the Katakana syllabary are all “small” variants. They are used in Japan for phonetic transcription of Ainu and other languages.

10.4 Hangul (addition) 

Hangul Compatibility Jamo

When Hangul compatibility jamo are transformed with a compatibility normalization form, NFKD or NFKC, the characters are converted to the corresponding conjoining jamo characters. Where the characters are intended to remain in separate syllables after such transformation, they may require separation from adjacent characters. This can be done by inserting any non-Korean character.

For example, the table below illustrates how two Hangul compatibility jamo can be separated in display, even after transforming with NFKD or NFKC.

Separating Jamo Characters
Original          NFKD              NFKC              Display
3131 314F         1100 1161         AC00              가
3131 200B 314F    1100 200B 1161    1100 200B 1161    ㄱㅏ
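
The behavior in the table can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `forms` is hypothetical, and U+200B ZERO WIDTH SPACE is used as the separator only because it is the character shown in the table.

```python
import unicodedata

ZWSP = "\u200B"  # ZERO WIDTH SPACE, the separator used in the table


def forms(text):
    """Return the NFKD and NFKC forms of text, keyed by form name."""
    return {form: unicodedata.normalize(form, text) for form in ("NFKD", "NFKC")}


# Compatibility jamo KIYEOK + A, adjacent and separated.
adjacent = forms("\u3131\u314F")
separated = forms("\u3131" + ZWSP + "\u314F")
```

In the adjacent case the NFKC result is the single syllable U+AC00; in the separated case the conjoining jamo remain apart in both forms because the intervening character blocks composition.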


11.4 Mongolian (addition)

Standardized Variants of Mongolian Characters (addition) 

Like Arabic letters, Mongolian letters have various presentation forms depending on their positions in words. There are additional linguistic constraints that result in variations that must be employed in specific contexts, creating the need for several Mongolian-specific variation selectors, which are encoded at U+180B, U+180C, and U+180D.

The table of standardized variants in the Unicode Character Database found at http://www.unicode.org/Public/3.2-Update/StandardizedVariants-3.2.0.html provides a description of the variant appearances corresponding to the use of appropriate variation selectors with all allowed base Mongolian characters. Only some presentation forms of the base Mongolian characters used with the Mongolian free variation selectors produce variant appearances. These combinations are exhaustively listed and described in the table. All combinations not listed in the table are unspecified and are reserved for future standardization; no conformant process may interpret them as standardized variants.

For more information, see Section 13.7, Variation Selectors, later in this document.

12.4 Mathematical Operators (additions)

In addition to the symbols in these blocks, mathematical and scientific notation makes frequent use of arrows, punctuation characters, letterlike symbols, geometrical shapes and other miscellaneous and technical symbols. For additional information on all the mathematical operators and other symbols, see Proposed Draft Unicode Technical Report #25, “Unicode Support for Mathematics.”

Other symbols used in mathematical and scientific notation can be found in the Geometric Shapes block. For an extensive discussion of mathematical alphanumeric symbols, see Section 12.2, Letterlike Symbols in The Unicode Standard, Version 3.0.

Supplements to Mathematical Operators and Arrows

The Unicode Standard defines a number of additional blocks to supplement the repertoire of mathematical operators and arrows. These additions are intended to extend the Unicode repertoire sufficiently to cover the needs of such applications as MathML, modern mathematical formula editing and presentation software, and symbolic algebra systems.

Standards. MathML, an XML application, is intended to support the full legacy collection of the ISO mathematical entity sets. Accordingly, the repertoire of mathematical symbols for the Unicode Standard has been supplemented by the full list of mathematical entity sets in ISO TR 9573-13, Public entity sets for mathematics and science. Additional repertoire was provided from the amalgamated collection of the STIX Project (Scientific and Technical Information Exchange). That collection includes, but is not limited to, symbols gleaned from mathematical publications by experts of the American Mathematical Society and symbol sets provided by Elsevier Publishing and by the American Physical Society.

Semantics. The same mathematical symbol may have different meanings in different subdisciplines or different contexts. The Unicode Standard only encodes a single character for a single symbolic form. For example, the “+” symbol normally denotes addition in a mathematical context, but might refer to concatenation in a computer science context dealing with strings, or incrementation, or have any number of other functions in given contexts. It is up to the application to distinguish such meanings according to the appropriate context. Where information is available about the usage (or usages) of particular symbols, it has been indicated in the character annotations in Chapter 14, Code Charts in The Unicode Standard, Version 3.0.

Supplemental Mathematical Operators: U+2A00–U+2AFF

This block contains many additional symbols to supplement the collection of mathematical operators.

Miscellaneous Mathematical Symbols-A: U+27C0–U+27EF

This block contains symbols used mostly as operators or delimiters in mathematical notation.

Mathematical Brackets. The mathematical white square brackets, angle brackets, and double angle brackets encoded at U+27E6..U+27EB are intended for ordinary mathematical use of these particular bracket types. They are unambiguously narrow, for use in mathematical and scientific notation, and should be distinguished from the corresponding wide forms of white square brackets, angle brackets, and double angle brackets used in CJK typography. (See the CJK Symbols and Punctuation block.) Note especially that the “bra” and “ket” angle brackets, U+2329 LEFT-POINTING ANGLE BRACKET and U+232A RIGHT-POINTING ANGLE BRACKET, are now deprecated for use with mathematics because of their canonical equivalence to CJK angle brackets, which is likely to result in unintended spacing problems if used in mathematical formulae.

Miscellaneous Mathematical Symbols-B: U+2980–U+29FF

This block contains miscellaneous symbols used for mathematical notation, including fences and other delimiters. Some of the symbols in this block may also be used as operators in some contexts.

Wiggly Fence. U+29D8 LEFT WIGGLY FENCE has a superficial similarity to U+FE34 PRESENTATION FORM FOR VERTICAL WAVY LOW LINE. The latter is a wiggly sidebar character, intended for legacy support as a style of underlining character in a vertical text layout context; it has a compatibility mapping to U+005F LOW LINE. This represents a very different usage from the standard use of fence characters in mathematical notation.
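
The difference in character properties can be observed with Python's standard `unicodedata` module. The following is a small sketch under the assumption that the bundled Unicode data includes these characters; the variable names are illustrative only.

```python
import unicodedata

# Look the fence up by name rather than hard-coding its code point.
left_wiggly_fence = unicodedata.lookup("LEFT WIGGLY FENCE")
vertical_low_line_form = "\uFE34"

# Compatibility normalization folds the presentation form to LOW LINE,
# but leaves the mathematical fence unchanged.
folded_form = unicodedata.normalize("NFKC", vertical_low_line_form)
folded_fence = unicodedata.normalize("NFKC", left_wiggly_fence)
```

Here `folded_form` is U+005F LOW LINE, while `folded_fence` is still the fence character, which reflects that only the former is a compatibility character.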

Supplemental Arrows-A: U+27F0–U+27FF

This block contains a small additional set of arrows to supplement the main set in the Arrows block.

Long Arrows. The long arrows encoded in the range U+27F5..U+27FF map to standard SGML entity sets supported by MathML. Long arrows represent distinct semantics from their short counterparts, rather than mere stylistic glyph differences. For example, the shorter forms of arrows are often used in connection with limits, whereas the longer ones are associated with mappings. The use of the long arrows is so common that they were assigned entity names in the ISOAMSA entity set, one of the suite of mathematical symbol entity sets covered by the Unicode Standard.

Supplemental Arrows-B: U+2900–U+297F

This block contains a large additional repertoire of arrows to round out the main set in the Arrows block.

12.5 Technical Symbols (additions)

Miscellaneous Technical: U+2300-U+23FF (additions)

Keytop Labels. [to precede “Crops and Quine Corners”] Where possible, keytop labels have been unified with other symbols of like appearance, for example U+21E7 UPWARDS WHITE ARROW to indicate the shift key. While symbols such as U+2318 PLACE OF INTEREST SIGN and U+2388 HELM SYMBOL are generic symbols that have been adapted to use on keytops, other symbols specifically follow ISO/IEC 9995-7.

Angle Brackets. [to follow “Crops and Quine Corners”] U+2329 LEFT-POINTING ANGLE BRACKET and U+232A RIGHT-POINTING ANGLE BRACKET have long been canonically equivalent to the CJK punctuation characters, U+3008 LEFT ANGLE BRACKET and U+3009 RIGHT ANGLE BRACKET, respectively. This canonical equivalence implies that the use of the latter (CJK) code points is preferred, and that U+2329 and U+232A are also “wide” characters. (See Unicode Standard Annex #11, “East Asian Width,” for the definition of the East Asian wide property.) Because of this fact, the use of U+2329 and U+232A is deprecated for mathematics and technical publication, where the wide property of the characters has the potential for interfering with proper formatting of mathematical formulae. Instead, use the angle brackets specifically provided for mathematics: U+27E8 MATHEMATICAL LEFT ANGLE BRACKET and U+27E9 MATHEMATICAL RIGHT ANGLE BRACKET. See Section 12.4, Mathematical Operators earlier in this document.
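
Both properties mentioned above, the canonical equivalence and the wide East Asian width, can be inspected with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `describe` is hypothetical, and the values returned depend on the Unicode data bundled with the Python build.

```python
import unicodedata


def describe(ch):
    """Return the NFC form and East Asian Width property value of ch."""
    return {
        "nfc": unicodedata.normalize("NFC", ch),
        "width": unicodedata.east_asian_width(ch),
    }


technical = describe("\u2329")     # LEFT-POINTING ANGLE BRACKET
mathematical = describe("\u27E8")  # MATHEMATICAL LEFT ANGLE BRACKET
```

Normalizing U+2329 yields U+3008 and reports a wide character, whereas U+27E8 is unchanged by normalization and is not wide, which is why it is the safer choice in stored mathematical text.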

Symbol Pieces. [to follow “APL Functional Symbols”] The characters in the range U+239B..U+23B3, plus U+23B7, comprise a set of bracket and other symbol fragments for use in mathematical typesetting. These pieces originated in older font standards, but have been used in past mathematical processing as characters in their own right to make up extra-tall glyphs for enclosing multi-line mathematical formulae. Mathematical fences are ordinarily sized to the content that they enclose. However, in creating a large fence, the glyph is not scaled proportionally; in particular the displayed stem weights must remain compatible with the accompanying smaller characters. Thus, simple scaling of font outlines cannot be used to create tall brackets. Instead, a common technique is to build up the symbol from pieces. In particular, the characters U+239B LEFT PARENTHESIS UPPER HOOK through U+23B3 SUMMATION BOTTOM represent a set of glyph pieces for building up large versions of the fences (, ), [, ], {, and }, and of the large operators ∑ and ∫. These brace and operator pieces are compatibility characters. They should not be used in stored mathematical text, but are often used in the data stream created by display and print drivers.

The following table shows which pieces are intended to be used together to create specific symbols.

Use of Symbol Pieces

                    2-row         3-row               5-row
Summation           23B2, 23B3
Integral            2320, 2321    2320, 23AE, 2321    2320, 3×23AE, 2321
Left Parenthesis    239B, 239D    239B, 239C, 239D    239B, 3×239C, 239D
Right Parenthesis   239E, 23A0    239E, 239F, 23A0    239E, 3×239F, 23A0
Left Bracket        23A1, 23A3    23A1, 23A2, 23A3    23A1, 3×23A2, 23A3
Right Bracket       23A4, 23A6    23A4, 23A5, 23A6    23A4, 3×23A5, 23A6
Left Brace          23B0, 23B1    23A7, 23A8, 23A9    23A7, 23AA, 23A8, 23AA, 23A9
Right Brace         23B1, 23B0    23AB, 23AC, 23AD    23AB, 23AA, 23AC, 23AA, 23AD

For example, an instance of U+239B can be positioned relative to instances of U+239C and U+239D to form an extra-tall (three or more line) left parenthesis. The center sections encoded here are meant to be used only with the top and bottom pieces encoded adjacent to them because the segments are usually graphically constructed within the fonts so that they match perfectly when positioned at the same x coordinates.
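
The top, extension, and bottom pattern can be sketched as a small helper that lists the pieces for a symbol of a given height. This is an illustration only: the names `PIECES` and `tall_symbol` are hypothetical, braces are omitted because they also need a middle piece, and the sketch returns code points from top to bottom without performing any actual glyph positioning.

```python
# (top, extension, bottom) pieces for symbols built from three kinds of piece.
PIECES = {
    "integral": (0x2320, 0x23AE, 0x2321),
    "left parenthesis": (0x239B, 0x239C, 0x239D),
    "right parenthesis": (0x239E, 0x239F, 0x23A0),
    "left bracket": (0x23A1, 0x23A2, 0x23A3),
    "right bracket": (0x23A4, 0x23A5, 0x23A6),
}


def tall_symbol(name, rows):
    """Return the code points, top to bottom, for a symbol spanning `rows` rows."""
    if rows < 2:
        raise ValueError("a built-up symbol needs at least two rows")
    top, extension, bottom = PIECES[name]
    return [top] + [extension] * (rows - 2) + [bottom]
```

A display driver would place each returned piece in successive rows at the same x coordinate; as noted above, such sequences belong in the output data stream rather than in stored mathematical text.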

Vertical Square Brackets. The vertical square brackets, U+23B4 TOP SQUARE BRACKET and U+23B5 BOTTOM SQUARE BRACKET, are compatibility characters for legacy applications emulating certain terminals. They are intended for those terminal applications only, for limited use in vertically-oriented bracketed expressions. U+23B6 BOTTOM SQUARE BRACKET OVER TOP SQUARE BRACKET is used when a single character cell is both the end of one such expression and the start of another. These compatibility characters should not be confused with the general need for rotated glyphs for parentheses, brackets, braces, and quotation marks for vertically rendered CJK text. Such rotations should be handled by fonts and rendering software, rather than by separate encoding of each rotated glyph as a character. See further discussion in Section 6.1, General Punctuation in The Unicode Standard, Version 3.0.

Terminal Graphics Characters. In addition to the box-drawing characters in the Box Drawing block, a small number of additional vertical or horizontal line characters are encoded in the Miscellaneous Technical symbols block to complete the set of compatibility characters needed for applications which need to emulate various old terminals. The horizontal scan line characters, U+23BA HORIZONTAL SCAN LINE-1 through U+23BD HORIZONTAL SCAN LINE-9, in particular, represent characters that were encoded in character ROM for use with 9-line character graphic cells. Horizontal scan line characters are encoded for scan lines 1, 3, 7, and 9. The horizontal scan line character for scan line 5 is unified with U+2500 BOX DRAWINGS LIGHT HORIZONTAL.
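
For a terminal emulator, the unification of scan line 5 with the box-drawing character means that a lookup from scan line number to character draws on two blocks. The mapping below is a sketch; the names `SCAN_LINE` and `scan_line_char` are hypothetical, and the character names are checked against Python's standard `unicodedata` module.

```python
import unicodedata

# Scan line number -> character. Line 5 is unified with the box-drawing
# light horizontal line rather than encoded separately.
SCAN_LINE = {
    1: "\u23BA",
    3: "\u23BB",
    5: "\u2500",
    7: "\u23BC",
    9: "\u23BD",
}


def scan_line_char(line):
    """Return the character for a horizontal line on the given scan line."""
    return SCAN_LINE[line]


scan_line_names = {n: unicodedata.name(ch) for n, ch in SCAN_LINE.items()}
```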

Dental Symbols. The characters from U+23BE to U+23CC are a set of symbols from JIS X 0213 for use in dental notation.

Standards. This block contains a large number of symbols from ISO/IEC 9995-7:1994, Information technology—Keyboard layouts for text and office systems—Part 7: Symbols used to represent functions.

12.7 Miscellaneous Symbols and Dingbats (new subsection, revision and addition)

Recycling Symbols (new subsection in Miscellaneous Symbols: U+2600-U+26FF)

Plastic Bottle Material Code System. The seven numbered logos encoded from U+2673 to U+2679 are from “The Plastic Bottle Material Code System,” introduced in 1988 by the Society of the Plastics Industry (SPI) (see http://www.socplas.org). This set consistently uses thin, two-dimensional curved arrows suitable for use in plastics molding. In actual use, the symbols often are combined with an abbreviation of the material class below the triangle. Such abbreviations are not universal, therefore they are not present in the representative glyphs in Chapter 14, Code Charts in The Unicode Standard, Version 3.0.

Recycling Symbol for Generic Materials. An unnumbered plastic resin code symbol U+267A RECYCLING SYMBOL FOR GENERIC MATERIALS is not formally part of the SPI system, but is found in many fonts. Occasional use of this symbol as a generic materials code symbol can be found in the field, usually with a text legend below, but sometimes also surrounding (or overlaid by) other text or symbols. Sometimes, the UNIVERSAL RECYCLING SYMBOL is substituted for the generic symbol in this context.

Universal Recycling Symbol. Unicode encodes two common glyph variants of this symbol, U+2672 UNIVERSAL RECYCLING SYMBOL and U+267B BLACK UNIVERSAL RECYCLING SYMBOL. Both are used to indicate that the material is recyclable. The white form is the traditional version of the symbol, but the black form is sometimes substituted, presumably because the thin outlines of the white form do not always reproduce well.

Paper Recycling Symbols. The two paper recycling symbols U+267C RECYCLED PAPER SYMBOL and U+267D PARTIALLY-RECYCLED PAPER SYMBOL can be used to distinguish fully and partially recycled fiber content in paper products or packaging. They are usually accompanied by additional text.

Dingbats: U+2700-U+27BF (revision) 

The following text replaces the text on Dingbats on pages 305-306 of The Unicode Standard, Version 3.0:

The Dingbats are derived from a well-established set of glyphs, the ITC Zapf Dingbats series 100, which comprises the industry standard “Zapf Dingbat” font currently available in most laser printers. Other series of dingbat glyphs also exist, but are not encoded in the Unicode Standard because they are not widely implemented in existing hardware and software as character-encoded fonts. The order of the Dingbats block basically follows the PostScript encoding.

Unifications. Where a dingbat from the ITC Zapf Dingbats series 100 could be unified with a generic symbol widely used in other contexts, only the generic symbol was encoded. This accounts for the encoding gaps in the Dingbats block. Examples of such unifications include card suits, BLACK STAR, BLACK TELEPHONE, and BLACK RIGHT-POINTING INDEX (see “Miscellaneous Symbols”); BLACK CIRCLE and BLACK SQUARE (see “Geometric Shapes”); white encircled numbers 1 to 10 (see “Enclosed Alphanumerics”); and several generic arrows (see “Arrows”). Those four entries appear elsewhere in this section.

In other instances, glyphs from the ITC Zapf Dingbats series 100 have come to be recognized as having applicability as generic symbols, despite having originally been encoded in the Dingbats block. For example, the series of negative (black) circled numbers 1 to 10 are now treated as generic symbols for this sequence, the continuation of which can be found in “Enclosed Alphanumerics”. Other examples include U+2708 AIRPLANE and U+2709 ENVELOPE, which have definite semantics independent of the specific glyph shape, and which therefore should be considered generic symbols, rather than as symbols representing only the Zapf Dingbat glyph shapes.

For many of the remaining characters in the Dingbats block, their semantic value is primarily their shape; unlike characters that represent letters from a script, there is no well-established range of typeface variations for a dingbat that will retain its identity and therefore its semantics. It would be incorrect to arbitrarily replace U+279D TRIANGLE-HEADED RIGHTWARDS ARROW with any other right arrow dingbat or with any of the generic arrows from the Arrows block (U+2190..U+21FF). But exact shape retention for the glyphs is not always required in order to maintain the relevant distinctions. For example, ornamental characters such as U+2741 EIGHT PETALLED OUTLINE BLACK FLORETTE have been successfully implemented in font faces other than Zapf Dingbats with glyph shapes which are similar, but not identical to the ITC Zapf Dingbats series 100.

The following guidelines are provided for font developers wishing to support this block of characters. Characters showing large sets of contrastive glyph shapes in the Dingbats block, and in particular the various arrow shapes at U+2794..U+27BE, should have glyphs that are closely modeled on the ITC Zapf Dingbats series 100, which are shown as representative glyphs in the code charts. The same applies to the various stars, asterisks, and snowflakes, drop-shadowed squares, checkmarks, and x’s, many of which are ornamental, and have an elaborate name describing their glyph.

Where the above does not apply, or where dingbats have more generic applicability as a symbol, their glyphs do not need to match the representative glyphs in the code charts in every detail.

Ornamental Brackets (addition to Dingbats: U+2700-U+27BF)

Ornamental Brackets. The 14 ornamental brackets encoded at U+2768..U+2775 are a late addition to the set of Zapf Dingbats encoded in the Unicode Standard. Although they have always been included in Zapf Dingbats fonts, they were unencoded in