[Unicode]  Technical Reports
 

(2nd) Proposed Update Unicode Technical Report #25

Unicode Support for Mathematics

Authors Barbara Beeton (bnb@ams.org), Asmus Freytag (asmus@unicode.org), Murray Sargent III (murrays@microsoft.com)
Date 2007-2-1
This Version http://www.unicode.org/reports/tr25/tr25-8.html
Previous Version http://www.unicode.org/reports/tr25/tr25-6.html
Latest Version http://www.unicode.org/reports/tr25
Revision 8

Summary

The Unicode Standard includes virtually all of the standard characters used in mathematics. This set supports a wide variety of math usage on computers, including in document presentation languages like TeX, in math markup languages like MathML and OpenMath, in internal representations of mathematics for applications like Mathematica, Maple, and MathCAD, in computer programs, and in plain text. This technical report describes the Unicode mathematics character groups and gives some of their imputed default math properties.

NOTE TO REVIEWERS:

Significant changes to the text are marked. Extensive copy editing was applied to this document compared to the latest published version, but most of those text changes have not been marked in order to keep the text readable.

A number of sections are marked [proposed]. The issues addressed in these sections have proposed solutions submitted separately to the UTC. The text in these proposed sections as written would apply if these proposals are adopted. The plan is to update these sections to match the outcome of the disposition of such proposals. The authors feel that capturing the ongoing work in these instances is beneficial to the user community in order to provide more meaningful reviews. Where pending character proposals are mentioned, code points are not given as they are tentative until characters are finally encoded.

Status

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium.  This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Contents

  1. Overview
  2. Mathematical Character Repertoire
  3. Mathematical Character Properties
  4. Implementation Guidelines
  5. Data Files
  6. Security Considerations

1 Overview

All of science and technology uses formulas, equations, and mathematical notation as part of the language of the subject. This report discusses the character repertoire of the Unicode Standard [Unicode] as used for mathematics, but the discussion is intended to apply to mathematical notation in general.

Mathematical documents using the Arabic script use additional conventions, in particular when typesetting mathematics from right to left. Such conventions are not documented here. This report also does not discuss mathematical symbols of purely historical or local interest, such as symbols found in ancient mathematical texts or digits used in script-specific systems for writing numeric quantities.

As described in the Unicode Character Property Model [PropMod], each Unicode character has associated character properties. This report describes the properties relevant to the mathematics character repertoire, including a number of properties that are not yet part of the Unicode Standard, and details character classifications by usage and by typography. In addition, this report gives some implementation guidelines for input methods and use of Unicode math characters in programming languages.

Some of the text of the character block descriptions in the Unicode Standard was based on early drafts of this report; as a result there is significant overlap, although the focus of the presentation is different. As always, wherever there is a discrepancy, the text of the Standard has precedence.

The notational conventions follow the use in [Unicode]. Due to limitations of the plain HTML format of this report, examples of mathematical formulas are shown in larger size than would be typical for a mathematical paper, and their layout, spacing and vertical alignment are merely approximations of the correct appearance.

2 Mathematical Character Repertoire

The Unicode Standard provides a nearly complete set of standard math characters to support publication of mathematics on and off the web. The early versions of Unicode, through version 3.0, already included over three hundred math-specific symbols. Unicode 3.1 introduced almost a thousand new alphanumeric symbols, and Unicode 3.2 introduced six hundred new characters for operators, arrows, and delimiters, for a total of around 2000 mathematical symbols. The more limited additions to the repertoire in the versions since then have filled some gaps in coverage, in particular for mapping existing ISO entity sets for publishing [ISO9573].

The repertoire of mathematical characters in [Unicode] is the result of input from many sources, notably from the STIX Project (Scientific and Technical Information Exchange) [STIX], a collaborative project of scientific and technical publishers. The STIX collection includes, but is not limited to, symbols gleaned from mathematical publications by experts from the American Mathematical Society (AMS), and symbol sets provided by Elsevier Publishing and by the American Physical Society. This repertoire enables the display of virtually all standard mathematical symbols. Nevertheless no collection of mathematical symbols can ever be considered complete; mathematicians and other scientists are continually inventing new mathematical symbols, which will be considered for addition as they become widely accepted in the scientific communities.

Mathematical Markup Language (MathML™) [MathML], an XML application [XML], is a major beneficiary of the increased repertoire for mathematical symbols. The W3C Math Working Group, which developed MathML, lobbied in favor of the inclusion of the new characters. In addition, the new characters lend themselves to direct plain text encoding of mathematics for various purposes, which can be much more compact than MathML or TeX, the typesetting language and program designed by Donald Knuth [TeX] (see Section 4, Implementation Guidelines).

2.1 Mathematical Alphanumeric Symbols Block

The Mathematical Alphanumeric Symbols block (U+1D400–U+1D7FF) contains a large collection of letterlike symbols for use in mathematical notation, typically for variables. The characters in this block are intended for use only in mathematical or technical notation; they are not intended for use in non-technical text. When used with markup languages, for example with MathML, the characters are expected to be used directly, instead of indirectly via entity references or by composing them from base letters and style markup.

Words Used as Variables. In some specialties, whole words are used as variables, not just single letters. For these cases, style markup is preferred because the juxtaposition of variables generally implies multiplication, or some other composition, in ordinary mathematical notation, not word formation as in ordinary text. Markup not only provides the necessary scoping in these cases, it also allows the use of a more extended alphabet.

2.2  Mathematical Alphabets

Basic Set of Alphanumeric Characters. Mathematical notation uses a basic set of mathematical alphanumeric characters which consists of:

For some characters in the basic set of Greek characters, two variants of the same character are included. This is because they can appear in the same mathematical document with different meanings, even though they would have the same meaning in Greek text.

Mathematical Accents. The diacritics, or accents, in mathematical text usually have special semantic significance different from that of changing the pronunciation of a letter, as is the case for text accents. Because the use of text accents such as the acute accent would interfere with common mathematical diacritics, only unaccented forms of the letters are used for mathematical notation. Examples of common mathematical diacritics that can be confused with text accents are the circumflex, macron, or the single or double dot above, the latter two of which are commonly used in physics to denote derivatives with respect to the time variable. 

Mathematical symbols with diacritics are always represented by combining character sequences, except as required by normalization. See Unicode Standard Annex #15, “Unicode Normalization Forms” [Normalization] for more information. Note that normalization leaves all characters in the Mathematical Alphanumeric Symbols and Letterlike Symbols blocks unaffected. These blocks contain nearly all alphabetic characters used as math symbols.

Additional Characters. In addition to this basic set, mathematical notation also uses the bold upper- and lowercase digamma (U+1D7CA and U+1D7CB), and the four Hebrew-derived characters (U+2135..U+2138), for example in ℵ0 for the first transfinite cardinal. Occasional uses of other alphabetic and numeric characters are known. Examples include U+0428 Ш cyrillic capital letter sha, U+306E の hiragana letter no, the ideograph U+4E2D 中 and Eastern Arabic-Indic digits (U+06F0..U+06F9). However, unlike the characters in the mathematical alphabets, these characters are only used in a single, basic form.

Dotless Characters. In Unicode, the characters "i" and "j", including their variations in the mathematical alphabets, have the Soft_Dotted property. Any conformant renderer will remove the dot when the character is followed by a nonspacing combining mark above. Therefore using an individual mathematical italic i or j with math accents would result in the intended display. However, in mathematical equations an entire sub-expression can be placed underneath a math accent, for example, when a 'wide hat' is placed on top of i + j, as in this example shown together with the corresponding [TeX] notation:

wide hat example

$\widehat{\imath + \jmath} = \hat{\imath} + \hat{\jmath}.$

Whenever a mathematical accent applies to an entire subexpression, a renderer can no longer rely simply on the presence of an adjacent combining character to substitute the un-dotted glyph; whether the dots should be removed in such a situation is no longer predictable. In TeX, this decision is left to the author, and some authors would want to use the dotted forms as in $\widehat{i + j}$.

In some documents mathematical italic dotless i or j are used explicitly without any combining marks, or even in contrast to the dotted versions. Therefore, the Unicode Standard provides the explicitly dotless characters U+1D6A4 MATHEMATICAL ITALIC DOTLESS I and U+1D6A5 MATHEMATICAL ITALIC DOTLESS J. They map to the ISOAMSO entities imath and jmath or the [TeX] macros \imath and \jmath, which by default are always italic. Their appearance in the code charts is similar to the shapes documented in the ISO 9573-13 entity sets and used by TeX. They do not form case pairs.

Where a math accent is immediately applied to these entities, as in $\hat{\imath } + \hat{\jmath}$, they could be mapped to mathematical italic i or j when converting to Unicode, but making general substitutions could result in an unintended appearance or a change to the document.
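The conversion choice described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a complete TeX converter: the function name convert and the two regular expressions are hypothetical, only \hat applied directly to \imath or \jmath is recognized, and every other occurrence is mapped conservatively to the explicitly dotless characters.

```python
import re

# Explicitly dotless characters named in this report.
DOTLESS = {"imath": "\U0001D6A4", "jmath": "\U0001D6A5"}
# Mathematical italic small i and j (Soft_Dotted, so a renderer drops the dot
# when a combining mark above follows).
DOTTED = {"imath": "\U0001D456", "jmath": "\U0001D457"}
COMBINING_CIRCUMFLEX = "\u0302"

# \hat applied immediately to \imath or \jmath (optional space before the brace).
_HAT = re.compile(r"\\hat\{\\(imath|jmath)\s*\}")
# Any remaining \imath or \jmath macro.
_BARE = re.compile(r"\\(imath|jmath)(?![A-Za-z])")

def convert(tex):
    """Map \\imath and \\jmath to Unicode, using the dotted letter only
    when a single accent applies directly to it."""
    tex = _HAT.sub(lambda m: DOTTED[m.group(1)] + COMBINING_CIRCUMFLEX, tex)
    return _BARE.sub(lambda m: DOTLESS[m.group(1)], tex)
```

Under a wide accent such as \widehat, the sketch keeps the dotless characters, so the author's explicit choice survives the conversion.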

Semantic Distinctions. Mathematical notation requires a number of Latin and Greek alphabets that initially appear to be mere font variations of one another. For example, the letter H can appear as plain or upright (H), bold (𝐇), italic (𝐻), and script (ℋ). However, in any given document, these characters have distinct, and usually unrelated, mathematical semantics. For example, a normal H represents a different variable from a bold H, etc. If these attributes are dropped in plain text, the distinctions are lost and the meaning of the text is altered. Without the distinctions, the well-known Hamiltonian formula:

Hamiltonian formula,

turns into the integral equation in the variable H:

integral equation in H.

Mathematicians will object that a properly formatted integral equation requires all the letters in this example (except for the "d") to be in italics. However, because the distinction between ℋ and H has been lost, they would recognize it as a fallback representation of an integral equation, and not as a fallback representation of the Hamiltonian. By encoding a separate set of alphabets, it is possible to preserve such distinctions in plain text.

Mathematical Alphabets. The alphanumeric symbols encountered in mathematics are given in the following table:

Table 2.1 Mathematical Alphabets

Math Style                  Characters from Basic Set  Location
plain (upright, serifed)    Latin, Greek and digits    BMP
bold                        Latin, Greek and digits    Plane 1
italic                      Latin and Greek            Plane 1*
bold italic                 Latin and Greek            Plane 1
script (calligraphic)       Latin                      Plane 1*
bold script (calligraphic)  Latin                      Plane 1
Fraktur                     Latin                      Plane 1*
bold Fraktur                Latin                      Plane 1
double-struck               Latin and digits           Plane 1*
sans-serif                  Latin and digits           Plane 1
sans-serif bold             Latin, Greek and digits    Plane 1
sans-serif italic           Latin                      Plane 1
sans-serif bold italic      Latin and Greek            Plane 1
monospace                   Latin and digits           Plane 1

* Some of these alphabets have characters in the BMP as noted in the following section.

The plain letters have been unified with the existing characters in the Basic Latin and Greek blocks. There are 24 double-struck, italic, Fraktur and script characters that already exist in the Letterlike Symbols block (U+2100–U+214F). These are explicitly unified with the characters in this block and corresponding holes have been left in the mathematical alphabets.
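Software that generates styled letters by offset arithmetic has to account for these holes. The following minimal sketch, assuming only Python's standard unicodedata module, covers the italic Latin alphabet, whose one hole (small h) is filled by U+210E planck constant in the Letterlike Symbols block. The names math_italic and ITALIC_HOLES are illustrative and not part of any standard API.

```python
import unicodedata

ITALIC_CAPITAL_A = 0x1D434   # MATHEMATICAL ITALIC CAPITAL A
ITALIC_SMALL_A = 0x1D44E     # MATHEMATICAL ITALIC SMALL A

# Letters whose italic form lives in the Letterlike Symbols block instead.
ITALIC_HOLES = {"h": "\u210E"}   # PLANCK CONSTANT

def math_italic(letter):
    """Return the mathematical italic form of a basic Latin letter."""
    if letter in ITALIC_HOLES:
        return ITALIC_HOLES[letter]
    if "a" <= letter <= "z":
        return chr(ITALIC_SMALL_A + ord(letter) - ord("a"))
    if "A" <= letter <= "Z":
        return chr(ITALIC_CAPITAL_A + ord(letter) - ord("A"))
    raise ValueError("not a basic Latin letter: %r" % letter)

# The code point where italic small h would fall is left unassigned.
HOLE = chr(ITALIC_SMALL_A + ord("h") - ord("a"))
```

Calling unicodedata.name(HOLE, None) returns None for the reserved position, which is one way to detect a hole without a hand-written table.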

Compatibility Decompositions. All mathematical alphanumeric symbols have compatibility decompositions to the base Latin and Greek letters. Folding away such distinctions, however, is usually not desirable, as it loses the semantic distinctions for which these characters were encoded. See Unicode Standard Annex #15, Unicode Normalization Forms [Normalization] for more information.
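The effect of compatibility folding can be observed with Python's standard unicodedata module. In this small sketch the helper name fold_styles is illustrative; it simply applies NFKC, which is one of the operations that folds the styled letters onto the plain ones.

```python
import unicodedata

BOLD_H = "\U0001D407"    # MATHEMATICAL BOLD CAPITAL H
SCRIPT_H = "\u210B"      # SCRIPT CAPITAL H (Letterlike Symbols block)

def fold_styles(text):
    """Apply compatibility normalization, which discards math styles."""
    return unicodedata.normalize("NFKC", text)

# Canonical normalization leaves the styled letters alone ...
canonical = unicodedata.normalize("NFC", BOLD_H + SCRIPT_H + "H")
# ... while compatibility normalization maps all three to the same letter.
folded = fold_styles(BOLD_H + SCRIPT_H + "H")
```

After folding, the three variables can no longer be told apart, which is exactly the loss of meaning described above.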

Typical Uses. The following list catalogs examples of typical uses for some of these styles without intending to be exhaustive or exclusive.

2.3 Fonts Used for Mathematical Alphabets

Mathematicians place strict requirements on the specific fonts being used to represent mathematical variables. Readers of a mathematical text need to be able to distinguish single letter variables from each other, even when they do not appear in close proximity. They must be able to recognize the letter itself, whether it is part of the text or is a mathematical variable, and lastly which mathematical alphabet it is from.

Fraktur. The black letter style is often referred to as Fraktur or Gothic in various sources. Technically, Fraktur and Gothic typefaces are distinct designs from black letter, but any of several font styles similar in appearance to the forms shown in the charts can be used.

Math Italics. Mathematical variables are most commonly set in a form of italics, but not all italic fonts can be used successfully. In common text fonts, the italic letter v and Greek letter nu are not very distinct. A rounded italic letter v is therefore preferred in a mathematical font, as long as it is distinct from the Greek upsilon. Other characters sometimes have similar shapes and require special attention to avoid ambiguity. Examples are shown in the table below.

Examples

Theorems are commonly printed in a text italic font. A font intended for mathematical variables should support clear visual distinctions so that variables can be reliably separated from italic text in a theorem. Some languages have common single letter words (English a, Scandinavian i, etc.), which can otherwise be easily confused with common variables.

Hard-to-distinguish Letters. Not all sans-serif fonts allow an easy distinction between lowercase l and uppercase I, and not all monospaced (fixed-width) fonts allow a distinction between the letter l and the digit 1. Such fonts are not usable for mathematics. In Fraktur, the letters I and J in particular must be made distinguishable. Overburdened Black Letter forms like I and J are inappropriate. Similarly, the digit zero must be distinct from the uppercase letter O, and the empty set ∅ must be distinct from the letter o with stroke ('Ø') for all mathematical alphanumeric sets. Some characters are so similar that even mathematical fonts do not attempt to provide distinguished glyphs for them. Their use is normally avoided in mathematical notation unless no confusion is possible in a given context, for example uppercase A and uppercase Alpha (A).

Font Support for Combining Diacritics. Mathematical equations require that characters be combined with diacritics (dots, tilde, circumflex, or arrows above are common), as well as followed or preceded by super- or subscripted letters or numbers. This requirement leads to designs for italic styles that are less inclined, and script styles that have smaller overhangs and less slant than equivalent styles commonly used for text such as wedding invitations.

Typestyle for Script Characters. In some instances, a deliberate unification with a non-mathematical symbol has been undertaken; for example, U+2133 ℳ script capital m is unified with the pre-1949 symbol for the German currency unit Mark. This unification restricts the range of glyphs that can be used for this character in the charts. Therefore the font used for the reference glyphs in the code charts uses a simplified ‘English Script’ style, as recommended by the American Mathematical Society. For consistency, other script characters in the Letterlike Symbols block are now shown in the same typestyle.

The two characters U+2113 ℓ script small l, and U+2118 ℘ script capital p, are not regular script characters, despite their character names. The latter is the symbol for the Weierstrass elliptic function, a calligraphic letter shape based on the small p, and the former is derived from a special italic letter shape called an 'ell', and is unified with the common non-SI symbol for the liter [SI]. The characters U+1D4C1 mathematical script small l and U+1D4AB mathematical script capital p are the preferred characters for the script style.

Double-struck Characters. The double-struck glyphs shown in earlier editions of the standard attempted to match the design used for all the other Latin characters in the standard, which is based on Times. The current set of fonts for use in the character code charts was prepared after consultation with the American Mathematical Society and leading publishers of mathematics, and shows much simpler forms that are derived from the forms written on a blackboard. However, this font represents just one possible representation of double-struck characters; both serifed and non-serifed forms can be used in mathematical texts, and inline fonts are found in works published by certain publishers. Some fonts differ in which strokes of a glyph to double, for example the left or right leg of the uppercase A. There is no intention to support any of these stylistic preferences via character encoding; therefore only one set of double-struck mathematical alphanumeric symbols is encoded.

2.3.1 Representative Glyphs for Greek Phi

With Unicode 3.0 and the concurrent second edition of ISO/IEC 10646-1, the representative glyphs for U+03C6 greek letter small phi and U+03D5 greek phi symbol were exchanged. In ordinary Greek text, the character U+03C6 is used exclusively, although this character has considerable glyphic variation, sometimes represented with a glyph more like the representative glyph shown for U+03C6 (the “loopy” form) and less often with a glyph more like the representative glyph shown for U+03D5 (the “straight” form). See the Greek table in the character code charts [Charts].

For mathematical and technical use, the straight form of the small phi is an important symbol and needs to be consistently distinguishable from the loopy form. The straight form phi glyph is used as the representative glyph for the phi symbol at U+03D5 to satisfy this distinction.

The assignment of representative glyphs was reversed in versions of the Unicode Standard prior to Unicode 3.0. As a result, the character explicitly identified as the mathematical symbol did not have the straight form of the character that is the preferred glyph for that use. Furthermore, it made it unnecessarily difficult for general purpose fonts supporting ordinary Greek text to also add support for Greek letters used as mathematical symbols, because many of those fonts already used the loopy form glyph for U+03C6, as preferred for Greek body text. To support the phi symbol as well, they would have had to disrupt glyph choices already optimized for Greek text.

When mapping symbol sets or SGML entities to the Unicode Standard, it is important to make sure that codes or entities, such as phi1, that require the straight form of the phi symbol be mapped to U+03D5 and not to U+03C6. Mapping to the latter should be reserved for codes or entities that represent the small phi as used in ordinary Greek text.

Fonts used primarily for Greek text may use either glyph form for U+03C6, but fonts that also intend to support technical use of the Greek letters should use the loopy form to ensure appropriate contrast with the straight form used for U+03D5.

2.3.2 Representative Glyphs for U+2278 and U+2279

In Unicode 3.2 the representative glyphs for U+2278 neither less-than nor greater-than and U+2279 neither greater-than nor less-than were changed from using a vertical cancellation to using a slanted cancellation to match the long-standing canonical decompositions for these characters, which use U+0338 combining long solidus overlay. Irrespective of this change to the representative glyphs, the symmetric forms using the vertical stroke remain acceptable glyph variants. Using U+2276 ≶ or U+2277 ≷ followed by U+20D2 combining long vertical line overlay represents these upright variants explicitly.

Except for those fonts created with the intention to add support for both forms (via combination of U+2276 ≶ or U+2277 ≷ with U+20D2 for the upright forms) there is no need to revise the glyphs for U+2278 and U+2279: the glyphic range implied by using these character codes encompasses both shapes.
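The relationship between the two representations can be checked with Python's standard unicodedata module. This sketch only inspects normalization behavior; the constant names are illustrative.

```python
import unicodedata

NEITHER_LT_NOR_GT = "\u2278"   # NEITHER LESS-THAN NOR GREATER-THAN
LT_OR_GT = "\u2276"            # LESS-THAN OR GREATER-THAN
SLANTED_OVERLAY = "\u0338"     # COMBINING LONG SOLIDUS OVERLAY
UPRIGHT_OVERLAY = "\u20D2"     # COMBINING LONG VERTICAL LINE OVERLAY

# The canonical decomposition uses the slanted overlay.
decomposed = unicodedata.normalize("NFD", NEITHER_LT_NOR_GT)

# The explicitly upright variant is a plain combining sequence; no
# precomposed character exists for it, so NFC leaves it as two code points.
upright_variant = LT_OR_GT + UPRIGHT_OVERLAY
upright_nfc = unicodedata.normalize("NFC", upright_variant)
```

Because the upright sequence never composes to U+2278, a font that supports both forms can keep them apart without special handling in the text layer.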

2.4  Locating Mathematical Characters

Mathematical characters can be located by looking in the code charts [Charts] at the blocks listed below or by checking the Unicode MATH property, which is assigned to characters that naturally appear in mathematical contexts (see Section 3, Mathematical Character Properties). In the text of this report, all block names are linked to their corresponding online code chart. Mathematical characters can be found in the following blocks:

Table 2.2 Locations of Mathematical Characters

Block Name                         Range               Character Types
Basic Latin                        U+0021–U+007E       Variables, operators, digits*
Greek                              U+0370–U+03FF       Variables*
General Punctuation                U+2000–U+206F       Spaces, Invisible operators*
Letterlike Symbols                 U+2100–U+214F       Variables*
Arrows                             U+2190–U+21FF       Arrows, arrow-like operators
Mathematical Operators             U+2200–U+22FF       Operators
Miscellaneous Technical Symbols    U+2300–U+23FF       Braces, operators*
Geometrical Shapes                 U+25A0–U+25FF       Symbols
Misc. Mathematical Symbols-A       U+27C0–U+27EF       Symbols and operators
Supplemental Arrows-A              U+27F0–U+27FF       Arrows, arrow-like operators
Supplemental Arrows-B              U+2900–U+297F       Arrows, arrow-like operators
Misc. Mathematical Symbols-B       U+2980–U+29FF       Braces, symbols
Suppl. Mathematical Operators      U+2A00–U+2AFF       Operators
Misc. Symbols and Arrows           U+2B00–U+2BFF       Arrows, operators or symbols
Mathematical Alphanumeric Symbols  U+1D400–U+1D7FF     Variables and digits
Other blocks                                           Characters for occasional use

*This block contains non-mathematical characters as well.
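The ranges in Table 2.2 translate directly into a lookup table. The following minimal Python sketch transcribes them; the names MATH_BLOCKS and math_block are illustrative. Membership in one of these ranges does not by itself make a character mathematical, since several of the blocks also contain non-mathematical characters.

```python
# (first code point, last code point, block name), transcribed from Table 2.2.
MATH_BLOCKS = [
    (0x0021, 0x007E, "Basic Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x2100, 0x214F, "Letterlike Symbols"),
    (0x2190, 0x21FF, "Arrows"),
    (0x2200, 0x22FF, "Mathematical Operators"),
    (0x2300, 0x23FF, "Miscellaneous Technical Symbols"),
    (0x25A0, 0x25FF, "Geometrical Shapes"),
    (0x27C0, 0x27EF, "Misc. Mathematical Symbols-A"),
    (0x27F0, 0x27FF, "Supplemental Arrows-A"),
    (0x2900, 0x297F, "Supplemental Arrows-B"),
    (0x2980, 0x29FF, "Misc. Mathematical Symbols-B"),
    (0x2A00, 0x2AFF, "Suppl. Mathematical Operators"),
    (0x2B00, 0x2BFF, "Misc. Symbols and Arrows"),
    (0x1D400, 0x1D7FF, "Mathematical Alphanumeric Symbols"),
]

def math_block(ch):
    """Return the Table 2.2 block containing ch, or None for other blocks."""
    cp = ord(ch)
    for first, last, name in MATH_BLOCKS:
        if first <= cp <= last:
            return name
    return None
```

A character such as U+4E2D falls under "Other blocks" and yields None here.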

2.5  Duplicated Characters

Some Greek letters are encoded elsewhere as technical symbols. These include U+00B5 µ micro sign, U+2126 Ω ohm sign, and several characters among the APL functional symbols in the Miscellaneous Technical block. U+03A9 Ω greek letter capital omega is the canonical equivalent of U+2126 Ω and its use is preferred. Micro sign is included in several parts of ISO/IEC 8859, and therefore supported in many legacy environments where U+03BC μ greek letter small mu is not available. Implementations therefore need to be able to recognize the micro sign, even though mu is the preferred character in a Unicode context.

Latin letters duplicated include U+212A K kelvin sign and U+212B Å angstrom sign. As in the case of the ohm sign, the corresponding regular Latin letters are canonical equivalents, therefore their use is preferred.

The left and right angle brackets at U+2329 and U+232A have long been canonically equivalent with the CJK punctuation characters at U+3008 〈 and U+3009 〉. Canonical equivalence implies that the use of the latter code points is preferred and that not only U+3008 and U+3009 but also the characters U+2329 and U+232A are ‘wide’ characters. See Unicode Standard Annex #11, East Asian Width [EAW]. Unicode 3.2 added two new mathematical angle bracket characters (U+27E8 ⟨ and U+27E9 ⟩) that are unequivocally intended for mathematical use.
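These equivalences can be observed with Python's standard unicodedata module. In the sketch below the helper name preferred is illustrative; it applies NFC, which replaces each canonically equivalent duplicate with its preferred character. The micro sign is only compatibility equivalent to mu, so it survives NFC and has to be recognized separately.

```python
import unicodedata

def preferred(text):
    """Replace canonically equivalent duplicates by their preferred form."""
    return unicodedata.normalize("NFC", text)

OHM, KELVIN, ANGSTROM = "\u2126", "\u212A", "\u212B"
MICRO, MU = "\u00B5", "\u03BC"
OLD_BRACKETS = "\u2329\u232A"     # angle brackets in Miscellaneous Technical
MATH_BRACKETS = "\u27E8\u27E9"    # mathematical angle brackets

signs = preferred(OHM + KELVIN + ANGSTROM)
brackets = preferred(OLD_BRACKETS)
micro_nfc = preferred(MICRO)
micro_nfkc = unicodedata.normalize("NFKC", MICRO)
old_width = unicodedata.east_asian_width(OLD_BRACKETS[0])
math_width = unicodedata.east_asian_width(MATH_BRACKETS[0])
```

The mathematical angle brackets are unaffected by normalization and are not classified as wide, which is why they are the safer choice in formulas.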

2.6  Accented Characters

Mathematical characters are often enhanced via use of combining marks in the ranges U+0300..U+036F and the combining marks for symbols in the range U+20D0..U+20FF. These characters follow the base characters as in non-mathematical Unicode text. This section discusses these characters and preferred ways of representing accented characters in mathematical expressions. If a span of characters is enhanced by a combining mark, for example, a tilde over AB, typically some kind of higher-level markup is needed as is done in [MathML]. Unicode does include some combining marks that are designed to be used for pairs of characters, for example, U+0360..U+0362. However, their use for mathematical text is not encouraged.

For some mathematical characters, such as many negated relations, there are multiple ways of expressing the character: as precomposed or as a sequence of base character and combining mark (see also Section 2.17, Negations). Having only a single way to represent any given character would simplify recognizing the character in searches and other manipulations. Selecting a unique representation among multiple equivalent representations is called normalization. Unicode Standard Annex #15 Unicode Normalization Forms [Normalization] discusses the subject in detail; however, due to requirements of non-mathematical software, not all the normalization forms presented there are ideal from the perspective of mathematics.

Ideally, one always uses the shortest form of a math operator symbol wherever possible. So U+2260 ≠ should be used for the not equal sign instead of the combining sequence <U+003D, U+0338>. If a negated operator lacking a precomposed form is needed, U+0338 combining long solidus overlay or U+20D2 combining long vertical line overlay can be used to indicate negation. This approach concurs with Normalization Form C (NFC), which is also the preferred normalization form for use on the web.
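A negation helper that follows this rule can lean on NFC directly. In this minimal Python sketch, the function name negate is illustrative: it appends U+0338 and lets normalization pick the precomposed character where one exists.

```python
import unicodedata

LONG_SOLIDUS_OVERLAY = "\u0338"   # COMBINING LONG SOLIDUS OVERLAY

def negate(operator):
    """Return the shortest NFC representation of a negated operator."""
    return unicodedata.normalize("NFC", operator + LONG_SOLIDUS_OVERLAY)

not_equal = negate("=")             # composes to U+2260
not_element = negate("\u2208")      # composes to U+2209
no_precomposed = negate("\u2A7D")   # stays a two-code-point sequence
```

Operators without a precomposed negation, such as U+2A7D in the last line, come back unchanged as a base character plus overlay.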

On the other hand, for accented alphabetic characters used as variables, ideally only decomposed sequences are used, because mathematics uses a multitude of combining marks that greatly exceeds the predefined composed characters in Unicode. Accordingly, it is better to have the math display facility handle all of these cases uniformly to give a consistent look between characters that happen to have a fully composed Unicode character and those that do not. The combining character sequences also typically have semantics as a group, so it is useful to be able to manipulate and search for them individually without the need for special tables to decompose characters for this purpose. Since there are no precomposed math alphanumeric symbols, this approach concurs with Normalization Form C, except for the upright alphabetic characters (ASCII letters). 

To facilitate interchange on the web, accented characters should conform to NFC when interchanged. However, to achieve consistent results, a mathematical display system should transiently decompose any precomposed upright letters when used in mathematical expressions, and should use a single algorithm to place embellishments.
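One way to read this recommendation is sketched below in Python, using the standard unicodedata module. The function name decompose_letters is hypothetical, and treating every character of general category L as a variable is a simplifying assumption; a real display system would decide this from the structure of the expression.

```python
import unicodedata

def decompose_letters(text):
    """Decompose precomposed letters for display; keep operators composed."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if unicodedata.category(ch).startswith("L"):
            # Split an accented letter into base letter plus combining marks.
            out.append(unicodedata.normalize("NFD", ch))
        else:
            # Operators such as U+2260 keep their shortest (composed) form.
            out.append(ch)
    return "".join(out)

# Interchange form: upright x with dot above composes under NFC ...
upright = unicodedata.normalize("NFC", "x\u0307")
# ... but mathematical italic x with dot above has no precomposed form.
italic = unicodedata.normalize("NFC", "\U0001D465\u0307")
# Transient form handed to the accent-placement algorithm.
display = decompose_letters("\u1E8B\u2260y")
```

After this step, upright and styled variables carry their accents the same way, so a single placement algorithm can handle both.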

Normalization Form D (NFD) uses the opposite approach from NFC. It works naturally for mathematical use of alphabetic characters, but does not use the shortest encoding of math operator symbols, making it less attractive. The other two normalization forms, NFKC and NFKD, remove the distinction between math alphanumeric alphabets, mapping all of them to plain ASCII or Greek characters. As a result they would destroy the semantics of many mathematical expressions, and should never be used with mathematical texts.

2.7  Operators

The Mathematical Operators (U+2200–U+22FF) and Supplemental Mathematical Operators (U+2A00–U+2AFF) blocks contain many mathematical operators, relations, geometric symbols and other symbols with special usages confined largely to mathematical contexts. In addition to the characters in these blocks, mathematical operators are also found in the Basic Latin (ASCII) and Latin-1 Supplement blocks. A few of the symbols from the Miscellaneous Technical block and characters from General Punctuation are also used in mathematical notation. The allocation of any operator to a particular block is rarely significant.

Semantics. Mathematical operators often have more than one meaning in different subdisciplines or different contexts. For example, the "+" symbol normally denotes addition in a mathematical context, but might refer to concatenation in a computer science context dealing with strings, or incrementation, or have any number of other functions in given contexts. Therefore the Unicode Standard only encodes a single character for a single symbolic form. There are numerous other instances in which several semantic values can be attributed to the same Unicode value. For example, U+2218 ∘ ring operator may be the equivalent of white small circle or composite function or apl jot. The Unicode Standard does not attempt to distinguish all possible semantic values that may be applied to mathematical operators or relational symbols. It is up to the application or user to distinguish such meanings according to the appropriate context. Where information is available about the usage (or usages) of particular symbols, it has been indicated in the character annotations in the code charts printed in [Unicode] and in the online code charts [Charts].

Similar Glyphs. The Standard includes many characters that appear to be quite similar to one another, but that may convey different meaning in a given context. On the other hand, mathematical operators, and especially relation symbols, may appear in various standards, handbooks, and fonts with a large number of purely graphical variants. Where variants were recognizable as such from the sources, they were not encoded separately.

For relation symbols, the choice of a vertical or forward-slanting stroke typically seems to be an aesthetic one, but both slants might appear in a given context. However, a back-slanted stroke almost always has a distinct meaning compared to the forward-slanted stroke. See Section 2.18, Variation Selector for more information on some particular variants.

Unifications. Mathematical operators such as implies and if and only if have been unified with the corresponding arrows (U+21D2 ⇒ rightwards double arrow and U+2194 ↔ left right arrow, respectively) in the Arrows block.

The operator U+2208 ∈ element of is occasionally rendered with a taller shape than shown in the code charts. Mathematical handbooks and standards treat these characters as variants of the same glyph. U+220A ∊ small element of is a distinctively small version of the element of that originates in mathematical pi fonts.

The operators U+226B ≫ much greater-than and U+226A ≪ much less-than are sometimes rendered in a nested shape, but the Unicode Standard provides a single encoding for each operator.

A large class of unifications applies to variants of relation symbols involving equality, similarity, and/or negation. Variants involving one- or two-barred equal signs, one- or two-tilde similarity signs, and vertical or slanted negation slashes and negation slashes of different lengths are not separately encoded. Thus, for example, U+2288 ⊈ neither a subset of nor equal to, is the archetype for at least six different glyph variants noted in various collections.

In a few exceptional instances, essentially stylistic variants are separately encoded because of the need for round-trip character mapping to other standards that distinguish the two forms. Examples include U+2265 ≥ greater-than or equal to, which is distinguished from U+2267 ≧ greater-than over equal to; the same distinction applies to U+2264 ≤ less-than or equal to and U+2266 ≦ less-than over equal to.

Greek-Derived Operators. Several mathematical operators derived from Greek characters have been given separate encodings because they are used differently than the corresponding letters. These operators may occasionally occur in context with Greek-letter variables. They include U+2206 ∆ increment, U+220F ∏ n-ary product, and U+2211 ∑ n-ary summation. The latter two are large operators that take limits. Some typographical aspects of operators are discussed in Section 3.2, Classification by Typographical Behavior. For example, the n-ary operators are distinguished from letter variables by their larger size and the fact that they take limit expressions.

Minus sign. U+2212 − minus sign is the preferred representation of the unary and binary minus sign rather than the ASCII-derived U+002D - hyphen-minus, because minus sign is unambiguous and because it is rendered with a more desirable length, usually longer than a hyphen.
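Because the two characters are distinct code points, software that reads numbers from mathematical text may need to accept both. The following Python sketch illustrates one way to do so; the helper names `parse_signed_int` and `to_display` are hypothetical and are not part of any standard API.

```python
MINUS_SIGN = "\u2212"     # U+2212 MINUS SIGN
HYPHEN_MINUS = "\u002D"   # U+002D HYPHEN-MINUS

def parse_signed_int(text: str) -> int:
    """Parse an integer whose sign is written with either character."""
    # Map the preferred minus sign to the ASCII form that int() understands.
    return int(text.replace(MINUS_SIGN, HYPHEN_MINUS))

def to_display(text: str) -> str:
    """Replace hyphen-minus with minus sign in a purely numeric string."""
    return text.replace(HYPHEN_MINUS, MINUS_SIGN)

print(parse_signed_int("\u221242"))   # sign written as U+2212
print(parse_signed_int("-42"))        # sign written as U+002D
```

A blanket replacement such as `to_display` is only safe for strings known to contain nothing but numbers; in running text a hyphen-minus may be a genuine hyphen.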

Miscellaneous Symbols.  The symbol U+2205 ∅ empty set is distinct from the letters U+00D8 Ø and U+00F8 ø, even though historically derived from the letter forms. A widespread alternate symbol for the empty set is a slashed digit zero. This can be encoded as U+0030 digit zero followed by U+0338 combining long solidus overlay.

The range from U+22EE ⋮ to  U+22F1 ⋱ contains a set of ellipses used in matrix notation.

U+2023 ‣ TRIANGULAR BULLET and U+25B8 ▸ BLACK RIGHT-POINTING SMALL TRIANGLE are not intended to be distinct in appearance. For historical reasons these two are encoded separately and not made canonical equivalents of each other. U+25B8 ▸ is the preferred character.

2.8  Superscripts and Subscripts

The Superscripts and Subscripts block (U+2070..U+209F), together with U+00B2 ², U+00B3 ³, and U+00B9 ¹, contains a collection of superscript and subscript digits and punctuation that can be useful in mathematics. If they are used, it is recommended that they be displayed with the same font size as other subscripts and superscripts at the corresponding nested script level. For example, a² and a<super>2</super> should be displayed the same. However, these subscript/superscript characters are not used in MathML or TeX, and their use in XML documents for mathematical purposes is discouraged; see Unicode Technical Report #20, Unicode in XML and other Markup Languages [UXML]. Editors for these formats may offer facilities to convert these characters to regular characters plus markup.
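A conversion facility of the kind mentioned above could be sketched as follows in Python. The sketch relies on the compatibility decompositions in the Unicode Character Database, which tag these characters with `<super>` or `<sub>`. The function name `script_to_markup` and the choice to reuse the decomposition tag as the markup element name are assumptions made for illustration only.

```python
import unicodedata

def in_scope(ch: str) -> bool:
    """True for U+00B2, U+00B3, U+00B9 and the block U+2070..U+209F."""
    cp = ord(ch)
    return cp in (0x00B2, 0x00B3, 0x00B9) or 0x2070 <= cp <= 0x209F

def script_to_markup(text: str) -> str:
    """Replace superscript/subscript characters by a base character plus markup."""
    out = []
    for ch in text:
        # decomposition() returns e.g. '<super> 0032' for U+00B2
        parts = unicodedata.decomposition(ch).split()
        if in_scope(ch) and len(parts) == 2 and parts[0] in ("<super>", "<sub>"):
            tag = parts[0][1:-1]               # 'super' or 'sub'
            base = chr(int(parts[1], 16))      # the regular character
            out.append(f"<{tag}>{base}</{tag}>")
        else:
            out.append(ch)
    return "".join(out)

print(script_to_markup("a\u00B2"))
```

Each character is wrapped separately here; a production converter would probably merge adjacent runs into a single element.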

Parsing of Superscript and Subscript Digits. Unlike regular digits, the superscript and subscript digits have not been given the General Category property value Decimal_Number (Nd). This prevents expressions like 2³ from being interpreted as 23 by simplistic numeric parsers. More sophisticated numeric parsers, such as general mathematical expression parsers, can nevertheless choose to identify these compatibility superscript and subscript characters as digits and interpret them appropriately within their own scope.
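The distinction can be observed directly from the General Category values. The Python sketch below first shows the categories, then gives a deliberately small parser that treats trailing superscript digits as an exponent. The function `parse_power` and its interpretation of the input are assumptions for illustration, not a recommended grammar.

```python
import unicodedata

# Superscript digits 0..9, in order of their numeric value.
SUPERSCRIPT_DIGITS = "\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079"

def parse_power(text: str) -> tuple:
    """Read leading Nd digits as a base and trailing superscript digits as an exponent."""
    i = 0
    while i < len(text) and unicodedata.category(text[i]) == "Nd":
        i += 1
    base, tail = text[:i], text[i:]
    if not base or any(c not in SUPERSCRIPT_DIGITS for c in tail):
        raise ValueError(f"cannot parse {text!r}")
    exponent = int("".join(str(SUPERSCRIPT_DIGITS.index(c)) for c in tail)) if tail else 1
    return int(base), exponent

print(unicodedata.category("3"), unicodedata.category("\u00B3"))
print(parse_power("2\u00B3"))
```

A parser that only accepts characters of category Nd stops after the "2" in 2³, which is the behavior described above.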

2.9  Arrows

Arrows are used for a variety of purposes in mathematics and elsewhere, such as to imply directional relation, to show logical derivation or implication, and to represent the cursor control keys. Accordingly, Unicode includes a fairly extensive set of arrows (U+2190..U+21FF, U+27F0..U+27FF, U+2900..U+297F), many of which appear in mathematics. It does not attempt to encode every possible stylistic variant of arrows separately, especially where their use is mainly decorative. For most arrow variants, the Unicode Standard provides encodings in the two horizontal directions, often in the four cardinal directions. For the single and double arrows, the Unicode Standard provides encodings in eight directions.

Unifications. Arrows expressing mathematical relations have been encoded in the Arrows block as well as in Supplemental Arrows-A and Supplemental Arrows-B. An example is U+21D2 ⇒ rightwards double arrow, which may be used to denote implies. Where available, such usage information is indicated in the annotations to individual characters in the Unicode Standard 5.0 [U5.0], Chapter 17, Code Charts, and in the online code charts [Charts].

Long Arrows. The long arrows encoded in the range U+27F5..U+27FF map to standard SGML entity sets supported by MathML. Long arrows represent distinct semantics from their short counterparts, rather than mere stylistic glyph differences. For example, the shorter forms of arrows are often used in connection with limits, whereas the longer ones are associated with mappings. The use of the long arrows is so common that they were assigned entity names in the ISOAMSA entity set, one of the suite of mathematical symbol entity sets covered by the Unicode Standard.

2.10 Delimiters

The mathematical white square brackets, angle brackets, and double angle brackets encoded at U+27E6..U+27EB are intended for ordinary use of these particular bracket types. They are unambiguously narrow, for use in mathematical and scientific notation, and should be distinguished from the corresponding wide forms of white square brackets, angle brackets, and double angle brackets used in CJK typography. (See the CJK Symbols and Punctuation block.) 

However, the set of lenticular and tortoise-shell brackets in the CJK Symbols and Punctuation block has not been duplicated because mathematical use has not yet been demonstrated. Fonts containing 'wide glyphs' for these characters, that is, glyphs that include white space padding, are unsuitable for mathematical or other non-CJK use.

Deprecated Delimiters. The angle brackets formerly aliased as "bra" and "ket", U+2329 〈 left-pointing angle bracket and U+232A 〉 right-pointing angle bracket, are now deprecated for use with mathematics because their canonical equivalence to CJK angle brackets is likely to result in unintended spacing problems when used in mathematical formulae.
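The effect of the canonical equivalence can be seen by normalizing a string that contains the deprecated characters. In the Python sketch below, the mapping table and the function `modernize_angle_brackets` are hypothetical; they simply substitute the mathematical angle brackets U+27E8 and U+27E9 from the range discussed above.

```python
import unicodedata

# Hypothetical replacement table: deprecated angle brackets -> mathematical angle brackets.
MATH_ANGLE_BRACKETS = {0x2329: 0x27E8, 0x232A: 0x27E9}

def modernize_angle_brackets(text: str) -> str:
    """Replace U+2329/U+232A by U+27E8/U+27E9."""
    return text.translate(MATH_ANGLE_BRACKETS)

deprecated = "\u2329x\u232A"
after_nfc = unicodedata.normalize("NFC", deprecated)

# Normalization silently turns the deprecated brackets into CJK angle brackets.
print([f"U+{ord(c):04X}" for c in after_nfc])

# The mathematical angle brackets are unaffected by normalization.
safe = modernize_angle_brackets(deprecated)
print(unicodedata.normalize("NFC", safe) == safe)
```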

Horizontal Delimiters. Delimiters are often used horizontally, where they expand to the width of the expression they encompass, as in this example from [TeX].

overbrace example

By providing character codes for these delimiters, mathematical layout systems can be designed so that both regular and horizontal delimiters are encoded as characters, with markup designating the scope where necessary. When the horizontal mathematical brackets are used, all other letters, symbols and digits remain upright as illustrated in the example above. Table 2.3 lists the Unicode characters for horizontal delimiters.

Table 2.3 : Horizontal Delimiters

Code Description
23B4 TOP SQUARE BRACKET
23B5 BOTTOM SQUARE BRACKET
23DC TOP PARENTHESIS
23DD BOTTOM PARENTHESIS
23DE TOP CURLY BRACKET
23DF BOTTOM CURLY BRACKET
23E0 TOP TORTOISE SHELL BRACKET

Use of horizontal delimiters is different from horizontal display of delimiters in vertical layout of East Asian text, where ideographic characters remain upright, but non-ideographic characters (letters, digits) are rotated 90°. For example, the parentheses in the vertical text in the figure to the right have very different rendering from the under/overbrace examples above.

CJK parens in vertical text example

The CJK Compatibility Forms U+FE35 ︵ through U+FE39 ︹ have shapes that are superficially similar to the horizontal delimiters, but these characters are not mathematical and have quite different rendering requirements. They are encoded for compatibility with character sets that use explicit character codes for the vertical glyph variants of punctuation characters. Like other CJK punctuation, CJK Compatibility Forms have the [EAW] property of W (wide) and are typically implemented in one half of an EM square, with the other half empty. Layout algorithms using these characters predict the empty half cell based on the character code, and reduce intercharacter spacing accordingly in some circumstances.
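The difference in the East Asian Width property can be checked programmatically. The Python sketch below compares one CJK Compatibility Form with the horizontal mathematical delimiter of similar shape; the helper name `is_wide` is an assumption for illustration.

```python
import unicodedata

def is_wide(ch: str) -> bool:
    """True if the character has East_Asian_Width W (wide)."""
    return unicodedata.east_asian_width(ch) == "W"

cjk_form = "\uFE35"        # PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS
math_delimiter = "\u23DC"  # TOP PARENTHESIS

for ch in (cjk_form, math_delimiter):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.east_asian_width(ch))
```

A layout engine could use such a test to avoid applying CJK spacing adjustments to the mathematical delimiters.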

2.11  Geometrical Shapes

The basic geometric shapes (circle, square, triangle, diamond, and lozenge) are used for a variety of purposes in mathematical texts. Because their shapes are distinct and they are easily available in multiple sizes from a variety of widely available fonts, they are also often used in an ad-hoc manner. In Unicode they are encoded in the Geometric Shapes, Miscellaneous Technical, Block Elements, Miscellaneous Symbols and Miscellaneous Symbols and Arrows blocks as shown in Table 2.4.

Ideal Sizes. Mathematical usage requires at least four distinct sizes of simple shapes, and sometimes more. The size gradation must allow each size to be recognized, even when it occurs in isolation. In other words, shapes of the same size should ideally have roughly the same visual "impact" as opposed to same nominal height or width. The shapes shown here for a given size all have the same area.

For mathematical usage simple shapes ideally share a common center. The following diagram shows the ideal size relationship across shapes of the same nominal size.

size relations

The precise sizes and shapes chosen, however, are a matter for the font designer. Note that neither the current set of representative glyphs in the standard nor the glyphs from many commonly available non-mathematical fonts achieve the ideals set forth here.

Note to reviewers: In a previous review cycle, a reanalysis of this material was proposed in document L2/06-034 (accessible to Unicode members). This has been reviewed by the authors in consultation with other mathematical experts. The conclusion is to request additions to the repertoire. Table 2.4 has been updated accordingly, with the proposed additions to the repertoire indicated as shapes, but with dummy code points.

Suggested Sizes [proposed]. The sizes of existing characters and their names as shown in the code charts are not always consistent. The suggested sizes here correspond to a geometric progression where for each size all characters have the same visual impact. Shapes for which only one of the columns with  a "default" size exists can be implemented either as regular or medium size. The former is shown here, the latter may be more suitable for mathematical work.   Table 2.4 summarizes the available sizes for a given symbol. 

Table 2.4 Sizes of Simple Shapes [proposed]

Shape | tiny | very small | small (bullet) | medium small | medium (default1) | regular (default2) | large
triangle left | | | 25C2, 25C3 | | | 25C0, 25C1 |
triangle right | | | 25B8, 2023, 25B9 | | | 25B6, 25B7 |
triangle up | | | 25B4, 25B5 | | | 25B2, 25B3 |
triangle down | | | 25BE, 25BF | | | 25BC, 25BD |
square | | 2Bxx (black very small square), 2Bxx (white very small square) | 25AA, 25AB | 25FD, 25FE | 25FC, 25FB | 25A0, 25A1 | 2Bxx (black large square), 2Bxx (white large square)
diamond | | | 2Bxx (black small diamond), 22C4 | 2Bxx (black medium small diamond), 2Bxx (white medium small diamond) | | 25C6, 25C7 |
lozenge | | | 2Bxx (black small lozenge), 2Bxx (white small lozenge) | 2Bxx (black medium small lozenge), 2Bxx (white medium small lozenge) | | 29EB, 25CA |
pentagon | | | | | | 2Bxx (black pentagon), 2B20 |
pentagon right | | | | | | 2Bxx (black right-pointing pentagon), 2Bxx (white right-pointing pentagon) |
hexagon horizontal | | | | | | 2B23, 2394 |
hexagon vertical | | | | | | 2B22, 2B21 |
arabic star | | | 066D, 2Bxx (small star) | 22C6, 2Bxx (medium small star) | 2605, 2606 | |
ellipse horizontal | | | | | | 2Bxx (horizontal black ellipse), 2Bxx (horizontal white ellipse) |
ellipse vertical | | | | | | 2Bxx (vertical black ellipse), 2Bxx (vertical white ellipse) |
circle | 22C5 | 2219, 00B7, 2218 | 2022, 25E6 | 2981, 26AC | 26AB, 26AA | 25CF, 25CB | 2Bxx (black large circle), 25EF
circled circles | 2299 | 2609 | 233E | | | |
circled circles | 2A00 | 29BF, 229A | 29BE | 25C9, 25CE | | |

Most simple geometrical shapes exist in both black and outline (white) form in a single default size. The default size as shown in the code charts would be in the column marked "regular", while for many font implementations, a size corresponding to the column marked "medium" is chosen. As it is difficult to distinguish higher-order polygons at smaller sizes, size distinctions for these shapes are less useful for notational purposes. Triangles exist in two sizes, a default size and a small, bullet size. Lozenges and diamonds exist in a default size, an interim size, and a bullet size. Squares and circles exist in black and white in all sizes from very small to large. There is also a tiny circle, essentially a centered dot. At the tiny size, distinction between different shapes, or black and outline forms, becomes impossible.

Arrangement in Code Space. For circles in particular, but also for lozenges, diamonds and stars, the white and black forms are not encoded under matching names or close together. The series of circled circles is also distributed across the Unicode code space.

Sizes of Derived Shapes. Circled and squared operators and similar derived shapes are more constrained in their usage than "plain" geometric shapes. They tend to occur in two generic sizes based on function: a smaller size for binary operators and large size for n-ary operators. Other than circled circles, they are not shown here. Circled circles come in two series, based on the size of the enclosing circle.

Orientation. Some geometric shapes can exist in more than one orientation. For triangles, the Unicode Standard encodes the four principal directions. Ovals, pentagons and hexagons exist in two orientations; U+2394 ⎔ SOFTWARE-FUNCTION SYMBOL can be used as a horizontal white hexagon. The choice of right-pointing pentagon is based on its use as an avatar of the unit pentagon on the complex plane. Generic use in geometry would use the upright orientation.

Positioning. For a mathematical font, the centerline should go through the middle of a parenthesis, which should go from bottom of descender to top of ascender. This is the same level as the minus or the middle of the plus and equal signs. For correct positioning, the glyph will descend below the baseline for the larger sizes of the basic shapes as in the following schematic diagram:

centerline alignment

The standard triangles used for mathematics are also center aligned. This differs from the positioning for the representative glyphs shown in the charts, which are often based on existing non-mathematical fonts. Therefore, mathematical fonts may need to deviate in positioning of these triangles.

2.12  Other Symbols

Other symbols used in mathematics are contained in the Miscellaneous Technical block (U+2300..U+23FF), the Geometric Shapes block (U+25A0..U+25FF), the Miscellaneous Symbols block (U+2600..U+26FF), and the General Punctuation block (U+2000..U+206F).

Generally any easily recognized and distinct symbol is fair game for mathematicians faced with the need of creating notations for new fields of mathematics. For example, the card suits, U+2665 ♥ black heart suit, U+2660 ♠ black spade suit, etc. can be found as operators and as subscripts.

2.13  Symbol Pieces

The characters from the Miscellaneous Technical block in the range U+239B—U+23B3, plus U+23B7, comprise a set of bracket and other symbol fragments for use in mathematical typesetting. These pieces originated in older font standards, but have been used in past mathematical processing as characters in their own right to assemble extra-tall glyphs for enclosing multi-line mathematical formulae. Mathematical fences are ordinarily sized to the content that they enclose. However, in creating a large fence, the glyph is not scaled proportionally; in particular the displayed stem weights must remain compatible with the accompanying smaller characters. Thus, simple scaling of font outlines cannot be used to create tall brackets. Instead, a common technique is to build up the symbol from pieces. In particular, the characters U+239B LEFT PARENTHESIS UPPER HOOK through U+23B3 SUMMATION BOTTOM represent a set of glyph pieces for building up large versions of the fences (, ), [, ], {, and }, and of the large operators ∑ and ∫. These brace and operator pieces are compatibility characters. They should not be used in stored mathematical text, but are often used in the data stream created by display and print drivers.

Table 2.5 shows which pieces are intended to be used together to create specific symbols.

Table 2.5 Use of Symbol Pieces

 

Symbol | 2-row | 3-row | 5-row
Summation | 23B2, 23B3 | |
Integral | 2320, 2321 | 2320, 23AE, 2321 | 2320, 3×23AE, 2321
Left Parenthesis | 239B, 239D | 239B, 239C, 239D | 239B, 3×239C, 239D
Right Parenthesis | 239E, 23A0 | 239E, 239F, 23A0 | 239E, 3×239F, 23A0
Left Bracket | 23A1, 23A3 | 23A1, 23A2, 23A3 | 23A1, 3×23A2, 23A3
Right Bracket | 23A4, 23A6 | 23A4, 23A5, 23A6 | 23A4, 3×23A5, 23A6
Left Brace | 23B0, 23B1 | 23A7, 23A8, 23A9 | 23A7, 23AA, 23A8, 23AA, 23A9
Right Brace | 23B1, 23B0 | 23AB, 23AC, 23AD | 23AB, 23AA, 23AC, 23AA, 23AD

For example, an instance of U+239B can be positioned relative to instances of U+239C and U+239D to form an extra-tall (three or more line) flattened left parenthesis. The center sections are meant to be used only with the top and bottom pieces encoded adjacent to them, since the segments are usually graphically constructed within the fonts so that they match perfectly when positioned at the same x coordinates.
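The assembly rule for parentheses and brackets in Table 2.5 (one top piece, any number of extension pieces, one bottom piece) can be sketched in Python as follows. The function `build_fence` and the dictionary `PIECES` are hypothetical names. Braces are omitted because they also need a middle piece. As noted above, such sequences belong in the output of display or print drivers, not in stored mathematical text.

```python
# (top, extension, bottom) pieces for each fence, following Table 2.5.
PIECES = {
    "left parenthesis": ("\u239B", "\u239C", "\u239D"),
    "right parenthesis": ("\u239E", "\u239F", "\u23A0"),
    "left bracket": ("\u23A1", "\u23A2", "\u23A3"),
    "right bracket": ("\u23A4", "\u23A5", "\u23A6"),
}

def build_fence(kind: str, rows: int) -> list:
    """Return the pieces, top to bottom, for a fence that is `rows` lines tall."""
    if rows < 2:
        raise ValueError("a built-up fence needs at least two rows")
    top, extension, bottom = PIECES[kind]
    return [top] + [extension] * (rows - 2) + [bottom]

# One piece per output line, all at the same horizontal position.
print("\n".join(build_fence("left parenthesis", 5)))
```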

2.14  Invisible Operators

In mathematics some operators or punctuation are often implied, but not displayed. This poses few problems to the human reader, as the meaning is usually clear from context. However, machine interpretation of mathematical expressions may need the intent to be made more explicit. To support this without altering the appearance of the equation when displayed, the Unicode Standard provides several invisible operators that can be used to unambiguously denote the intent whenever an operator is implied, or more importantly when more than one operator could be implied. Use of invisible operators is optional; they are intended for interchange with math-aware programs.

Invisible Separator. U+2063 invisible separator or invisible comma is intended for use in index expressions and other mathematical notation where two adjacent variables form a list and are not implicitly multiplied. In mathematical notation, commas are not always explicitly present, but need to be indicated for symbolic calculation software to help it disambiguate a sequence from a multiplication. For example, the double subscript ij in the variable a_ij means a_{i,j}; that is, the i and j are separate indices and not a single variable with the name ij or even the product of i and j. Accordingly, to represent the implied list separation in the subscript ij one can insert a non-displaying invisible separator between the i and the j. In addition, use of the invisible comma would hint to a math layout program to set a small space between the variables.

Invisible Multiplication. Similarly, an expression like mc² implies that the mass m multiplies the square of the speed c. To unambiguously represent the implied multiplication in mc², one inserts a non-displaying U+2062 invisible times between the m and the c. Another example is the expression f_ij(cos(ab)), which means the same as f_{i,j}(cos(a×b)), where × is used here to represent multiplication, not the cross product. Note that the spacing between characters may also depend on whether the adjacent variables are part of a list or are to be concatenated, that is, multiplied.

Invisible Function Application. U+2061 FUNCTION APPLICATION is used for an implied function dependence as in f(x + y). To indicate that this is the function f of the quantity x + y and not the expression fx + fy, one can insert the non-displaying function application symbol between the f and the left parenthesis.
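The three encoded invisible operators described above can be inserted like any other character. The Python sketch below builds the examples from the text and then makes the invisible operators visible for inspection. The function `reveal` and the stand-in tokens it prints are assumptions chosen for illustration.

```python
import unicodedata

FUNCTION_APPLICATION = "\u2061"
INVISIBLE_TIMES = "\u2062"
INVISIBLE_SEPARATOR = "\u2063"

# Hypothetical visible stand-ins, useful only for debugging output.
VISIBLE = {
    FUNCTION_APPLICATION: "<apply>",
    INVISIBLE_TIMES: "*",
    INVISIBLE_SEPARATOR: ",",
}

def reveal(text: str) -> str:
    """Replace each invisible operator by a visible stand-in."""
    return "".join(VISIBLE.get(ch, ch) for ch in text)

subscript = "i" + INVISIBLE_SEPARATOR + "j"          # indices of a_ij as a list
product = "m" + INVISIBLE_TIMES + "c\u00B2"          # m times c squared
call = "f" + FUNCTION_APPLICATION + "(x + y)"        # f applied to (x + y)

for s in (subscript, product, call):
    print(reveal(s))

# All three are format characters (General Category Cf).
for ch in VISIBLE:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

Because the operators are ordinary characters in the string, `len(subscript)` is 3 even though only two characters are displayed.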

Invisible Plus [proposed]. The final member of this set of invisible operators is a proposed invisible plus operator character. Like the others, it denotes the implied intent of a juxtaposition in uses where it is not possible to rely on a human reader to disambiguate. It would make it possible to unambiguously represent expressions like 3½, which occur frequently in school or engineering texts. Not having an operator at all would imply multiplication, as in the example

3 abc/d

where the 3 represents a factor multiplying the following fraction.

2.15  Fraction Slash

U+2044 ⁄ fraction slash is used to build up simple fractions in running text. It applies to the immediately adjacent sequences of decimal digits, that is, characters with General_Category=Nd. General mathematical use needs a more general method for the layout of fractions; however, parsers of mathematical texts should be prepared to handle fraction slash when it is received from other sources.
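A parser that accepts fraction slash in incoming text could be sketched as follows in Python. The function name `parse_simple_fractions` is hypothetical. The sketch relies on the fact that `\d` in a Python `str` pattern matches characters of General Category Nd, which is the set of digits to which fraction slash applies.

```python
import re
from fractions import Fraction

FRACTION_SLASH = "\u2044"

# One or more decimal digits (Nd) on each side of the fraction slash.
_SIMPLE_FRACTION = re.compile(r"(\d+)" + FRACTION_SLASH + r"(\d+)")

def parse_simple_fractions(text: str) -> list:
    """Return every digits-slash-digits sequence in `text` as a Fraction."""
    # Note: a zero denominator raises ZeroDivisionError, as Fraction() does.
    return [Fraction(int(n), int(d)) for n, d in _SIMPLE_FRACTION.findall(text)]

print(parse_simple_fractions("add 3\u20444 cup of water"))
```

Superscript and subscript digits are not of category Nd, so a sequence such as U+00B9 U+2044 U+2082 is not matched by this pattern.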

2.16  Other Characters

All remaining Unicode characters may appear in mathematical expressions, typically in spelled-out names for variables in fractions or simple formulae, but they most commonly appear in ordinary text. An English example is the equation

distance = rate × time,

which uses ordinary ASCII letters to aid in recognizing sequences of letters as words instead of products of individual symbols. Such usage corresponds to identifiers as discussed elsewhere in this report.

2.17  Negations

Many negated forms, particularly of relations, can be encoded by using the base symbol, together with a combining overlay. Occasionally, both a vertical and a slanted negation are used; which one is often a matter of style. Sometimes the negation is only indicated for part of a symbol. In these cases, the negated relations are encoded directly, and variants can be accessed via the variation selector method described in the next section.
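How a base symbol plus overlay relates to the directly encoded negated relation depends on the overlay. The Python sketch below, offered as an illustration only, compares U+0338 combining long solidus overlay, which takes part in canonical composition, with U+20D2 combining long vertical line overlay, which does not.

```python
import unicodedata

ELEMENT_OF = "\u2208"
NOT_AN_ELEMENT_OF = "\u2209"
SOLIDUS_OVERLAY = "\u0338"         # COMBINING LONG SOLIDUS OVERLAY
VERTICAL_LINE_OVERLAY = "\u20D2"   # COMBINING LONG VERTICAL LINE OVERLAY

# Base + slanted overlay is canonically equivalent to the encoded negation.
slanted = unicodedata.normalize("NFC", ELEMENT_OF + SOLIDUS_OVERLAY)
print(slanted == NOT_AN_ELEMENT_OF)

# Base + vertical overlay has no precomposed form and stays a two-character sequence.
vertical = unicodedata.normalize("NFC", ELEMENT_OF + VERTICAL_LINE_OVERLAY)
print([f"U+{ord(c):04X}" for c in vertical])

# The same holds in the other direction for the not-equal sign.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\u2260")])
```

Software that compares negated relations therefore cannot rely on normalization alone to identify the vertical-stroke variants with the standard symbols.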

Table 2.6 lists the currently encoded negated mathematical relations for which a variant can be realized via composition, by using U+20D2 combining long vertical line overlay together with a base character. In the table, the part of the description in small caps is the character name of the corresponding standard character; the part in lowercase indicates the variation in appearance.

Table 2.6 Negated Relations Using Vertical Line Overlay

Std Symbol Alternate Symbol Description of alternate symbol
U+2209 2209 U+2208,U+20D2 2208,20D2 not an element of with vertical stroke
U+220C 220C U+220B,U+20D2 220B,20D2 does not contain as member with vertical stroke
U+2241 2241 U+223C,U+20D2 223C,20D2 not tilde with vertical stroke
U+2244 2244 U+2243,U+20D2 2243,20D2 not asymptotically equal to with vertical stroke
U+2247 2247 U+2245,U+20D2 2245,20D2 neither approximately nor actually equal to with vertical stroke
U+2249 2249 U+2248,U+20D2 2248,20D2 not almost equal to with vertical stroke
U+2260 2260 U+003D,U+20D2 003D,20D2 not equal to with vertical stroke
U+2262 2262 U+2261,U+20D2 2261,20D2 not identical to with vertical stroke
U+226D 226D