L2/99-195
DATE: 19990902
DOC TYPE: 
Expert contribution 
TITLE: 
Proposal to encode mathematical alphanumeric symbols 
SOURCE: 
Murray Sargent III, Barbara Beeton 
PROJECT: 

STATUS: 
Proposal 
ACTION ID: 
FYI 
DUE DATE: 
 
DISTRIBUTION: 
Worldwide 
MEDIUM: 
Paper and html 
NO. OF PAGES: 
5 
A. Administrative 

1. Title 
Proposal to encode mathematical alphanumeric symbols 
2. Requester's name 
Murray Sargent III, Barbara Beeton 
3. Requester type 
Expert request. 
4. Submission date 
199999 
5. Requester’s reference 
Scientific and Technical Information Exchange (STIX) 
6a. Completion 
Complete proposal 
6b. More information to be provided? 
If requested 
B. Technical - General

1a. New script? Name? 
No. 
1b. Addition of characters to existing block? Name? 
No. 
2. Number of characters 
991 new alphanumeric symbols 
3. Proposed category 

4. Proposed level of implementation and rationale 
Level 1 
5a. Character names included in proposal? 
Yes 
5b. Character names in accordance with guidelines? 
Yes 
5c. Character shapes reviewable? 

6a. Who will provide computerized font? 
None needed (they already exist) 
6b. Font currently available? 
None needed (standard fonts are adequate) 
6c. Font format? 
na 
7a. Are references (to other character sets, dictionaries, descriptive texts, etc.) provided? 
Yes. 
7b. Are published examples (such as samples from newspapers, magazines, or other sources) of use of proposed characters attached? 
Not attached, but available. 
8. Does the proposal address other aspects of character data processing? 
No 
C. Technical - Justification

1. Contact with the user community? 
Yes. Patrick Ion, Barbara Beeton, Murray Sargent III, MathML W3C Math Working Group 
2. Information on the user community? 
Professional mathematicians, physicists, astronomers, engineers, and other scientific and technical researchers. 
3a. The context of use for the proposed characters? 
Used in publication of research mathematics and other hard sciences. 
3b. Reference 

4a. Proposed characters in current use? 
Yes 
4b. Where? 
Worldwide, by scientific and technical publishers and other users of mathematics 
5a. Characters should be encoded entirely in BMP? 
No, entirely in plane 1. 
5b. Rationale 
Accurate publication of mathematical and scientific research on the Web is impossible without a comprehensive and accurate collection of symbols including various alphabetic variants in common use. 
6. Should characters be kept in a continuous range? 
Yes, in order to fit in one 1024-character surrogate block.
7a. Can the characters be considered a presentation form of an existing character or character sequence? 
No. A given alphabetic symbol has different semantics when its style is changed and should not be found by the same plain-text search string.
7b. Where? 

7c. Reference 

8a. Can any of the characters be considered to be similar (in appearance or function) to an existing character? 
Some letterlike symbols look similar to corresponding characters in some alphabets, e.g., some capital script letters. These are left as holes in the proposed code assignments.
8b. Where? 
Letterlike symbols 
8c. Reference 

9a. Combining characters or use of composite sequences included? 
No 
9b. List of composite sequences and their corresponding glyph images provided? 
No 
10. Characters with any special properties such as control function, etc. included? 
No 
D. SC2/WG2 Administrative - To be completed by SC2/WG2

1. Relevant SC 2/WG 2 document numbers: 

2. Status (list of meeting number and corresponding action or disposition) 

3. Additional contact to user communities, liaison organizations etc. 

4. Assigned category and assigned priority/time frame 

Other Comments 

Mathematics has need for a number of Latin and Greek alphabets that on first thought appear to be font variations of one another, e.g., normal, bold, italic and script H. However, in any given document, these characters have distinct mathematical semantics. For example, a normal H represents a different variable from a bold H, etc. If one drops these distinctions in plain text, one gets gibberish. Instead of the well-known Hamiltonian formula
ℋ = ∫dτ(εE² + μH²),
you’d get the integral equation (!)
H = ∫dτ(εE² + μH²).
Accordingly, the STIX project requests adding normal, bold, italic, script, etc., Latin and Greek alphabets. Straight encoding leads to 991 characters and loses some useful common information; for example, the variants of H might not all be trivially recognizable as H’s. But it does allow plain text to retain the proper character semantics, and it allows simple (non-rich) search methods to work.
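The effect on plain-text search can be illustrated with a minimal Python sketch. It assumes that the Hamiltonian is written with the existing Letterlike Symbols character U+210B (SCRIPT CAPITAL H); the variable names are illustrative only and are not part of the proposal.

```python
# Hamiltonian formula with the script H kept distinct from the field H.
styled = "\u210b = \u222bd\u03c4(\u03b5E\u00b2 + \u03bcH\u00b2)"

# The same formula after the style distinction has been dropped.
flattened = styled.replace("\u210b", "H")

# A plain search for the field variable H finds one match in the styled text,
# but two in the flattened text, where the Hamiltonian can no longer be told apart.
styled_hits = styled.count("H")
flattened_hits = flattened.count("H")

print(styled_hits, flattened_hits)
```

With distinct code points, a simple substring search separates the two variables without any rich-text information.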
Math style                  Characters                       Count  Proposed name prefix
bold                        a-z, A-Z, 0-9, α-ω, Α-Ω (Greek)  120    MATH BOLD
italic                      a-z, A-Z, α-ω, Α-Ω (Greek)       110    MATH ITALIC
bold italic                 a-z, A-Z, α-ω, Α-Ω (Greek)       110    MATH BOLD ITALIC
script (calligraphic)       a-z, A-Z                         52     MATH SCRIPT
bold script (calligraphic)  a-z, A-Z                         52     MATH SCRIPT BOLD
fraktur                     a-z, A-Z                         52     MATH FRAKTUR
bold fraktur                a-z, A-Z                         52     MATH BOLD FRAKTUR
openface                    a-z, A-Z, 0-9                    62     MATH OPENFACE
sans-serif                  a-z, A-Z, 0-9                    62     MATH SANS
sans-serif bold             a-z, A-Z, 0-9, α-ω, Α-Ω (Greek)  120    MATH SANS BOLD
sans-serif italic           a-z, A-Z                         52     MATH SANS ITALIC
sans-serif bold italic      a-z, A-Z, α-ω, Α-Ω (Greek)       110    MATH SANS BOLD ITALIC
monospace                   a-z, A-Z, 0-9                    62     MATH MONOSPACE
Note that the choice of which normal, script, fraktur, openface, sans-serif, or monospace fonts are used is beyond the scope of plain text. The uppercase Greek letters Α-Ω are defined by the Unicode Greek character range U+0391 through U+03A9 plus the nabla ∇ (U+2207). The lowercase letters α-ω are defined by the Unicode Greek character range U+03B1 through U+03C9 plus the partial differential sign ∂ (U+2202) and the six glyph variants of ε, θ, κ, φ, ρ, and π, given by (new BMP code that resembles U+220A), U+03D1, U+03F0, U+03D5, U+03F1, and U+03D6 (since both glyphs for each of these can appear in the same document with different semantics). The uppercase position U+03A2, corresponding to the final sigma ς, is used for the uppercase Θ variant, which looks like the usual Θ except that the “H” in the middle is replaced by a “”. This gives 25+1 uppercase Greek characters and 25+7 lowercase characters.
In addition, corresponding characters in the BMP are used for upright serifed characters when they occur in mathematical expressions.
Outright encoding stores the alphabets in plane 1 for a total of 991 characters as currently entered. No accented characters are included; accented mathematical symbols are always represented by combining character sequences. These characters fit into 1024 code positions (addressable using one high-surrogate value) by using the following scheme (one column is one hexadecade, i.e. 16 code positions):
Character group           Total number  Layout                                                         # of columns
13 Latin alphabets        676           42 full columns + 4 chars                                      42+ columns
5 Greek alphabets         290           18 full columns + 2 chars                                      18+ columns
5 sets of digits          50            3 full columns + 2 chars, ending at end of 1024-char block     3+ columns
Exclusions (below)        25
3 groups - 25 exclusions  991                                                                          64 columns
This is a total of 64 columns, i.e. a block of 1024 code positions (one surrogate block). The suggested block is D400…D7FF on plane 1. Please see the proposed explicit code assignments at the end of this document.
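The totals above can be rechecked with a short Python sketch. The per-style repertoire flags are transcribed from the style table, and the Greek count of 58 uses the 25+1 uppercase and 25+7 lowercase figures given earlier; the constant and variable names are illustrative assumptions, not part of the proposal.

```python
LATIN = 26 + 26                 # a-z and A-Z
DIGITS = 10                     # 0-9
GREEK = (25 + 1) + (25 + 7)     # uppercase + nabla, lowercase + partial + 6 variants

# (style, has_digits, has_greek), transcribed from the style table.
STYLES = [
    ("bold", True, True),
    ("italic", False, True),
    ("bold italic", False, True),
    ("script", False, False),
    ("bold script", False, False),
    ("fraktur", False, False),
    ("bold fraktur", False, False),
    ("openface", True, False),
    ("sans-serif", True, False),
    ("sans-serif bold", True, True),
    ("sans-serif italic", False, False),
    ("sans-serif bold italic", False, True),
    ("monospace", True, False),
]
EXCLUSIONS = 25   # positions left reserved (already in Letterlike Symbols)
COLUMN = 16       # code positions per column

latin_total = LATIN * len(STYLES)
greek_total = GREEK * sum(has_greek for _, _, has_greek in STYLES)
digit_total = DIGITS * sum(has_digits for _, has_digits, _ in STYLES)

positions = latin_total + greek_total + digit_total
assigned = positions - EXCLUSIONS
columns = -(-positions // COLUMN)   # ceiling division

print(latin_total, greek_total, digit_total, positions, assigned, columns)
```

The 1016 laid-out positions round up to 64 columns, so the whole repertoire, including the reserved holes, fits in a single 1024-position block.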
An alternative approach to separate encoding of each style of alphanumeric mathematical alphabet was given serious consideration (see L2/99-188). That approach consisted of encoding a math style variant tag for each significant difference in alphabets. However, while technically feasible for the representation of the required mathematical alphabets, the encoding of such math style variant tags raised unanswerable questions regarding what would happen if they were applied outside their intended mathematical domain, for example, to accented Latin letters, or even to other scripts such as Han characters. This approach was also too close to the introduction of explicit stylistic markup into the character encoding, a step that the character encoding committees generally agree would be undesirable.
The proposed block title name is “Mathematical Alphabets”. The character names are those used for the corresponding characters in the BMP with the proposed prefixes given in the table above, but simplified as in the Letterlike Symbols block. For example, the script H on the left-hand side of the Hamiltonian formula above has the proposed name “MATH SCRIPT CAPITAL H”. The code position for this particular character is marked as reserved, since the character already exists in the Letterlike Symbols block with the name “SCRIPT CAPITAL H”. This and other such reserved code positions are listed in the next section.
The code positions for the following characters should be left <reserved> (unassigned), since these characters already appear in the Letterlike Symbols block:
Math character            Letterlike Symbol character  Code
MATH OPENFACE CAPITAL C   DOUBLE-STRUCK CAPITAL C      2102
MATH SCRIPT SMALL G       SCRIPT SMALL G               210A
MATH SCRIPT CAPITAL H     SCRIPT CAPITAL H             210B
MATH FRAKTUR CAPITAL H    BLACK-LETTER CAPITAL H       210C
MATH OPENFACE CAPITAL H   DOUBLE-STRUCK CAPITAL H      210D
MATH ITALIC SMALL H       PLANCK CONSTANT              210E
MATH SCRIPT CAPITAL I     SCRIPT CAPITAL I             2110
MATH FRAKTUR CAPITAL I    BLACK-LETTER CAPITAL I       2111
MATH SCRIPT CAPITAL L     SCRIPT CAPITAL L             2112
MATH SCRIPT SMALL L       SCRIPT SMALL L               2113
MATH OPENFACE CAPITAL N   DOUBLE-STRUCK CAPITAL N      2115
MATH OPENFACE CAPITAL P   DOUBLE-STRUCK CAPITAL P      2119
MATH OPENFACE CAPITAL Q   DOUBLE-STRUCK CAPITAL Q      211A
MATH SCRIPT CAPITAL R     SCRIPT CAPITAL R             211B
MATH FRAKTUR CAPITAL R    BLACK-LETTER CAPITAL R       211C
MATH OPENFACE CAPITAL R   DOUBLE-STRUCK CAPITAL R      211D
MATH OPENFACE CAPITAL Z   DOUBLE-STRUCK CAPITAL Z      2124
MATH FRAKTUR CAPITAL Z    BLACK-LETTER CAPITAL Z       2128
MATH SCRIPT CAPITAL B     SCRIPT CAPITAL B             212C
MATH FRAKTUR CAPITAL C    BLACK-LETTER CAPITAL C       212D
MATH SCRIPT SMALL E       SCRIPT SMALL E               212F
MATH SCRIPT CAPITAL E     SCRIPT CAPITAL E             2130
MATH SCRIPT CAPITAL F     SCRIPT CAPITAL F             2131
MATH SCRIPT CAPITAL M     SCRIPT CAPITAL M             2133
MATH SCRIPT SMALL O       SCRIPT SMALL O               2134
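As a cross-check of the list above, the following Python sketch compares each code with the character name reported by the standard `unicodedata` module. The names are written as they appear in the Unicode character database (with hyphens in DOUBLE-STRUCK and BLACK-LETTER); the `RESERVED` dictionary is simply a transcription of the list and is not part of the proposal.

```python
import unicodedata

# Code point -> existing Letterlike Symbols name, transcribed from the list above.
RESERVED = {
    0x2102: "DOUBLE-STRUCK CAPITAL C",
    0x210A: "SCRIPT SMALL G",
    0x210B: "SCRIPT CAPITAL H",
    0x210C: "BLACK-LETTER CAPITAL H",
    0x210D: "DOUBLE-STRUCK CAPITAL H",
    0x210E: "PLANCK CONSTANT",
    0x2110: "SCRIPT CAPITAL I",
    0x2111: "BLACK-LETTER CAPITAL I",
    0x2112: "SCRIPT CAPITAL L",
    0x2113: "SCRIPT SMALL L",
    0x2115: "DOUBLE-STRUCK CAPITAL N",
    0x2119: "DOUBLE-STRUCK CAPITAL P",
    0x211A: "DOUBLE-STRUCK CAPITAL Q",
    0x211B: "SCRIPT CAPITAL R",
    0x211C: "BLACK-LETTER CAPITAL R",
    0x211D: "DOUBLE-STRUCK CAPITAL R",
    0x2124: "DOUBLE-STRUCK CAPITAL Z",
    0x2128: "BLACK-LETTER CAPITAL Z",
    0x212C: "SCRIPT CAPITAL B",
    0x212D: "BLACK-LETTER CAPITAL C",
    0x212F: "SCRIPT SMALL E",
    0x2130: "SCRIPT CAPITAL E",
    0x2131: "SCRIPT CAPITAL F",
    0x2133: "SCRIPT CAPITAL M",
    0x2134: "SCRIPT SMALL O",
}

# Collect any entry whose listed name disagrees with the character database.
mismatches = [
    (hex(cp), name)
    for cp, name in RESERVED.items()
    if unicodedata.name(chr(cp)) != name
]

print(len(RESERVED), mismatches)
```

The entry count matches the 25 exclusions subtracted in the layout table.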
Rendering mathematics requires a fairly sophisticated 2D layout engine. Compared with the complexity of such an engine, handling surrogate pairs is straightforward. Furthermore, surrogate-pair handling is expected to be provided by computer operating systems, thanks to the strong business case for supporting East Asian characters in plane 2.
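To give a sense of how little is involved, here is a minimal Python sketch of the standard UTF-16 surrogate computation. It assumes the block is placed at D400…D7FF on plane 1, i.e. U+1D400 through U+1D7FF, as suggested in this proposal; the function name `to_surrogates` is hypothetical.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


# First and last positions of the suggested block (an assumption of this sketch).
first = to_surrogates(0x1D400)
last = to_surrogates(0x1D7FF)

print([hex(u) for u in first], [hex(u) for u in last])
```

Under that assumption, every position in the block shares the single high-surrogate value D835, and only the low surrogate varies.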
The proposed code assignments will be given in Unicode layout format shortly. A layout in the form of ISO/IEC 10646-2 can be provided to the editor.