Punctuation and Symbols

Q: What does the Unicode Standard mean by punctuation marks, as opposed to symbols?

Punctuation marks are standardized marks or signs used to mark the structure or clarify the meaning of text. Symbols usually have a meaning of their own. The General Category property assigns characters to either Punctuation or Symbol based on their primary usage.

Q: Is there a clear distinction between the Punctuation and Symbol categories?

The distinction between punctuation marks and symbols is not absolute. Punctuation marks such as period (full stop), comma, parentheses, and so on are unambiguously punctuation, and characters such as heart ( ♥ ), not equals ( ≠ ), and smiling-face ( ☺ ) are unambiguously symbols. Characters such as number sign ( # ), ampersand ( & ), commercial at sign ( @ ), and percent sign ( % ) are considered Punctuation in Unicode, while seemingly similar characters are classed as Symbol, such as the degree sign ( ° ) and copyright sign ( © ).
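
As a rough illustration, the General Category values behind these examples can be inspected with Python's standard unicodedata module. This is only a sketch: the helper name major_class is made up for this example, and the values reported come from the Unicode version bundled with the Python build in use.

```python
import unicodedata

def major_class(ch: str) -> str:
    """Map a character's General Category to 'Punctuation', 'Symbol', or 'Other'.

    major_class is a hypothetical helper for this example, not a standard API.
    """
    gc = unicodedata.category(ch)  # two-letter value such as 'Po' or 'Sm'
    if gc.startswith("P"):
        return "Punctuation"
    if gc.startswith("S"):
        return "Symbol"
    return "Other"

# Print code point, character, General Category, and major class.
for ch in ".,(♥≠☺#&@%©":
    print(f"U+{ord(ch):04X} {ch} {unicodedata.category(ch)} {major_class(ch)}")
```

The loop reports the raw two-letter value alongside the major class, so the less obvious assignments (for example # and @ as Po, © as So) are visible directly.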

Q: Is the classification of a character as Punctuation the same in all contexts, and is it always based on the General Category?

In some contexts, such as mathematical usage, the most relevant classifications are based on properties other than the General Category: characters are classified by whether they represent variables, numbers, or operators (and, if so, what kind). These classifications can be very different from ordinary text usage, and can depend on context. For example, the character "!" could be classified as an operator or as punctuation, depending on whether it appears in a formula or elsewhere in a document.

Q: Is there a list of Unicode punctuation characters that should also be treated as symbols?

There is no comprehensive list, but here are some of the most common examples of punctuation characters that are also used as symbols:

U+0023 ( # ) NUMBER SIGN
U+0026 ( & ) AMPERSAND
U+0040 ( @ ) COMMERCIAL AT
U+0025 ( % ) PERCENT SIGN
U+2030 ( ‰ ) PER MILLE SIGN
U+2031 ( ‱ ) PER TEN THOUSAND SIGN
U+002A ( * ) ASTERISK
U+2020 ( † ) DAGGER
U+2021 ( ‡ ) DOUBLE DAGGER
U+203B ( ※ ) REFERENCE MARK

There is a list of characters that shows their typical use in the context of mathematics.
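
The characters listed above all carry a Punctuation value in the General Category, which can be checked with a short script. This is a sketch using Python's standard unicodedata module; the names LISTED and describe are introduced here purely for illustration.

```python
import unicodedata

# The ten characters from the list above (LISTED is a name made up for this example).
LISTED = "#&@%‰‱*†‡※"

def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, General Category, character name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

for ch in LISTED:
    print(*describe(ch))
```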

Q: What about U+002D HYPHEN-MINUS?

The character U+002D HYPHEN-MINUS is used both as a hyphen (punctuation mark) and as a minus sign (math symbol), with the intended meaning only apparent from context. Because only a single value is assigned for the General Category, this character has the value "Pd". At the same time, unambiguous Unicode characters also exist for each of these functions: U+2010 ( ‐ ) HYPHEN and U+2212 ( − ) MINUS SIGN. The latter has been given the General Category of Sm (mathematical symbol).
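
The three characters mentioned here can be compared directly. The following is a minimal sketch using Python's standard unicodedata module; the variable names are chosen for this example only.

```python
import unicodedata

# HYPHEN-MINUS, HYPHEN, and MINUS SIGN, written as escapes to avoid look-alike confusion.
DASH_LIKE = ["\u002D", "\u2010", "\u2212"]

# Map each character name to its General Category value.
categories = {unicodedata.name(ch): unicodedata.category(ch) for ch in DASH_LIKE}

for name, gc in categories.items():
    print(f"{name}: {gc}")
```

Software that needs to tell a hyphen from a minus sign in text containing only U+002D has to rely on context, since the General Category alone always reports Pd for that character.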

Q: Can a conformant implementation treat punctuation marks as symbols?

Yes. The General Category only defines the principal usage of a character. It is not intended to capture all aspects of the use of any given character. Therefore, the General Category does not define whether to treat characters as punctuation or symbols in a specific context.

For example, a character picker could list U+0023 ( # ) NUMBER SIGN under Symbols (or under both Symbols and Punctuation). An "Ignore Punctuation" option in search need not ignore U+0040 ( @ ) COMMERCIAL AT. A search engine could ignore punctuation in general, but treat the characters in the list above as symbols for the purpose of search.
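
One way an "Ignore Punctuation" search option might be sketched is shown below. The function search_key and the keep-list SEARCHABLE_PUNCTUATION are assumptions made for this example, not part of any standard or library; a real search engine would likely combine this with case folding, normalization, and tokenization.

```python
import unicodedata

# Hypothetical keep-list: punctuation characters this search chooses to treat as symbols.
SEARCHABLE_PUNCTUATION = frozenset("#&@%‰‱*†‡※")

def search_key(text: str, ignore_punctuation: bool = True) -> str:
    """Drop punctuation from text, except characters on the keep-list."""
    if not ignore_punctuation:
        return text
    kept = []
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P")
        if is_punct and ch not in SEARCHABLE_PUNCTUATION:
            continue  # ordinary punctuation is ignored for matching
        kept.append(ch)
    return "".join(kept)

print(search_key("Hello, user@example.com! #1 (50%)"))
# prints: Hello user@examplecom #1 50%
```

The General Category is still consulted as a default, and the keep-list overrides it only for this application's notion of what is searchable.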

However, there is one caveat: a conformant implementation that reports the General Category value for a character must use the actual, unmodified value. For example, a conformant regular expression syntax that allows for selection of characters by "General Category" must match U+0040 ( @ ) COMMERCIAL AT based on its actual General Category value (Po).
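
The caveat can be illustrated with a small matcher that selects characters strictly by their reported General Category. This is a sketch only: find_by_general_category is a hypothetical helper, written with the standard unicodedata module because Python's built-in re module has no General Category escapes.

```python
import unicodedata

def find_by_general_category(text: str, gc: str) -> list[str]:
    """Return the characters of text whose actual General Category matches gc.

    gc may be a two-letter value such as 'Po', or a one-letter major class
    such as 'P', which matches every value beginning with that letter.
    """
    return [ch for ch in text if unicodedata.category(ch).startswith(gc)]

sample = "a@b #c ©"
print(find_by_general_category(sample, "Po"))  # ['@', '#']
print(find_by_general_category(sample, "So"))  # ['©']
```

Even if an application elsewhere treats @ as a symbol, a selector like this one keeps returning it for Po and never for So, because it reports the unmodified property value.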

Q: Is the width of an en dash always half the width of an em dash?

Not necessarily. As its name implies, U+2014 ( — ) EM DASH is typically one em wide in a given font. Similarly, U+2013 ( – ) EN DASH is typically one en wide, which traditionally is half the width of an em. Like their counterparts EM SPACE and EN SPACE, these are intended to represent the modern digital equivalents of the traditional sorts. However, none of the traditional typographic conventions for their use are specified or enforced by the Unicode Standard. It is left to the discretion of the font designer whether to deviate from these traditions, bearing in mind that fonts with non-typical width relations may not work as intended for some users.