Punctuation and Symbols
Q: What does the Unicode Standard mean by punctuation marks, as opposed to symbols?
A: Punctuation marks are standardized marks or signs used to clarify the meaning and separate structural units of text. The General Category property assigns characters to either Punctuation or Symbol based on their primary usage.
Q: Is the line between the categories of Punctuation and Symbol always clear?
A: No, in many cases the distinction between punctuation marks and symbols is not clear-cut. Punctuation marks such as period (full-stop), comma, parentheses, and so on are unambiguously punctuation, and characters such as heart (♥), not equals (≠), and smiling-face (☺) are unambiguously symbols.
However, for historical reasons certain characters such as number sign (#), ampersand (&), commercial at sign (@), and percent sign (%) are also considered Punctuation in Unicode, while seemingly similar characters are classed as Symbol, such as the numero sign (№) and copyright sign (©).
Q: Is the classification of a character as Punctuation the same in all contexts, and is it based on the General Category?
A: No. In some contexts, such as mathematical usage, there are entirely different classifications, based on properties other than the General Category. Characters are classified by whether they represent variables, numbers, or operators (and, if so, what kind). These classifications can be very different from ordinary text usage, and can depend on context. For example, the character "!" could be classified as an operator or as punctuation depending on where it appears in a document.
Q: Can you give a more complete list of Unicode punctuation characters that often should be treated as symbols?
A: Because this issue is about conflicting usage, it is not possible to generate a comprehensive list. However, some of the most common ones are:
U+0023 ( # ) NUMBER SIGN
U+0026 ( & ) AMPERSAND
U+0040 ( @ ) COMMERCIAL AT
U+0025 ( % ) PERCENT SIGN
U+2030 ( ‰ ) PER MILLE SIGN
U+2031 ( ‱ ) PER TEN THOUSAND SIGN
U+002A ( * ) ASTERISK
U+2020 ( † ) DAGGER
U+2021 ( ‡ ) DOUBLE DAGGER
U+203B ( ※ ) REFERENCE MARK
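As a quick check, the General Category of each character in the list above can be looked up programmatically. The following is a minimal sketch using Python's standard unicodedata module; the names LISTED and describe are illustrative only, and the values reported depend on the version of the Unicode Character Database bundled with the Python interpreter.

```python
import unicodedata

# Code points copied from the list above.
LISTED = [0x0023, 0x0026, 0x0040, 0x0025, 0x2030, 0x2031,
          0x002A, 0x2020, 0x2021, 0x203B]

def describe(cp):
    """Return the code point, character name, and General Category as one string."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}"

for cp in LISTED:
    print(describe(cp))
```

Every entry should report Po (other punctuation), even though many users think of these characters as symbols.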
Q: Isn't the U+002D HYPHEN-MINUS also an important case?
A: Yes. The character U+002D HYPHEN-MINUS is used both as a hyphen (punctuation mark) and as a minus sign (math symbol), with the intended meaning only apparent from context. Because only a single value can be assigned for the General Category, this character has the value Pd (dash punctuation). At the same time, unambiguous Unicode characters also exist for each of these functions: U+2010 ( ‐ ) HYPHEN and U+2212 ( − ) MINUS SIGN. The former is also Pd, while the latter has been given the General Category Sm (mathematical symbol).
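The three characters discussed here can be compared directly. This is a small sketch using Python's standard unicodedata module; the helper name category_of is hypothetical.

```python
import unicodedata

def category_of(ch):
    """Return the General Category value for a single character."""
    return unicodedata.category(ch)

# HYPHEN-MINUS, HYPHEN, and MINUS SIGN, in that order.
for ch in ("\u002D", "\u2010", "\u2212"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {category_of(ch)}")
```

The ambiguous HYPHEN-MINUS and the unambiguous HYPHEN both report Pd, and only MINUS SIGN reports Sm, so the General Category alone cannot tell a program when U+002D is being used as a minus sign.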
Q: Can a conformant implementation treat these as symbols?
A: Yes. The General Category only defines the principal usage of a character. It is not intended to capture all aspects of the use of any given character. Therefore, the General Category does not define whether to treat characters as punctuation or symbols in all contexts.
For example, a character picker could list U+0023 ( # ) NUMBER SIGN under Symbols (or under both Symbols and Punctuation). An "Ignore Punctuation" option in search need not ignore U+0040 ( @ ) COMMERCIAL AT. A search engine could ignore punctuation in general, but treat the characters listed above as symbols for the purpose of search.
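An "Ignore Punctuation" option along these lines might be sketched as follows. This is only one possible approach, not a recommended implementation: the names TREAT_AS_SYMBOL and strip_punctuation are hypothetical, and the exception set simply reuses the characters from the list above.

```python
import unicodedata

# Characters from the list above that this sketch keeps,
# even though their General Category is Po.
TREAT_AS_SYMBOL = set("#&@%\u2030\u2031*\u2020\u2021\u203B")

def strip_punctuation(text):
    """Drop characters whose General Category starts with P, except the exceptions."""
    return "".join(
        ch for ch in text
        if ch in TREAT_AS_SYMBOL or not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("Hello, world!"))
print(strip_punctuation("C#, 50%"))
```

The exception set is an application-level choice. A different application could reasonably keep a different set, or none at all.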
However, there is one caveat: a conformant implementation that reports the General Category value for a character must use the actual, unmodified value. For example, a conformant regular expression syntax that allows for selection of characters by General Category must match U+0040 ( @ ) COMMERCIAL AT based on its actual General Category value (Po).
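One way to respect this caveat is to keep the reported General Category and any application-specific grouping in separate functions. The sketch below assumes Python's standard unicodedata module as the source of property values. All function names and the override set are hypothetical, and matches_general_category is only a stand-in for a General Category test in a regular expression engine.

```python
import unicodedata

# Application-level choice: characters shown under "Symbols" in a picker.
APP_SYMBOL_OVERRIDES = set("#&@%*")

def general_category(ch):
    """Report the General Category unmodified, as given by the character database."""
    return unicodedata.category(ch)

def matches_general_category(ch, value):
    """Match a two-letter value such as 'Po', or a one-letter group such as 'P'."""
    gc = general_category(ch)
    return gc == value or (len(value) == 1 and gc.startswith(value))

def picker_group(ch):
    """Application-level grouping, which is free to differ from the General Category."""
    if ch in APP_SYMBOL_OVERRIDES:
        return "Symbols"
    gc = general_category(ch)
    if gc.startswith("P"):
        return "Punctuation"
    if gc.startswith("S"):
        return "Symbols"
    return "Other"

print(general_category("@"), picker_group("@"))
```

Here the picker shows U+0040 under Symbols, while the property test still matches it as Po and never as a Symbol category.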