PRI #228: Changing some common characters from Punctuation to Symbol

The Unicode Technical Committee (UTC) is requesting feedback on a proposal to change the classification of some common characters from Punctuation to Symbol. The proposal is intended to better reflect how these characters are processed in cases where the distinction between Punctuation and Symbol is significant. Because these characters are quite common, the proposed change may affect a large number of implementations.

People have questioned why certain characters such as the number sign (#) and at sign (@) are classified as punctuation in the Unicode Standard, when they seem more accurately characterized as symbols, and when seemingly similar characters, such as the section sign (§) and the copyright sign (©), are classed as symbols.

This categorization is defined in the Unicode Standard by the General_Category property (gc). Whether a character has gc=Symbol or gc=Punctuation makes a significant difference to implementations. For example, Punctuation characters are commonly ignored in searching and collation, while Symbol characters are not. This is the case in CLDR collation. As another example, Symbol characters are commonly excluded from registered personal names, whereas some Punctuation characters are allowed.
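To illustrate the kind of dependency described above, here is a minimal Python sketch of a search key that drops Punctuation characters but keeps Symbol characters. It is a simplified illustration, not the CLDR collation algorithm. The helper name `search_key` is hypothetical, and the results depend on the version of the Unicode Character Database bundled with the Python interpreter.

```python
import unicodedata

def search_key(text: str) -> str:
    """Build a naive search key: lowercase the text and drop every
    character whose General_Category is in the Punctuation group (P*).
    Symbol characters (S*) are kept. Hypothetical helper for illustration."""
    return "".join(
        ch for ch in text.lower()
        if not unicodedata.category(ch).startswith("P")
    )

# While '#' is classified as Punctuation, "C#" and "C" collapse to the
# same key; a character classified as Symbol, such as the copyright sign,
# is retained in the key.
print(search_key("C#") == search_key("C"))
print("©" in search_key("© 2011"))
```

Under this kind of filter, reclassifying a character from Punctuation to Symbol changes which strings compare as equal, with no change to the code itself.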

The list of characters being considered for this change in General_Category is:

U+0023 ( # ) NUMBER SIGN
U+0026 ( & ) AMPERSAND
U+0040 ( @ ) COMMERCIAL AT
U+0025 ( % ) PERCENT SIGN
U+2030 ( ‰ ) PER MILLE SIGN
U+002A ( * ) ASTERISK
U+2020 ( † ) DAGGER
U+2021 ( ‡ ) DOUBLE DAGGER

Note: The character U+002D ( - ) HYPHEN-MINUS was originally part of the list, but was withdrawn based on feedback.
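Reviewers who want to see how their own environment currently classifies these characters can query the General_Category directly. The following Python sketch uses the standard `unicodedata` module; the names `CANDIDATES` and `describe` are introduced here for illustration only, and the values printed reflect whichever Unicode version the interpreter ships with.

```python
import unicodedata

# Code points from the list above, plus the withdrawn U+002D HYPHEN-MINUS.
CANDIDATES = [0x0023, 0x0026, 0x0040, 0x0025, 0x2030,
              0x002A, 0x2020, 0x2021, 0x002D]

def describe(cp: int) -> str:
    """Format one code point with its character name and General_Category."""
    ch = chr(cp)
    return f"U+{cp:04X} ( {ch} ) {unicodedata.name(ch)}: gc={unicodedata.category(ch)}"

print(f"Unicode data version: {unicodedata.unidata_version}")
for cp in CANDIDATES:
    print(describe(cp))
```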

The UTC is seeking feedback as to whether the General_Category value of every character in this list should be changed to Symbol, or whether only some subset of them should be changed.

One possible advantage of such a change is that implementations would get more expected behavior "out of the box", without having to customize code that depends on this data. For example, these characters would no longer be ignored by default in implementations of CLDR collation. Behavior for regular expressions using Symbol or Punctuation may also match expectations better.
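The customization mentioned above might look like the following Python sketch: a local override layered on top of the Unicode data so that selected characters are treated as Symbol regardless of their published General_Category. The names `TREAT_AS_SYMBOL` and `major_class` are hypothetical, and the override set is simply the list of characters under review, not a recommendation.

```python
import unicodedata

# Hypothetical override set: characters this implementation chooses to
# treat as Symbol, whatever the Unicode data says.
TREAT_AS_SYMBOL = frozenset("#&@%‰*†‡")

def major_class(ch: str) -> str:
    """Return the first letter of the General_Category (for example
    'P', 'S', or 'L'), after applying the local override."""
    if ch in TREAT_AS_SYMBOL:
        return "S"
    return unicodedata.category(ch)[0]

print(major_class("@"))  # overridden locally
print(major_class(","))  # taken from the Unicode data
```

If the proposed change were adopted, an override table of this kind would become unnecessary for the listed characters.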

One possible disadvantage is that many of these characters are quite common, so the change may negatively impact implementations that depend on the existing distinction between Punctuation and Symbol. In particular, the case for changing the General_Category value of U+002D ( - ) HYPHEN-MINUS is problematic. The semantics of that character are overloaded because it functions both as a minus sign (symbol) and as a hyphen (punctuation).

The UTC also welcomes feedback offering alternative approaches that might address this issue in a different manner.