L2/12-295

PRI #229: Line breaking changes for pictographic symbols: Background

Emoji are now being used in many contexts other than Japanese e-mail and text messages. The current line break property value of most Unicode characters used for emoji is AL (alphabetic letter), which is causing problems in many of these contexts. For example, here are short lines (adapted from examples promoting an emoji application) mixing Latin characters and emoji, without spaces; there is no line break opportunity anywhere in the sentences:

Bad line break in short lines with emoji

A Japanese version of the example above would be less problematic because most Japanese characters have line break property value ID and could break before or after emoji. However, long strings of emoji are still a problem, because there is no break opportunity in the string.

This is a special case of a more general problem with the LineBreak class assignments for many symbols of GeneralCategory=So and Script=Common. Nearly all such symbols have one of the three following LineBreak classes (the number of such symbols in the category is given in parentheses):

A few (11) other So/Common symbols have special LineBreak classes such Exclamation, Punctuation, etc.

Characters with problematic line breaking are thus within the following set of 2961 symbols:
[:So:]&[:Script=Common:]&[[:Line_Break=AI:][:Line_Break=AL:]]

Changing the class to ID would provide the desired behavior for the problematic characters; the issue is to determine which of the 2961 symbols in the above set should have their class changed. For any given subset there may be arguments for or against making such a change; important general considerations include the following:

The set of symbols is initially adjusted as follows to address some of these concerns:

This reduces the initial set to 2857 symbols as follows:
[:So:]&[:Script=Common:]&[[:Line_Break=AL:][:Line_Break=AI:]]&[:ID_Continue=no:]&[:Math=no:]

From that set are further excluded characters that are (or are similar to) text-like and letter-like symbols, arrows, geometric shapes, fleurons, and symbols that are more abstract or non-pictographic (admittedly a gray area), e.g. alchemical symbols.

The proposed exclusions—characters whose line break class would not change—are the 1504 characters shown in the Main Exclusion Set.

The proposed inclusions are the 921 pictographic characters shown in the Main Inclusion Set.

The UTC would like feedback about the following four sets not part of either set above:

The UTC’s current recommendation is to the change the LineBreak property value to ID for characters in just the Main Inclusion Set and the Regional Indicator Set. This recommendation is reflected in an updated version of the LineBreak.txt file, LineBreakPRI229.txt, which is posted to make comparison with existing values and testing of implementations of this proposal easier. If the UTC's current recommendation is approved, LineBreak.txt in the UCD would be updated as shown.