Line breaking changes for pictographic symbols: Background

L2/12-295

PRI #229: Line breaking changes for pictographic symbols: Background

Emoji are now being used in many contexts other than Japanese e-mail and text messages. The current line break property value of most Unicode characters used for emoji is AL (alphabetic letter), which is causing problems in many of these contexts. For example, here are short lines (adapted from examples promoting an emoji application) mixing Latin characters and emoji, without spaces; there is no line break opportunity anywhere in the sentences:

[Figure: Bad line break in short lines with emoji]

A Japanese version of the example above would be less problematic because most Japanese characters have line break property value ID and could break before or after emoji. However, long strings of emoji are still a problem, because there is no break opportunity in the string.

This is a special case of a more general problem with the LineBreak class assignments for many symbols of GeneralCategory=So and Script=Common. Nearly all such symbols have one of the three following LineBreak classes (the number of such symbols in the category is given in parentheses):

AL: Ordinary Alphabetic and Symbol Characters - (2493) Require other characters to provide break opportunities; otherwise no breaks are allowed between them. UAX #14 already notes that at least in East Asian contexts it may be desirable to allow breaks especially between Latin letters and symbols.
ID: Ideographic - (380) Do not require other characters to provide break opportunities; lines can ordinarily break before and after and between pairs of ideographic characters. However, as with letters, breaks are not allowed between opening parentheses/quotes or prefix numeric and ID, nor between ID and closing parentheses/quotes, closing punctuation, postfix numeric, ellipses, nonstarters, etc.
AI: Ambiguous (Alphabetic or Ideograph) - (468) Can be tailored to behave like AL or ID, but the default is AL.
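
The difference between the AL and ID descriptions above can be illustrated with a deliberately small sketch. It models only the default pair behavior of those two classes (no break between AL and AL, a break allowed wherever ID is on either side) and assumes every other UAX #14 class and rule is ignored; AI would behave like AL by default. The function name and the sample sequences are hypothetical, not part of any specification.

```python
# Toy model: only the classes AL and ID, and only their default pair behavior.
# Assumption: all other UAX #14 classes and rules are ignored.

def break_opportunities(classes):
    """Return each index i where a break is allowed between item i-1 and item i."""
    allowed = []
    for i in range(1, len(classes)):
        before, after = classes[i - 1], classes[i]
        # AL followed by AL offers no break; any pair involving ID does.
        if before == "AL" and after == "AL":
            continue
        allowed.append(i)
    return allowed

# Latin letters and emoji all classified AL: no opportunity anywhere.
current = ["AL"] * 6
# The same string with the two emoji (positions 2 and 3) reclassified as ID.
proposed = ["AL", "AL", "ID", "ID", "AL", "AL"]

print(break_opportunities(current))   # []
print(break_opportunities(proposed))  # [2, 3, 4]
```

With every character classified AL the string cannot be broken at all, which is the situation described for mixed Latin and emoji text; reclassifying the emoji as ID introduces opportunities before, between, and after them.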

A few (11) other So/Common symbols have special LineBreak classes such as Exclamation, Punctuation, etc.

Characters with problematic line breaking are thus within the following set of 2961 symbols:
[:So:]&[:Script=Common:]&[[:Line_Break=AI:][:Line_Break=AL:]]

Changing the class to ID would provide the desired behavior for the problematic characters; the issue is to determine which of the 2961 symbols in the above set should have their class changed. For any given subset there may be arguments for or against making such a change; important general considerations include the following:

Backward compatibility: Changes should only be made where they are unlikely to cause a serious problem.
Consistency: In order for line break behavior to make sense to users, any such change should be applied consistently to sets of characters that users are likely to perceive as similar.
Highly pictographic symbols, such as the non-terminal pictographic symbols in the example above, are the ones for which such a change is most needed.

The set of symbols is initially adjusted by excluding the following, to address some of these concerns:

Any character with ID_Continue=yes; this removes 212E ESTIMATED SYMBOL in Letterlike Symbols from the set. (More generally, symbols attached to text such as 00A9 COPYRIGHT SIGN, and letter-like symbols generally, are further excluded below.)
Any character with Math=yes; the line breaking of mathematical text should not be affected. (Since the Math set includes many arrows and geometrical shapes, for consistency all arrows and geometrical shapes are further excluded below.)

This reduces the initial set to 2857 symbols as follows:
[:So:]&[:Script=Common:]&[[:Line_Break=AL:][:Line_Break=AI:]]&[:ID_Continue=no:]&[:Math=no:]
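
The two set expressions can be read as simple predicates over character properties. The sketch below is an illustration only: the `CharProps` record, the function names, and the four hand-entered sample rows are assumptions made for this example, and the property values should be checked against the UCD before being relied on.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CharProps:
    """Hypothetical record holding the properties the set expressions test."""
    code_point: int
    general_category: str
    script: str
    line_break: str
    id_continue: bool
    math: bool

def in_initial_set(c):
    # [:So:]&[:Script=Common:]&[[:Line_Break=AI:][:Line_Break=AL:]]
    return (c.general_category == "So"
            and c.script == "Common"
            and c.line_break in {"AL", "AI"})

def in_reduced_set(c):
    # The initial set further restricted by [:ID_Continue=no:]&[:Math=no:]
    return in_initial_set(c) and not c.id_continue and not c.math

# Hand-entered, illustrative sample values (assumed, not read from the UCD).
SAMPLE = [
    CharProps(0x00A9, "So", "Common", "AL", False, False),  # COPYRIGHT SIGN
    CharProps(0x212E, "So", "Common", "AL", True, False),   # ESTIMATED SYMBOL
    CharProps(0x2195, "So", "Common", "AI", False, True),   # UP DOWN ARROW
    CharProps(0x0041, "Lu", "Latin", "AL", True, False),    # LATIN CAPITAL LETTER A
]

initial = [f"{c.code_point:04X}" for c in SAMPLE if in_initial_set(c)]
reduced = [f"{c.code_point:04X}" for c in SAMPLE if in_reduced_set(c)]
print(initial)  # ['00A9', '212E', '2195']
print(reduced)  # ['00A9']
```

Note that 00A9 survives both filters in this sketch; it is removed only by the later exclusion of text-like and letter-like symbols.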

From that set are further excluded characters that are (or are similar to) text-like and letter-like symbols, arrows, geometric shapes, fleurons, and symbols that are more abstract or non-pictographic (admittedly a gray area), e.g. alchemical symbols.

The proposed exclusions—characters whose line break class would not change—are the 1504 characters shown in the Main Exclusion Set.

The proposed inclusions are the 921 pictographic characters shown in the Main Inclusion Set.

The UTC would like feedback about the following four sets not part of either set above:

Musical symbols and related emoji, 189 characters in the Musical Symbols Set.
Simple enclosed and square letter dingbats, 182 characters in the Enclosed Set.
Complex enclosed and square letter dingbats (multiple letters and/or complex background), 35 characters in the Complex Set.
Regional Indicators, 26 characters in the Regional Indicator Set.
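
As a cross-check, the counts quoted in this document are mutually consistent. The short sketch below only re-adds the figures given in the text; the variable names are arbitrary.

```python
# Counts of So/Common symbols per LineBreak class, as quoted in the text.
class_counts = {"AL": 2493, "AI": 468}
initial_total = sum(class_counts.values())

# Sizes of the six sets that partition the reduced set, as quoted in the text.
subsets = {
    "Main Exclusion Set": 1504,
    "Main Inclusion Set": 921,
    "Musical Symbols Set": 189,
    "Enclosed Set": 182,
    "Complex Set": 35,
    "Regional Indicator Set": 26,
}
reduced_total = sum(subsets.values())

print(initial_total)                  # 2961
print(reduced_total)                  # 2857
print(initial_total - reduced_total)  # 104
```

The difference of 104 is the number of symbols removed from the initial set by the ID_Continue and Math filters.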

The UTC’s current recommendation is to change the LineBreak property value to ID for characters in just the Main Inclusion Set and the Regional Indicator Set. This recommendation is reflected in an updated version of the LineBreak.txt file, LineBreakPRI229.txt, which is posted to make it easier to compare the proposal with existing values and to test implementations of it. If the UTC's current recommendation is approved, LineBreak.txt in the UCD would be updated as shown.
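
One way to compare the two files is to expand each into a per-code-point mapping and list the differences. The following is a minimal sketch, assuming both files use the usual UCD layout of `code point or range ; value # comment`; the function names are hypothetical, and the sample lines are invented for illustration rather than copied from either file.

```python
def parse_line_break(text):
    """Map each code point to its class from 'range;class # comment' lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        rng, cls = (field.strip() for field in line.split(";")[:2])
        first, _, last = rng.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            values[cp] = cls
    return values

def changed(old_text, new_text):
    """Return {code point: (old class, new class)} for every differing value."""
    old, new = parse_line_break(old_text), parse_line_break(new_text)
    return {cp: (old.get(cp), new[cp])
            for cp in sorted(new) if old.get(cp) != new[cp]}

# Hypothetical sample lines, not excerpts from the actual files.
OLD = "0041..005A;AL # Lu\n1F600;AL # So\n1F1E6..1F1FF;AL # So\n"
NEW = "0041..005A;AL # Lu\n1F600;ID # So\n1F1E6..1F1FF;ID # So\n"

diff = changed(OLD, NEW)
print(len(diff))      # 27
print(diff[0x1F600])  # ('AL', 'ID')
```

To run the comparison on the real data, the contents of LineBreak.txt and LineBreakPRI229.txt would be read as text and passed to `changed` in place of the sample strings.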