PRI #215: Regional Indicator Symbol Segmentation and Conversion

The Unicode Technical Committee is planning to change the way that sequences of REGIONAL INDICATOR SYMBOL characters (U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A..U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z) are used, to address problems in segmentation and the implementation of character encoding conversion.

Pairs of these REGIONAL INDICATOR SYMBOLs (RIS) are used to represent emoji flag symbols. Unfortunately, in a sequence of more than two of these characters, it is unclear where the segmentation boundaries lie, especially for grapheme cluster, word, and line break segmentation. Even for a single pair of characters, the normal grapheme cluster and word algorithms do not produce the expected results.

Segmentation implementations would have to scan all the way back to the first character in such a sequence in order to break the sequence correctly. If the sequence is broken incorrectly, the results may include mojibake (incorrect glyph display), incorrect cursor movement, incorrect deletion behavior, and so on. Any other processing that uses random access will run into similar problems.
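
The cost of the backward scan can be illustrated with a minimal sketch. It assumes, purely for illustration, that an unmarked run of RIS characters is paired off from its first character; nothing in the current specifications mandates that convention, and the function names are hypothetical.

```python
RIS_FIRST, RIS_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z


def is_ris(ch: str) -> bool:
    """True if ch is a REGIONAL INDICATOR SYMBOL."""
    return RIS_FIRST <= ord(ch) <= RIS_LAST


def is_pair_boundary(text: str, i: int) -> bool:
    """Decide whether the position between text[i-1] and text[i] lies between pairs.

    Assumption (illustrative only): pairs are counted from the start of the run.
    """
    if i <= 0 or i >= len(text):
        return True
    if not (is_ris(text[i - 1]) and is_ris(text[i])):
        return True  # not inside a run of RIS characters
    # Count the RIS characters that precede position i. The loop may walk
    # all the way back to the first character of the run.
    count = 0
    j = i - 1
    while j >= 0 and is_ris(text[j]):
        count += 1
        j -= 1
    return count % 2 == 0
```

With random access into a long run, the answer for any one position depends on the parity of everything before it, which is the problem described above.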

To address these problems, the Unicode Technical Committee is planning to use the U+200D ZERO WIDTH JOINER to mark pairs of regional indicator symbols that should be processed as single units. Each sequence <RIS, Joiner, RIS> would constitute a triple that would be mapped to an emoji flag. This requires changes to text segmentation algorithms and to encoding conversion tables for emoji characters. The proposed changes in rules and properties for text segmentation are in the proposed update documents for UAX #14 and UAX #29. The corresponding property value changes are incorporated in the data files available for beta review for the UCD for Unicode 6.2. See LineBreak.txt for the changes in the Line_Break property values, and the auxiliary subdirectory for the changes for Grapheme_Cluster_Break and Word_Break.
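
As a rough sketch of how the proposed triples could be recognized, the following code finds each <RIS, Joiner, RIS> sequence and converts it to a two-letter code by inspecting only three adjacent characters. The function names and the conversion of each RIS to an ASCII letter are assumptions made for illustration; the actual mapping to emoji flags is defined by the conversion tables, not by this code.

```python
RIS_FIRST, RIS_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z
ZWJ = "\u200D"                          # ZERO WIDTH JOINER


def is_ris(ch: str) -> bool:
    """True if ch is a REGIONAL INDICATOR SYMBOL."""
    return RIS_FIRST <= ord(ch) <= RIS_LAST


def ris_to_letter(ch: str) -> str:
    """Map a RIS to the ASCII letter it stands for (illustrative helper)."""
    return chr(ord(ch) - RIS_FIRST + ord("A"))


def decode_flag_triples(text: str) -> list[tuple[int, str]]:
    """Return (index, two-letter code) for each <RIS, Joiner, RIS> triple."""
    found = []
    i = 0
    while i + 2 < len(text):
        if is_ris(text[i]) and text[i + 1] == ZWJ and is_ris(text[i + 2]):
            found.append((i, ris_to_letter(text[i]) + ris_to_letter(text[i + 2])))
            i += 3  # skip past the whole triple
        else:
            i += 1
    return found
```

Because the Joiner marks each unit explicitly, the check at any position needs only its immediate neighbors rather than a scan back to the start of the sequence.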

The changes in Grapheme Cluster and Word segmentation are limited to just these triples. The changes in Line segmentation affect other characters and need special review. In particular, the Joiner character no longer behaves like a CM (combining mark) for line breaking.

There can be degenerate sequences like <RIS, Joiner, RIS, Joiner, RIS>. According to the proposed changes, this sequence would not break, but its interpretation would be ambiguous.
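
A process that wants to flag such degenerate sequences could look for chains of RIS characters linked by Joiners that contain more than two RIS. The sketch below is one hypothetical way to do so; the function names are illustrative, and the list of candidate pairs only demonstrates the ambiguity without resolving it.

```python
RIS_FIRST, RIS_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z
ZWJ = "\u200D"                          # ZERO WIDTH JOINER


def is_ris(ch: str) -> bool:
    """True if ch is a REGIONAL INDICATOR SYMBOL."""
    return RIS_FIRST <= ord(ch) <= RIS_LAST


def joined_runs(text: str):
    """Yield (start, end, ris_count) for each maximal RIS (Joiner RIS)* chain."""
    i, n = 0, len(text)
    while i < n:
        if not is_ris(text[i]):
            i += 1
            continue
        start, count = i, 1
        i += 1
        # Extend the chain while a Joiner is followed by another RIS.
        while i + 1 < n and text[i] == ZWJ and is_ris(text[i + 1]):
            count += 1
            i += 2
        yield start, i, count


def degenerate_runs(text: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of chains that join more than two RIS."""
    return [(s, e) for s, e, c in joined_runs(text) if c > 2]


def candidate_pairs(chain: str) -> list[str]:
    """List every adjacent pairing of letters in a chain, as ASCII codes."""
    letters = [chr(ord(c) - RIS_FIRST + ord("A")) for c in chain if is_ris(c)]
    return [a + b for a, b in zip(letters, letters[1:])]
```

For a chain built from the letters D, E, and U, `candidate_pairs` returns both "DE" and "EU", which are the two competing readings of the same unbroken sequence.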

The UTC also welcomes feedback offering alternative approaches that might address this issue in a different manner.