PRI #299 Background: Representing Additional Types of Flags

Last updated: 2015-06-29

A. Background

The Unicode Standard already provides a mechanism to represent flags using pairs of REGIONAL INDICATOR symbols U+1F1E6..U+1F1FF, which were added in Unicode 6.0. The mechanism is documented in the current text of the standard and covered in Annex B of UTR #51.

On several systems, pairs of REGIONAL INDICATOR symbols are used to represent more than 200 flags as emoji. These pairs correspond to unicode_region_subtag two-letter codes, which can represent some regions such as Isle of Man, Guernsey, and Puerto Rico but not others, such as England, Scotland, Wales, U.S. states, or Russian Federation republics.
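As an illustration of the existing mechanism, the following minimal sketch maps a two-letter region code to its pair of REGIONAL INDICATOR symbols and back. The function names region_flag and region_code are hypothetical and chosen only for this example; the sketch checks the shape of the code, not whether the region actually exists.

```python
RI_BASE = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A


def region_flag(region: str) -> str:
    """Map a two-letter region code such as 'IM' to a REGIONAL INDICATOR pair."""
    if len(region) != 2 or not region.isascii() or not region.isalpha():
        raise ValueError(f"expected a two-letter region code, got {region!r}")
    return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in region.upper())


def region_code(flag: str) -> str:
    """Inverse mapping: a REGIONAL INDICATOR pair back to its two-letter code."""
    if len(flag) != 2 or not all(RI_BASE <= ord(c) <= RI_BASE + 25 for c in flag):
        raise ValueError("expected exactly two REGIONAL INDICATOR symbols")
    return "".join(chr(ord(c) - RI_BASE + ord("A")) for c in flag)
```

For example, region_flag("IM") yields U+1F1EE U+1F1F2, which a supporting platform may display as the flag of the Isle of Man.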

The unicode_region_subtags defined in CLDR are based on BCP47, which is in turn based on ISO 3166-1 and UN M49 codes.

On some platforms that support a number of emoji flags, there is substantial demand to support additional flags beyond those defined for unicode_region_subtags, such as for the following:

B. Proposal

This proposal describes a mechanism for representing unicode_region_subtags and unicode_subdivision_subtags with TAG characters for designating flags. The proposal will extend UTR #51 and includes several parts:

  1. Use the TAG characters U+E0030..U+E0039 (TAG DIGITs) and U+E0041..U+E005A (TAG LATIN CAPITAL LETTERs). These TAG characters are default-ignorable, without any visible representation by themselves.
  2. Designate the character U+1F3F3 WAVING WHITE FLAG as the “base” for a subsequent sequence of TAG characters. This character is already encoded, with general category value So.
  3. Define valid sequences as the base character followed by a sequence of TAG characters, as specified by either of the following conditions:
  4. Provide guidelines and constraints for the use of the TAG sequences, to help ensure stable and non-redundant representation of regions and regional subdivisions:

C. Syntax

The syntax for well-formed subdivision flags is:

B((TL{2} (TL|TD){1,4}) | (TD{3} (TL|TD){1,4}?))

This uses the following notation:

B designates the chosen base character (U+1F3F3)
TL designates a TAG LATIN CAPITAL LETTER (A..Z)
TD designates a TAG DIGIT (ZERO..NINE)

Not all syntactically well-formed TAG sequences correspond to an actual flag—only a defined subset can be used.
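The syntax above could be checked along the following lines. This is a sketch under stated assumptions, not part of the proposal: the names to_tags, flag_sequence, and is_well_formed are hypothetical, and the trailing "?" in the syntax is read here as making the subdivision part optional after a three-digit region code. The sketch tests well-formedness only; it does not know which sequences correspond to an actual flag.

```python
import re

BASE = "\U0001F3F3"    # U+1F3F3 WAVING WHITE FLAG, the proposed base character
TAG_OFFSET = 0xE0000   # TAG characters mirror ASCII at this offset

TL = "\U000E0041-\U000E005A"  # TAG LATIN CAPITAL LETTER A..Z
TD = "\U000E0030-\U000E0039"  # TAG DIGIT ZERO..NINE

# Assumption: "(TL|TD){1,4}?" means the subdivision part may be absent
# after a three-digit region code.
WELL_FORMED = re.compile(
    f"{BASE}(?:[{TL}]{{2}}[{TL}{TD}]{{1,4}}|[{TD}]{{3}}(?:[{TL}{TD}]{{1,4}})?)"
)


def to_tags(ascii_code: str) -> str:
    """Convert ASCII letters and digits to the corresponding TAG characters."""
    if not ascii_code.isascii() or not ascii_code.isalnum():
        raise ValueError(f"expected ASCII letters or digits, got {ascii_code!r}")
    return "".join(chr(TAG_OFFSET + ord(c)) for c in ascii_code.upper())


def flag_sequence(region: str, subdivision: str = "") -> str:
    """Build the base character followed by TAG characters for the given codes."""
    tags = to_tags(region) + (to_tags(subdivision) if subdivision else "")
    return BASE + tags


def is_well_formed(sequence: str) -> bool:
    """Check a sequence against the syntax (well-formedness only)."""
    return WELL_FORMED.fullmatch(sequence) is not None
```

For example, is_well_formed(flag_sequence("GB", "SCT")) is true, while is_well_formed(flag_sequence("GB")) is false because a two-letter region code with no subdivision part does not match the syntax. The codes used here are purely illustrative.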

D. Text break considerations

The TAG characters have general category value Cf and line break property value CM. Consequently, the proposed base character followed by a sequence of TAG characters is already treated as a unit for word, sentence, and line break. Grapheme break property values and rules would need some adjustment; until those are updated in UAX #29, implementations could use a tailored grapheme break to handle these correctly.

The proposal will add language to UTR #51 recommending that each REGIONAL INDICATOR pair used to designate a flag be followed by U+200C ZERO WIDTH NON-JOINER (ZWNJ) to facilitate text break. ZWNJ is in the Extend class for grapheme and word break, and will thus be included in a grapheme or word with the preceding REGIONAL INDICATORs.
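The ZWNJ recommendation could be applied to existing text roughly as follows. This is a minimal sketch; the function name terminate_flags is hypothetical, and the sketch assumes that REGIONAL INDICATORs in a run are paired from the start of the run.

```python
import re

ZWNJ = "\u200C"                 # U+200C ZERO WIDTH NON-JOINER
RI = "\U0001F1E6-\U0001F1FF"    # REGIONAL INDICATOR SYMBOL LETTER A..Z

# A pair of REGIONAL INDICATORs that is not already followed by ZWNJ.
_UNTERMINATED_PAIR = re.compile(f"([{RI}]{{2}})(?!{ZWNJ})")


def terminate_flags(text: str) -> str:
    """Insert ZWNJ after each REGIONAL INDICATOR pair that lacks one."""
    return _UNTERMINATED_PAIR.sub(lambda m: m.group(1) + ZWNJ, text)
```

Applying the function twice gives the same result as applying it once, since pairs that are already followed by ZWNJ are left unchanged.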

E. Discussion

  1. Note that TAG sequences could also be used to designate flags corresponding to two-letter unicode_region_subtags, using the base character followed by two TAG LATIN CAPITAL LETTERs. This alternative to the use of paired REGIONAL INDICATOR SYMBOL letters to designate unicode_region_subtags has better inherent behavior for text break. However, doing so would result in two possible representations for many flags, so is not recommended. Note that the TAG sequences do allow for 3-digit region codes for the case where ISO destabilizes codes, by allowing the use of the three digit forms from BCP47.
  2. Instead of using U+1F3F3 WAVING WHITE FLAG as the base for a TAG sequence, an alternative possibility is encoding a new character, perhaps U+1F1E5 REGIONAL FLAG BASE. Encoding a new character would delay support for the desired flags until it could be encoded (and the character alone would still need some sort of representation as a flag), so is not recommended.
  3. A special fallback appearance should be used for the base followed by any unsupported or invalid sequence of TAG characters. The recommended glyph for the fallback is U+1F3F3 WAVING WHITE FLAG in a dotted rectangle.
  4. Use of UN M49 codes to designate flags for supra-national and international organizations requires additional guidelines. For many M49 codes that designate supra-national regions, there is no reasonable flag; for others there are several possibilities, but all may have some political issues. For example:
  5. It is not anticipated—by any means—that flags for all or even most subdivision codes would be supported. Many subdivisions don’t have flags, or don’t have widely recognized flags. We would expect that certainly initially, and perhaps long term, only a relatively small number of subdivision flags would be widely supported and deployed.

Appendix: Material for CLDR 28 LDML specification

The following material has already been added to a draft version of UTS #35, the Unicode LDML specification, for CLDR version 28; it may be refined before the release of CLDR 28 in September 2015. The subdivisionContainment data mentioned below will also be in the CLDR 28 file in subdivisions.xml. This preliminary material is included here for reference only and is not part of Public Review Issue 299; feedback on this preliminary material can be provided as described in the Status section of the UTS #35 draft version.



 
...
EBNF: unicode_subdivision_subtag = alphanum{1,4} ;
ABNF: unicode_subdivision_subtag = 1*4alphanum
...

3.6.5 Subdivision Codes

The subdivision codes are based on ISO 3166-2 codes, which have 1..3 ASCII letters or digits. Like BCP47, CLDR needs stable codes, which are not guaranteed for ISO 3166-2 (nor have they been stable in the past).

CLDR thus adds 4-character sequences, also ASCII letters or digits, which can be used for stability. If an ISO 3166-2 code is removed, it remains valid (though marked as deprecated) in CLDR. If an ISO 3166-2 code is reused for a different subdivision (within the same region), then CLDR will define a new equivalent code using these 4-character sequences.

...

A unicode_subdivision_subtag is valid for a unicode_region_subtag only when the subdivisionContainment element contains a subgroup element where:

  1. the type attribute value is that unicode_region_subtag, and
  2. the contains attribute value contains the unicode_subdivision_subtag.

For example, the subdivision “ca” (and “CA”) is valid for the region “US” because of the following element:

<subgroup type="US" contains="… CA …"/>
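The validity test described above could be sketched as follows. The sample data is invented for illustration and is not the actual CLDR content; the sketch also assumes that the contains attribute is a space-separated list and that comparison is case-insensitive, as suggested by the “ca” (and “CA”) example. The function name is_valid_subdivision is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample data; the real subdivisionContainment data is far larger.
SAMPLE = """
<subdivisionContainment>
  <subgroup type="US" contains="AL AK CA NY"/>
  <subgroup type="CA" contains="AB BC ON QC"/>
</subdivisionContainment>
"""

# unicode_subdivision_subtag = alphanum{1,4}
SUBTAG = re.compile(r"[A-Za-z0-9]{1,4}")


def is_valid_subdivision(root: ET.Element, region: str, subdivision: str) -> bool:
    """Return True if a subgroup for the region lists the subdivision subtag."""
    if not SUBTAG.fullmatch(subdivision):
        return False
    wanted = subdivision.lower()
    for subgroup in root.iter("subgroup"):
        if subgroup.get("type", "").upper() != region.upper():
            continue
        # Assumption: "contains" holds space-separated subtags.
        if wanted in subgroup.get("contains", "").lower().split():
            return True
    return False


root = ET.fromstring(SAMPLE)
```

With this sample data, is_valid_subdivision(root, "US", "ca") is true, while is_valid_subdivision(root, "US", "on") is false because “ON” is listed only under the region “CA”.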

...