L2/12-142
Source: Mark Davis
Date: April 27, 2012
Subject: SJIS Trap with regional indicators

Unfortunately, the encoding of the REGIONAL INDICATOR SYMBOLs fell into the SJIS trap.

With a series of regional indicators, you have no idea where the boundaries lie unless you scan back to the first one. This is not idle speculation; it does happen that people list a series of flags, and we've run into this issue. Consider the following case, where each uppercase letter represents the corresponding regional indicator (this case is for illustration; there can be longer or shorter sequences):

VUGDECVNUZWSVIDZWFREETVGUYESDMVEERUSDOMGUMVCVADEEHUGUADKEGEE

Every pair in this sequence is a valid country code. That is, if you break the sequence into pairs starting from the first character (VU GD EC VN UZ ...), each pair is valid.

VU=Vanuatu; GD=Grenada; EC=Ecuador; VN=Vietnam; UZ=Uzbekistan; WS=Samoa; VI=U.S. Virgin Islands; DZ=Algeria; WF=Wallis and Futuna; RE=Réunion; ET=Ethiopia; VG=British Virgin Islands; UY=Uruguay; ES=Spain; DM=Dominica; VE=Venezuela; ER=Eritrea; US=United States; DO=Dominican Republic; MG=Madagascar; UM=U.S. Minor Outlying Islands; VC=Saint Vincent and the Grenadines; VA=Vatican City; DE=Germany; EH=Western Sahara; UG=Uganda; UA=Ukraine; DK=Denmark; EG=Egypt; EE=Estonia

But if you remove the first V and break the remainder into pairs, each pair is also valid, but different!

UG=Uganda; DE=Germany; CV=Cape Verde; NU=Niue; ZW=Zimbabwe; SV=El Salvador; ID=Indonesia; ZW=Zimbabwe; FR=France; EE=Estonia; TV=Tuvalu; GU=Guam; YE=Yemen; SD=Sudan; MV=Maldives; EE=Estonia; RU=Russia; SD=Sudan; OM=Oman; GU=Guam; MV=Maldives; CV=Cape Verde; AD=Andorra; EE=Estonia; HU=Hungary; GU=Guam; AD=Andorra; KE=Kenya; GE=Georgia; dangling "E".
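
The two readings above can be reproduced mechanically. The following is a minimal sketch in Python that uses the same convention as the example (uppercase ASCII letters standing in for regional indicators); the helper name pair_up is hypothetical and only serves this illustration.

```python
# Uppercase letters stand in for REGIONAL INDICATOR SYMBOLs, as in the text.
SEQ = "VUGDECVNUZWSVIDZWFREETVGUYESDMVEERUSDOMGUMVCVADEEHUGUADKEGEE"

def pair_up(letters, start=0):
    """Split letters[start:] into consecutive two-letter codes.

    A trailing single letter (a dangling indicator) is kept as the last item.
    """
    tail = letters[start:]
    return [tail[i:i + 2] for i in range(0, len(tail), 2)]

aligned = pair_up(SEQ)     # pairing from the true start of the run
shifted = pair_up(SEQ, 1)  # pairing after the first "V" is lost

print(aligned[:5])  # ['VU', 'GD', 'EC', 'VN', 'UZ']
print(shifted[:5])  # ['UG', 'DE', 'CV', 'NU', 'ZW']
print(shifted[-1])  # 'E' (dangling)
```

Nothing in the characters themselves distinguishes the two pairings; only the position of the start of the run does.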

In line breaking, you have to scan all the way to the very first character "V" in order to not break the string incorrectly. If the line is broken incorrectly, we have mojibake (and worse; subsequent lines look correct, but display the wrong characters). Any other processing that uses random access will run into the same problem. The text is also fragile; remove the first character and you change the interpretation of all the remaining characters.
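
To make the cost concrete, here is a rough sketch of the check a line breaker would need before breaking inside a run of regional indicators. It assumes the existing code point range U+1F1E6..U+1F1FF for the REGIONAL INDICATOR SYMBOLs; the function names are hypothetical, and this is an illustration of the backward scan, not a proposed algorithm.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z

def is_ri(ch):
    """True if ch is a regional indicator symbol."""
    return RI_FIRST <= ord(ch) <= RI_LAST

def to_ri(letters):
    """Map ASCII uppercase letters to the corresponding regional indicators."""
    return "".join(chr(RI_FIRST + ord(c) - ord("A")) for c in letters)

def is_pair_boundary(text, i):
    """True if a break before text[i] would not split a pair.

    The only way to know is to count the regional indicators immediately
    preceding position i, all the way back to the start of the run.
    """
    count = 0
    j = i - 1
    while j >= 0 and is_ri(text[j]):
        count += 1
        j -= 1
    return count % 2 == 0

flags = to_ri("VUGDEC")
print(is_pair_boundary(flags, 2))      # True: between VU and GD
print(is_pair_boundary(flags, 3))      # False: would split GD
print(is_pair_boundary(flags[1:], 2))  # True: same spot after losing "V"
```

The loop is linear in the length of the run, and the last line shows the fragility: once the first character is removed, the same position flips from a non-boundary to a boundary.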

One of the encoding principles of Unicode encoding forms was "no overlap": we neglected this principle with the RIs, and didn't catch it in time.

We can think of a couple of options to address this; there may be more that we haven't considered, so this needs discussion at the UTC to come up with the best approach.

  1. Add a duplicate set of REGIONAL INDICATOR TRAIL characters, and encode flags with pairs of < REGIONAL INDICATOR SYMBOL ..., REGIONAL INDICATOR TRAIL ...>. While something like this should have been done originally, it's too late.
  2. Specify that no sequence of 3 or more REGIONAL INDICATOR SYMBOLs is valid. That is, any sequence would require a separator, like ZWSP. Encoding converters would need to insert ZWSP when converting from legacy encodings (like SJIS) to Unicode, and remove ZWSP when converting to SJIS.
  3. Recommend #2 as a technique, but don't require it.
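
As a rough illustration of what option 2 would ask of a converter, the sketch below inserts ZWSP (U+200B) after every second regional indicator in a run, checks for runs of 3 or more, and removes the separators again. All function names are hypothetical. It assumes the incoming text is already correctly paired from the start of each run, and, as a further assumption not stated in the options above, the removal step drops only a ZWSP that sits directly between two regional indicators.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z
ZWSP = "\u200b"                       # ZERO WIDTH SPACE

def is_ri(ch):
    return RI_FIRST <= ord(ch) <= RI_LAST

def separate_flags(text):
    """Insert ZWSP so that no run of regional indicators exceeds two."""
    out = []
    run = 0  # regional indicators seen since the last separator
    for ch in text:
        if is_ri(ch):
            if run == 2:
                out.append(ZWSP)
                run = 0
            run += 1
        else:
            run = 0
        out.append(ch)
    return "".join(out)

def has_long_run(text):
    """True if text contains 3 or more consecutive regional indicators."""
    run = 0
    for ch in text:
        run = run + 1 if is_ri(ch) else 0
        if run >= 3:
            return True
    return False

def strip_separators(text):
    """Drop each ZWSP that sits directly between two regional indicators."""
    out = []
    for i, ch in enumerate(text):
        between = 0 < i < len(text) - 1 and is_ri(text[i - 1]) and is_ri(text[i + 1])
        if ch == ZWSP and between:
            continue
        out.append(ch)
    return "".join(out)

flags = "".join(chr(RI_FIRST + ord(c) - ord("A")) for c in "VUGDEC")
separated = separate_flags(flags)
print(has_long_run(flags), has_long_run(separated))
print(strip_separators(separated) == flags)
```

With the separators in place, a process that lands in the middle of the text needs to look back at most two characters to find a pair boundary, and deleting one character damages at most one flag.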