L2/04-305

Unicode Security Considerations

Mark Davis, Steven Loomis, 2004-07-23 (draft 2)

We would propose that a new Unicode Technical Report describe Unicode Security Considerations and present guidelines for dealing with them. The first draft of the report would initially consist of two main topics: Unicode encoding forms and visual spoofing, based on the following rough-draft text.

Unicode encoding forms

[Ed note: describe the security considerations behind the tightening of the Unicode encoding form definitions (non-shortest form attacks), and what the implications are of mis-identified Unicode encodings.]

Visual Spoofing

This section discusses the area of visual spoofing, starting with issues involved in International Domain Names.

The good news is that because compatibility normalization is built into international domain names (IDNA), most of the possibilities for spoofing are eliminated. This should not be surprising, since one motivation for the equivalences used in normalization is to identify visually confusable characters and sequences. For example, you can't spoof an a-umlaut with a + umlaut; it simply results in the same domain name, because IDNA uses normalization. See example 1 below.

Note: If the font in your browser doesn't show the examples listed below, here is a screen shot of the examples.

**Safe Domain Names**
	String	UTF-16	IDNA
1a	ät.com	00E4 0074 002E 0063 006F 006D	xn--t-zfa.com
1b	ät.com	0061 0308 0074 002E 0063 006F 006D	xn--t-zfa.com

Note: The ICU demo at http://oss.software.ibm.com/cgi-bin/icu/idnademo can be used to test different IDNAs.
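As a rough illustration of why examples 1a and 1b cannot be used against each other, the sketch below runs both strings through Python's standard-library `idna` codec, which applies Nameprep normalization before Punycode encoding. The use of Python and the helper name `to_idna` are assumptions made for illustration only; they are not part of this proposal.

```python
import unicodedata

def to_idna(domain: str) -> str:
    """Convert a domain name to its ASCII (IDNA) form using the stdlib codec."""
    return domain.encode("idna").decode("ascii")

precomposed = "\u00e4t.com"   # 1a: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308t.com"   # 1b: 'a' followed by U+0308 COMBINING DIAERESIS

# The two strings differ code point by code point...
print(precomposed == decomposed)

# ...but compatibility normalization (NFKC) maps them to the same text...
print(unicodedata.normalize("NFKC", precomposed)
      == unicodedata.normalize("NFKC", decomposed))

# ...so both produce the same IDNA form.
print(to_idna(precomposed), to_idna(decomposed))
```

Both calls to `to_idna` should yield the value shown in the IDNA column of the table above.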

Cross-Script Spoofing

Because visually similar characters are not usually unified across scripts, there are a large number of spoofing possibilities using characters from different scripts. For example, a domain name can be spoofed by using a Greek omicron instead of an 'o', as in example 2a. However, the user can be alerted to these cases by displaying mixed scripts with different colors, highlighting, or boundary marks, such as in example 2c. The Unicode Standard supplies information that can be used to this end: for more information, see UAX #24: Script Names. Note that such cases should only be flagged, not forbidden, since there are some legitimate cases of mixed scripts.

**Script Spoofing**
	String	UTF-16	IDNA
2a	tοp.com	0074 03BF 0070 002E 0063 006F 006D	xn--tp-jbc.com
2b	top.com	0074 006F 0070 002E 0063 006F 006D	top.com
2c	tοp.com	0074 03BF 0070 002E 0063 006F 006D	xn--tp-jbc.com
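The flagging of mixed scripts described above might be sketched as follows. Python's standard library does not expose the Script property from UAX #24, so this sketch substitutes a crude heuristic: it takes the first word of each letter's Unicode character name as its script. That heuristic, and the function names `rough_scripts` and `is_mixed_script`, are assumptions for illustration; a real implementation would use the actual script data.

```python
import unicodedata

def rough_scripts(label: str) -> set:
    """Return approximate script names for the letters in a label.

    Heuristic only: uses the first word of each letter's character name
    (e.g. "GREEK" from "GREEK SMALL LETTER OMICRON"), not the UAX #24 data.
    """
    scripts = set()
    for ch in label:
        # Only letters are considered; digits, '.', and '-' are skipped.
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(label: str) -> bool:
    """Flag (not forbid) labels whose letters come from more than one script."""
    return len(rough_scripts(label)) > 1

spoof = "t\u03bfp"   # 2a: Latin t, Greek omicron, Latin p
plain = "top"        # 2b: all Latin

print(rough_scripts(spoof), is_mixed_script(spoof))
print(rough_scripts(plain), is_mixed_script(plain))
```

A display layer could use a result like this to decide where to apply the color or boundary marks of example 2c, while still allowing the name to resolve.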

In-Script Spoofing

However, while compatibility normalization and script run detection can handle the vast majority of cases, there are other visual confusables that could cause problems. With fonts increasingly able to handle international characters, and especially with smaller font sizes in the context of an address bar, these visual confusables could be used to spoof. Importantly, these problems can be illustrated with common, widely available fonts on widely available operating systems -- this is not pointing a finger at any one vendor.

Consider the following examples, all in the same script.

**Spoofed Domain Names**
	String	UTF-16	IDNA
3a	a‐b.com	0061 2010 0062 002E 0063 006F 006D	xn--ab-v1t.com
3b	a-b.com	0061 002D 0062 002E 0063 006F 006D	a-b.com

4a	so̷s.com	0073 006F 0337 0073 002E 0063 006F 006D	xn--sos-rjc.com
4b	søs.com	0073 00F8 0073 002E 0063 006F 006D	xn--ss-lka.com

5a	z̵o.com	007A 0335 006F 002E 0063 006F 006D	xn--zo-pyb.com
5b	ƶo.com	01B6 006F 002E 0063 006F 006D	xn--o-zra.com

6a	an͂o.com	0061 006E 0342 006F 002E 0063 006F 006D	xn--ano-0kc.com
6b	año.com	0061 00F1 006F 002E 0063 006F 006D	xn--ao-zja.com

7a	Đo.org	0110 006F 002E 006F 0072 0067	xn--o-kia.org
7b	Ɖo.org	0189 006F 002E 006F 0072 0067	xn--o-40a.org
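To make concrete why normalization alone does not help with examples 3 through 7, the following sketch checks whether compatibility normalization (NFKC) maps the two members of each pair to the same text. The pairs are transcribed from the UTF-16 column of the table above; the helper name `same_after_nfkc` is hypothetical.

```python
import unicodedata

# Label pairs from examples 3 to 7 (row a, row b), written as escapes.
PAIRS = [
    ("a\u2010b", "a-b"),         # 3: U+2010 HYPHEN vs. U+002D HYPHEN-MINUS
    ("so\u0337s", "s\u00f8s"),   # 4: o + U+0337 overlay vs. U+00F8
    ("z\u0335o", "\u01b6o"),     # 5: z + U+0335 overlay vs. U+01B6
    ("an\u0342o", "a\u00f1o"),   # 6: n + U+0342 vs. U+00F1
    ("\u0110o", "\u0189o"),      # 7: U+0110 vs. U+0189
]

def same_after_nfkc(a: str, b: str) -> bool:
    """True if compatibility normalization maps both strings to the same text."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

for first, second in PAIRS:
    print(ascii(first), ascii(second), same_after_nfkc(first, second))
```

Each pair remains distinct after normalization, and every character involved is Latin or script-neutral punctuation, so neither normalization nor script-run detection would flag these cases.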

And the spoofs nowadays are pretty clever. One is an email that looks like it comes from a trusted source, like your bank. It even has an explicit disclaimer to not trust links in email, and directs you to copy text to your address bar in your browser. The text looks ok to you, so you won't realize that you are going to a completely different site, which is then set up to simulate your bank well enough to get your password.

Inadequate Rendering Support

An additional problem arises when a font and/or rendering engine has inadequate support for certain sequences of characters. These are characters that should be visually distinguishable, but don't appear that way. In example 8a, the a-umlaut is followed by another umlaut. The Unicode Standard guidelines indicate that the second umlaut should be 'stacked' above the first, producing a distinct visual difference (UTN #2: Rendering Combining Marks provides information as to how this can be implemented by a rendering engine even in the absence of font support). But as this example shows, common fonts will simply superimpose the second umlaut; and if the positioning is close enough, the user will not see a difference between 8a and 8b.

**Inadequate Rendering Support**
	String	UTF-16	IDNA
8a	ä̈t.com	00E4 0308 0074 002E 0063 006F 006D	xn--t-zfa85n.com
8b	ät.com	00E4 0074 002E 0063 006F 006D	xn--t-zfa.com

9a	eḷ.com	0065 006C 0323 002E 0063 006F 006D	xn--e-zom.com
9b	ẹl.com	0065 0323 006C 002E 0063 006F 006D	xn--l-ewm.com
9c	ẹl.com	1EB9 006C 002E 0063 006F 006D	xn--l-ewm.com

In example 9, we have an even worse case. The underdot character in 9a is actually under the 'l', but with this font, it appears as under the 'e'! It is thus visually confusable with 9b (where the underdot is under the e) or the equivalent normalized form 9c.
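One possible mitigation for cases like example 8, sketched here purely as an assumption and not as a recommendation of this proposal, is to flag text in which the same combining mark appears twice in a row once the text is canonically decomposed. The function name `has_repeated_mark` is hypothetical. This check does nothing for example 9, where the marks are well-formed and only the rendering is misleading.

```python
import unicodedata

def has_repeated_mark(text: str) -> bool:
    """True if the same combining mark occurs twice in a row after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    for prev, cur in zip(decomposed, decomposed[1:]):
        # combining() returns a nonzero class for marks such as U+0308.
        if prev == cur and unicodedata.combining(cur) != 0:
            return True
    return False

doubled = "\u00e4\u0308t.com"   # 8a: a-umlaut followed by a second umlaut
single = "\u00e4t.com"          # 8b: a-umlaut only
misplaced = "el\u0323.com"      # 9a: underdot on the 'l'

print(has_repeated_mark(doubled))
print(has_repeated_mark(single))
print(has_repeated_mark(misplaced))
```

Decomposing first means the check sees the umlaut in U+00E4 and the following U+0308 as the same mark, which a comparison of the raw code points would miss.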

Other Aspects

Of course, it has long been possible to spoof using only ASCII characters: "inteI.com", for example, uses a capital I instead of a lowercase L. The infamous example here is "paypaI.com": see http://news.zdnet.co.uk/internet/security/0,39020375,2080344,00.htm

... Not only was "Paypai.com" very convincing, but the scam artist even goes one step further. He or she is apparently emailing PayPal customers, saying they have a large payment waiting for them in their account.

The message then offers up a link, urging the recipient to claim the funds. But the URL that is displayed for the unwitting victim uses a capital "i" (I), which looks just like a lowercase "L" (l), in many computer fonts. ...

So to a certain extent, the new forms of spoofing available with Unicode are a matter of degree and not kind. In addition, there is a certain window for addressing this problem in implementations, before international domain names become truly widespread.

International domain names are not the only cases where the above spoofing problems can occur. For example, you might get a message asking you to allow the installation of software from "IBM", authenticated with the proper Verisign certificate, but the "M" is Cyrillic. International domain names are actually in much better shape than many other areas, since the problem will be much more severe in any area where text is not normalized.

[Ed note: add more here about other areas, also guidelines for registrars of identifiers subject to spoofing, and for displayers of such identifiers. For IDNA, the latter two would be NICs and browsers.]

In addition to the Unicode Technical Report, we would propose that one of the Unicode Technical Committees consider a project to gather data on sequences of characters that are visually confusable under common fonts, and get outside companies and organizations who would have an interest in this work to join in these efforts.