Proposed Draft Unicode Technical Report #36

Security Considerations for the Implementation of Unicode and Related Technology

Version	1.0
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	2004-10-12
This Version	http://www.unicode.org/reports/tr36/tr36-1.html
Previous Version	n/a
Latest Version	http://www.unicode.org/reports/tr36/
Tracking Number	1

Summary

This document describes security considerations that are important to be aware of when working with Unicode.

Status

This document is a proposed draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1. Introduction
2. Canonical Representation
3. Visual Spoofing
- 3.1. International Domain Names
Acknowledgements
References
Modifications

Note to Reviewers: The original working title was "Unicode Security Considerations". Should the above title be changed back to that, or changed to something else? Feedback is welcome.

1. Introduction

Unicode represents a very significant advance over all previous methods of encoding characters. For the first time, all of the world's characters can be represented in a uniform manner, making it feasible for the vast majority of programs to be globalized: to handle any language in the world.

In many ways, the use of Unicode makes programs much more robust and secure. When systems needed to use a hodge-podge of different charsets for representing characters, it was possible to take advantage of differences between those charsets, or of the way in which programs converted to and from them.

However, because Unicode contains such a large number of characters, and because it incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This document describes some of the security considerations that should be taken into account by programmers, system analysts, standards developers, and others.

We anticipate that this document will grow over time, with additional sections added as needed. Initially, two areas are discussed: canonical representation and visual spoofing. For more information, see also the Unicode FAQ on Security Issues.

Note to Reviewers: Some of the examples below use Unicode characters which some browsers will not show, or at least will not show in a way that illustrates the problem. You can open up a screen shot of the examples that demonstrates the issues, using a common browser. In the final version of this document, we will probably replace each example by the corresponding image.

2. Canonical Representation

A common practice is to have a 'gatekeeper' for a system. That gatekeeper checks over incoming data to ensure that it is safe, and passes only safe data through. Once in the system, the other components assume that the data is safe. A problem arises when a component treats two pieces of text as identical -- typically by canonicalizing them to the same form -- while the gatekeeper only detected that one of them was unsafe.
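The gatekeeper problem can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of any real system: the names `gatekeeper` and `component` are hypothetical, and compatibility normalization (NFKC) stands in for whatever canonicalization a downstream component might perform.

```python
import unicodedata

def gatekeeper(path: str) -> bool:
    """Hypothetical check: accept only paths without a literal '../'."""
    return "../" not in path

def component(path: str) -> str:
    """Hypothetical downstream step that canonicalizes its input (NFKC here)."""
    return unicodedata.normalize("NFKC", path)

# FULLWIDTH FULL STOP (U+FF0E) and FULLWIDTH SOLIDUS (U+FF0F) are
# compatibility equivalents of '.' and '/'.
disguised = "\uFF0E\uFF0E\uFF0Fsecret"

passed_check = gatekeeper(disguised)     # the gatekeeper sees no '../'
after_component = component(disguised)   # canonicalization produces one
```

One way to avoid this kind of mismatch is to canonicalize first and check afterwards, so that the gatekeeper and the later components see the same text.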

UTF-8 Exploit

There are three equivalent encoding forms for Unicode: UTF-8, UTF-16, and UTF-32. UTF-8 is commonly used in XML and HTML; UTF-16 is the most common in program APIs; and UTF-32 is the best for representing single characters. While these forms are all equivalent in terms of the ability to express Unicode, the original usage of UTF-8 was open to a canonicalization exploit.

Up to The Unicode Standard, Version 3.0, the generation of "non-shortest form" UTF-8 was forbidden, as was the interpretation of illegal sequences, but the interpretation of "non-shortest form" sequences was not. Where software does interpret the non-shortest forms, security issues can arise. For example:

Process A performs security checks, but does not check for non-shortest forms.
Process B accepts the byte sequence from process A, and transforms it into UTF-16 while interpreting non-shortest forms.
The UTF-16 text may then contain characters that should have been filtered out by process A.

For example, the backslash character "\" can often be a dangerous character to let through a gatekeeper, since it can be used to access different directories. Thus a gatekeeper might specifically prevent it from getting through. The backslash is represented in UTF-8 as the byte sequence <5C>. However, as a non-shortest form, backslash could also be represented as the byte sequence <C1 9C>. When a gatekeeper doesn't catch that, but a later component interprets non-shortest forms, the result can be a real security breach. For more information, see http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx and http://www.ins.com/downloads/whitepapers/ins_white_paper_ms_iis_unicode_exploit_0801.pdf.
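The arithmetic behind the <C1 9C> example can be checked with a short sketch. The function `naive_decode_two_bytes` is a hypothetical lenient decoder written only for illustration; it applies the two-byte UTF-8 bit layout without rejecting non-shortest forms. Python's built-in UTF-8 decoder is used as an example of a strict decoder.

```python
def naive_decode_two_bytes(lead: int, trail: int) -> str:
    """Decode a 2-byte UTF-8 sequence WITHOUT rejecting non-shortest forms.

    Hypothetical lenient decoder, for illustration only.
    """
    # 110xxxxx 10yyyyyy -> xxxxxyyyyyy
    code_point = ((lead & 0x1F) << 6) | (trail & 0x3F)
    return chr(code_point)

overlong = bytes([0xC1, 0x9C])

# The lenient decoder turns <C1 9C> into U+005C, the backslash.
lenient_result = naive_decode_two_bytes(0xC1, 0x9C)

# A strict decoder refuses the same bytes.
try:
    overlong.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```

A byte-level filter that only looks for <5C> finds nothing to reject in `overlong`, which is why the lenient decoder downstream is the dangerous part.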

To address this issue, the Unicode Technical Committee modified the definition of UTF-8 in Unicode 3.1 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses.

Recommendation: Ensure that all implementations of UTF-8 used in a system are conformant to the latest version of Unicode.

To Do: add information about other possible exploits in this area:

Unicode Normalization
Case mapping
Buffer overflows with all of the above, and when converting encoding forms

3. Visual Spoofing

Visual spoofing is where a similarity in visual appearance fools a user, and causes him or her to take unsafe actions. This is not new to Unicode: it was possible to spoof simply with ASCII characters. For example, "inteI.com" uses a capital I instead of a lowercase L. The infamous example here is of course "paypaI.com":

... Not only was "Paypai.com" very convincing, but the scam artist even goes one step further. He or she is apparently emailing PayPal customers, saying they have a large payment waiting for them in their account.

The message then offers up a link, urging the recipient to claim the funds. But the URL that is displayed for the unwitting victim uses a capital "i" (I), which looks just like a lowercase "L" (l), in many computer fonts. ...

-- Beware the 'PaypaI' scam

And the spoofs nowadays are pretty clever. One is an email that looks like it comes from a trusted source, like your bank. It even has an explicit disclaimer to not trust links in email, and directs you to copy text to your address bar in your browser. The text looks ok to you, so you won't realize that you are going to a completely different site, which is then set up to simulate your bank well enough to get your password.

To a certain extent, the new forms of spoofing available with Unicode are a matter of degree and not kind. However, because of the very large number of Unicode characters (over 94,000 in the current version), the number of opportunities for visual spoofing is significantly larger than with a restricted character set such as ASCII.

International Domain Names

Spoofing is an especially important subject given the recent introduction of international domain names (IDN). There is a natural desire for people to see domain names in their own languages and writing systems; English speakers can understand this if they consider what it would be like if they always had to type web addresses with Russian characters! So IDN represents a very significant advance for most people in the world. Yet the avoidance of spoofing vulnerability requires proper implementation in browsers and other programs.

Fortunately, there is a bit of breathing space while new international domain names and programs using them have not yet been widely deployed. However, unless people take security considerations into account in their designs, this will soon lead to problems.

International domain names are, of course, not the only cases where visual spoofing can occur. For example, you might get a message asking you to allow the installation of software from "IBM", authenticated with the proper Verisign certificate, but the "M" character happens to be the Russian (Cyrillic) character that looks precisely like the English "M". However, IDN provides a good starting point for a discussion of visual spoofing.

The good news is that the design of IDN prevents a huge number of spoofing attacks. All conformant users of IDN are required to process domain names to convert compatibility-equivalent characters into a unique form; this processing eliminates most of the possibilities for visual spoofing by mapping away a large number of visually confusable characters and sequences. For example, Unicode contains the "ä" (a-umlaut) character, but also contains a free-standing umlaut ("¨") which can be used in combination with any character, including an "a". But the compatibility normalization will convert any sequence of "a" plus "¨" into the regular "ä".

Thus you can't spoof an a-umlaut with a + umlaut; it simply results in the same domain name. See example 1 below. The String column shows the actual characters; the UTF-16 shows the underlying encoding, while the IDNA column shows the IDNA format used to represent the string internally in International Domain Names.

**Safe Domain Names**
	String	UTF-16	IDNA
1a	ät.com	0061 0308 0074 002E 0063 006F 006D	xn--t-zfa.com
1b	ät.com	00E4 0074 002E 0063 006F 006D	xn--t-zfa.com

Note: The ICU demo at http://oss.software.ibm.com/cgi-bin/icu/idnademo can be used to demonstrate the results of processing different domain names. That demo was also used to get the IDNA values shown here.
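The two rows of example 1 can also be reproduced with Python's standard library, as a rough cross-check. This sketch assumes Python's built-in "idna" codec, which implements the 2003 version of IDNA; it is not the ICU demo mentioned in the note above, and other IDNA implementations may behave differently.

```python
import unicodedata

decomposed = "a\u0308t.com"    # 'a' + COMBINING DIAERESIS (example 1a)
precomposed = "\u00e4t.com"    # LATIN SMALL LETTER A WITH DIAERESIS (example 1b)

# The two strings differ as code point sequences...
differ_as_input = decomposed != precomposed

# ...but compatibility normalization maps them to the same form.
same_after_nfkc = (
    unicodedata.normalize("NFKC", decomposed)
    == unicodedata.normalize("NFKC", precomposed)
)

# The IDNA encoding step normalizes before converting, so both inputs
# end up as the same ASCII domain name.
idna_1a = decomposed.encode("idna")
idna_1b = precomposed.encode("idna")
```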

However, there remain many cases where visual spoofing can still occur with international domain names.

Cross-Script Spoofing

Visually similar characters are not usually unified across scripts. Thus a Greek omicron is encoded as a different character from the Latin "o", even though it is usually identical in appearance. This means that there is a significant number of spoofing possibilities using characters from different scripts. For example, a domain name can be spoofed by using a Greek omicron instead of an 'o', as in example 2a.

**Script Spoofing**
	String	UTF-16	IDNA
2a	tοp.com	0074 03BF 0070 002E 0063 006F 006D	xn--tp-jbc.com
2b	top.com	0074 006F 0070 002E 0063 006F 006D	top.com

However, there are many legitimate uses of mixed scripts. It is quite common, for example, to use English words (with Latin characters) in the middle of other languages using other scripts. This often happens with product names, for example, such as "Sony".

Recommendation: The user should be alerted to these cases by displaying mixed scripts with some special formatting. For example, a different color and special boundary marks are used in Example 2c. A tool-tip can be displayed when the user moves the mouse over the address to give more information about the situation.

	String	UTF-16	IDNA
2c	tοp.com	0074 03BF 0070 002E 0063 006F 006D	xn--tp-jbc.com

The Unicode Standard supplies information that can be used for detecting mixed-script text. For more information, see UAX #24: Script Names.
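A mixed-script check might look something like the following sketch. Note the assumption: Python's standard `unicodedata` module does not expose the Script property described in UAX #24, so the hypothetical helper `rough_script` approximates it with the first word of each character's name (for example "LATIN" or "GREEK"). A real implementation would use the actual script data instead.

```python
import unicodedata

def rough_script(ch: str):
    """Approximate a letter's script by the first word of its character name.

    This is a stand-in for the UAX #24 Script property, not a replacement.
    Non-letters (digits, '.', '-') return None and are ignored.
    """
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def scripts_in_label(label: str) -> set:
    """Collect the approximate scripts used by the letters of a label."""
    return {s for s in map(rough_script, label) if s is not None}

def is_mixed_script(label: str) -> bool:
    """True if the letters of the label come from more than one script."""
    return len(scripts_in_label(label)) > 1

spoofed = "t\u03bfp"   # example 2a: Greek omicron between Latin letters
plain = "top"          # example 2b: all Latin
```

A browser could use a result like `is_mixed_script(spoofed)` to decide when to apply the special formatting recommended above, while leaving single-script names untouched.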

In-Script Spoofing

However, while compatibility normalization and mixed-script detection can handle the vast majority of cases, there are other visual confusables that could cause problems. With fonts increasingly able to handle international characters, and especially with smaller font sizes in the context of an address bar, these visual confusables could be used to spoof. Importantly, these problems can be illustrated with common, widely available fonts on widely available operating systems -- this is not pointing a finger at any one vendor.

Consider the following examples, all in the same script. In each numbered case, in commonly available browsers, the strings will look identical.

**Spoofed Domain Names**
	String	UTF-16	IDNA
3a	a‐b.com	0061 2010 0062 002E 0063 006F 006D	xn--ab-v1t.com
3b	a-b.com	0061 002D 0062 002E 0063 006F 006D	a-b.com

4a	so̷s.com	0073 006F 0337 0073 002E 0063 006F 006D	xn--sos-rjc.com
4b	søs.com	0073 00F8 0073 002E 0063 006F 006D	xn--ss-lka.com

5a	z̵o.com	007A 0335 006F 002E 0063 006F 006D	xn--zo-pyb.com
5b	ƶo.com	01B6 006F 002E 0063 006F 006D	xn--o-zra.com

6a	an͂o.com	0061 006E 0342 006F 002E 0063 006F 006D	xn--ano-0kc.com
6b	año.com	0061 00F1 006F 002E 0063 006F 006D	xn--ao-zja.com

7a	Đo.org	0110 006F 002E 006F 0072 0067	xn--o-kia.org
7b	Ɖo.org	0189 006F 002E 006F 0072 0067	xn--o-40a.org
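The pairs in the table above survive compatibility normalization as distinct strings, which is why normalization alone does not catch them. The sketch below checks this, using Python's `unicodedata.normalize` as a stand-in for the normalization step of IDNA processing; the variable names are illustrative only.

```python
import unicodedata

# (a, b) pairs taken from examples 3 through 7, written with escapes so
# that the differing code points are visible.
pairs = [
    ("a\u2010b.com", "a-b.com"),          # HYPHEN vs HYPHEN-MINUS
    ("so\u0337s.com", "s\u00f8s.com"),    # o + short solidus overlay vs o-slash
    ("z\u0335o.com", "\u01b6o.com"),      # z + short stroke overlay vs z-with-stroke
    ("an\u0342o.com", "a\u00f1o.com"),    # n + Greek perispomeni vs n-tilde
    ("\u0110o.org", "\u0189o.org"),       # D-with-stroke vs African D
]

def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)

# True for every pair whose members are still different after NFKC.
still_distinct = [nfkc(a) != nfkc(b) for a, b in pairs]
```

Every entry of `still_distinct` being true means each pair would be treated as two separate names even though the strings can look identical on screen.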

Note to Reviewers: In addition to this proposed draft, it is proposed that one of the Unicode Technical Committees consider a project to gather data on sequences of characters that are visually confusable under common fonts, and get outside companies and organizations who would have an interest in this work to join in these efforts. This would provide data for browsers and other products to alert users as to potential problems, as above.

Feedback (positive or negative) on the usefulness of this project would be appreciated.

Inadequate Rendering Support

An additional problem arises when a font and/or rendering engine has inadequate support for certain sequences of characters. These are characters that should be visually distinguishable, but don't appear that way. In example 8a, the a-umlaut is followed by another umlaut. The Unicode Standard guidelines indicate that the second umlaut should be 'stacked' above the first, producing a distinct visual difference. But as this example shows, common fonts will simply superimpose the second umlaut; and if the positioning is close enough, the user will not see a difference between 8a and 8b.

**Inadequate Rendering Support**
	String	UTF-16	IDNA
8a	ä̈t.com	00E4 0308 0074 002E 0063 006F 006D	xn--t-zfa85n.com
8b	ät.com	00E4 0074 002E 0063 006F 006D	xn--t-zfa.com

9a	eḷ.com	0065 006C 0323 002E 0063 006F 006D	xn--e-zom.com
9b	ẹl.com	0065 0323 006C 002E 0063 006F 006D	xn--l-ewm.com
9c	ẹl.com	1EB9 006C 002E 0063 006F 006D	xn--l-ewm.com

In example 9, we have an even worse case. The underdot character in 9a is actually under the 'l', but with this font, it appears as under the 'e'! It is thus visually confusable with 9b (where the underdot is under the e) or the equivalent normalized form 9c.
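The relationships among these examples can be checked mechanically. The sketch below uses Python's `unicodedata` module; the helper `has_doubled_mark` is a hypothetical heuristic for spotting cases like 8a, where the same combining mark is applied twice in a row, and is not a rule taken from the Unicode Standard.

```python
import unicodedata

ex_9a = "el\u0323.com"    # underdot follows the 'l'
ex_9b = "e\u0323l.com"    # underdot follows the 'e'
ex_9c = "\u1eb9l.com"     # precomposed e-with-dot-below

def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)

def has_doubled_mark(text: str) -> bool:
    """True if the same combining mark occurs twice in a row after NFD."""
    nfd = unicodedata.normalize("NFD", text)
    return any(
        a == b and unicodedata.combining(a) != 0
        for a, b in zip(nfd, nfd[1:])
    )

ex_8a = "\u00e4\u0308t.com"   # a-umlaut followed by a second umlaut
ex_8b = "\u00e4t.com"         # plain a-umlaut
```

Here 9b and 9c normalize to the same string, while 9a remains different from both, so a program can tell 9a apart even when the rendered text cannot.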

Recommendation: Browsers and similar programs should follow the Unicode Standard guidelines to avoid spoofing problems. There is a technical note, UTN #2: Rendering Combining Marks, which provides information as to how this can be implemented even in the absence of font support.

To Do: add discussions of:

Other applications of visual spoofing, aside from the example of IDN. International domain names are actually in much better shape than many other areas, since the problem will be much more severe in any area where text is not normalized. So focus on those issues.
Bidirectional visual spoofs
Guidelines for registrars of identifiers subject to spoofing, and for displayers of identifiers. For IDNA, the latter two would be NICs and browsers.
Unicode properties. For example, more characters have numeric properties than developers might expect.
Use of Regular Expressions in validating data -- ensuring that the Regular Expression Engine follows the Unicode Guidelines, but also that use of regular expressions makes use of properties rather than fixed lists of characters.
Comparison and sorting
Discuss and/or point to other items:
- http://www.nextgenss.com/papers/unicodebo.pdf
- http://www.cs.technion.ac.il/~gabr/papers/homograph_full.pdf
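As a small illustration of the points about numeric properties and regular expressions in the list above, the following sketch shows how Python's `re` module and `int` treat digits outside ASCII. It describes Python's behavior only; other regular expression engines may differ, and the variable names are illustrative.

```python
import re

arabic_indic = "\u0663\u0664"   # ARABIC-INDIC DIGIT THREE, ARABIC-INDIC DIGIT FOUR

# For str patterns, \d matches any Unicode decimal digit by default.
matches_backslash_d = re.fullmatch(r"\d+", arabic_indic) is not None

# A fixed ASCII list does not match the same text.
matches_ascii_range = re.fullmatch(r"[0-9]+", arabic_indic) is not None

# int() also accepts Unicode decimal digits.
parsed = int(arabic_indic)
```

A validator written with `\d` and one written with `[0-9]` therefore accept different inputs, which matters if one is used by a gatekeeper and the other by a later component.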

Note to Reviewers: additional topics and links would be appreciated.

Acknowledgements

Steven Loomis and other people on the ICU team were very helpful in developing the original proposal for this technical report. Thanks also to the following people for their feedback or contributions to this document: Martin Dürst, Paul Hoffman.

References

To Do: comb through the text and convert the references to the standard form.

[Feedback]	Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[UCD]	Unicode Character Database. http://www.unicode.org/ucd/ For an overview of the Unicode Character Database and a list of its associated files.
[Unicode]	The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1.
[Versions]	Versions of the Unicode Standard http://www.unicode.org/standard/versions/ For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Modifications

The following summarizes modifications from the previous version of this document.

1	Initial version, following proposal to UTC. Incorporated comments, restructured, added To Do items.

Copyright © 2000-2004 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.