|Authors||Mark Davis (firstname.lastname@example.org)|
This document describes security considerations that are important to be aware of when working with Unicode.
This document is a proposed draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
Note to Reviewers: The original working title was "Unicode Security Considerations". Should the above title be changed back to that, or changed to something else? Feedback is welcome.
Unicode represents a very significant advance over all previous methods of encoding characters. For the first time, all of the world's characters can be represented in a uniform manner, making it feasible for the vast majority of programs to be globalized: to handle any language in the world.
In many ways, the use of Unicode makes programs much more robust and secure. When systems needed to use a hodge-podge of different charsets for representing characters, it was possible to take advantage of differences between those charsets, or in the way in which programs converted to and from them.
However, because Unicode contains such a large number of characters, and because it incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This document describes some of the security considerations that should be taken into account by programmers, system analysts, standards developers, and others.
We anticipate that this document will grow over time, adding additional sections as needed. Initially, there are two areas that will be discussed: canonical representation and visual spoofing. For more information, see also the Unicode FAQ on Security Issues.
Note to Reviewers: Some of the examples below use Unicode characters which some browsers will not show, or at least will not show in a way that illustrates the problem. You can open up a screen shot of the examples that demonstrates the issues, using a common browser. In the final version of this document, we will probably replace each example by the corresponding image.
A common practice is to have a 'gatekeeper' for a system. That gatekeeper checks over incoming data to ensure that it is safe, and passes only safe data through. Once in the system, the other components assume that the data is safe. A problem arises when a component treats two pieces of text as identical -- typically by canonicalizing them to the same form -- while the gatekeeper only detected that one of them was unsafe.
There are three equivalent encoding forms for Unicode: UTF-8, UTF-16, and UTF-32. UTF-8 is commonly used in XML and HTML; UTF-16 is the most common in program APIs; and UTF-32 is the best for representing single characters. While these forms are all equivalent in terms of the ability to express Unicode, the original usage of UTF-8 was open to a canonicalization exploit.
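As a small illustration of the three encoding forms, the following Python sketch encodes the same characters in each form and prints the resulting bytes in hexadecimal. The helper name `encoding_forms` is hypothetical, and big-endian byte order without a byte order mark is an assumption made only to keep the output easy to compare.

```python
def encoding_forms(text):
    """Return the hex bytes of text in UTF-8, UTF-16, and UTF-32.

    Big-endian without a BOM is an assumption chosen for readability.
    """
    return {
        "UTF-8": text.encode("utf-8").hex(),
        "UTF-16": text.encode("utf-16-be").hex(),
        "UTF-32": text.encode("utf-32-be").hex(),
    }


# U+005C (backslash), U+00E4 (a-umlaut), and a supplementary character.
for sample in ("\\", "\u00e4", "\U00010400"):
    print(f"U+{ord(sample):04X}", encoding_forms(sample))
```

The same code point occupies a different number of code units in each form, but all three forms round-trip to the same character.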
Up to The Unicode Standard, Version 3.0, the generation of "non-shortest form" UTF-8 was forbidden, as was the interpretation of illegal sequences, but the interpretation of "non-shortest form" sequences was not. Where software does interpret the non-shortest forms, security issues can arise.
For example, the backslash character "\" can often be a dangerous character to let through a gatekeeper, since it can be used to access different directories. Thus a gatekeeper might specifically prevent it from getting through. The backslash is represented in UTF-8 as the byte sequence <5C>. However, as a non-shortest form, backslash could also be represented as the byte sequence <C1 9C>. When a gatekeeper does not catch that, but a later component converts non-shortest forms, a real security breach can result. For more information, see http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx and http://www.ins.com/downloads/whitepapers/ins_white_paper_ms_iis_unicode_exploit_0801.pdf.
To address this issue, the Unicode Technical Committee modified the definition of UTF-8 in Unicode 3.1 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses.
Recommendation: Ensure that all implementations of UTF-8 used in a system are conformant to the latest version of Unicode.
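The backslash example can be sketched in Python as follows. The functions `gatekeeper_allows` and `lenient_decode_two_byte` are hypothetical stand-ins for a byte-level filter and a decoder that does not check for shortest form; they are not taken from any real product. Python's built-in UTF-8 codec is used as an example of a conformant decoder that rejects the sequence.

```python
def gatekeeper_allows(raw):
    """Hypothetical byte-level gatekeeper: blocks only the literal byte 5C."""
    return b"\x5c" not in raw


def lenient_decode_two_byte(seq):
    """Hypothetical non-conformant decoder for a two-byte sequence.

    It combines the payload bits without checking that the result
    actually requires two bytes (no shortest-form check).
    """
    lead, trail = seq
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


def strict_decode(seq):
    """Conformant decoding: raises UnicodeDecodeError on ill-formed input."""
    return seq.decode("utf-8")


overlong = bytes([0xC1, 0x9C])  # non-shortest form of U+005C

print(gatekeeper_allows(overlong))                # gatekeeper sees no 5C byte
print(lenient_decode_two_byte(overlong) == "\\")  # lenient component yields a backslash
try:
    strict_decode(overlong)
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```

The mismatch between the first two results is the exploit: the filter and the later component disagree about what the bytes mean. A conformant decoder removes the disagreement by refusing the input outright.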
To Do: add information about other possible exploits in this area:
Buffer overflows with all of the above, and when converting encoding forms
Visual spoofing occurs when a similarity in visual appearance fools a user and causes him or her to take unsafe actions. This is not new to Unicode: it was possible to spoof with ASCII characters alone. "inteI.com", for example, uses a capital I instead of a lowercase L. The infamous example here is of course "paypaI.com":
... Not only was "Paypai.com" very convincing, but the scam artist even goes one step further. He or she is apparently emailing PayPal customers, saying they have a large payment waiting for them in their account.
The message then offers up a link, urging the recipient to claim the funds. But the URL that is displayed for the unwitting victim uses a capital "i" (I), which looks just like a lowercase "L" (l), in many computer fonts. ...
And the spoofs nowadays are pretty clever. One is an email that looks like it comes from a trusted source, like your bank. It even has an explicit disclaimer to not trust links in email, and directs you to copy text to your address bar in your browser. The text looks ok to you, so you won't realize that you are going to a completely different site, which is then set up to simulate your bank well enough to get your password.
To a certain extent, the new forms of spoofing available with Unicode are a matter of degree and not kind. However, because of the very large number of Unicode characters (over 94,000 in the current version), the number of opportunities for visual spoofing is significantly larger than with a restricted character set such as ASCII.
Spoofing is an especially important subject given the recent introduction of international domain names (IDN). There is a natural desire for people to see domain names in their own languages and writing systems; English speakers can understand this if they consider what it would be like if they always had to type web addresses with Russian characters! So IDN represents a very significant advance for most people in the world. Yet the avoidance of spoofing vulnerability requires proper implementation in browsers and other programs.
Fortunately, there is a bit of breathing space before new international domain names and programs using them are widely deployed. However, unless people take security considerations into account in their designs, this will soon lead to problems.
International domain names are, of course, not the only cases where visual spoofing can occur. For example, you might get a message asking you to allow the installation of software from "IBM", authenticated with the proper Verisign certificate, but the "M" character happens to be the Russian (Cyrillic) character that looks precisely like the English "M". However, IDN provides a good starting point for a discussion of visual spoofing.
The good news is that the design of IDN prevents a huge number of spoofing attacks. All conformant users of IDN are required to process domain names to convert compatibility-equivalent characters into a unique form; this processing eliminates most of the possibilities for visual spoofing by mapping away a large number of visually confusable characters and sequences. For example, Unicode contains the "ä" (a-umlaut) character, but also contains a free-standing umlaut ("¨") which can be used in combination with any character, including an "a". But the compatibility normalization will convert any sequence of "a" plus "¨" into the regular "ä".
Thus you can't spoof an a-umlaut with a + umlaut; it simply results in the same domain name. See example 1 below. The String column shows the actual characters, the UTF-16 column shows the underlying encoding, and the IDNA column shows the IDNA format used to represent the string internally in International Domain Names.
|1a||ät.com||0061 0308 0074 002E 0063 006F 006D||xn--t-zfa.com|
|1b||ät.com||00E4 0074 002E 0063 006F 006D||xn--t-zfa.com|
Note: The ICU demo at http://oss.software.ibm.com/cgi-bin/icu/idnademo can be used to demonstrate the results of processing different domain names. That demo was also used to get the IDNA values shown here.
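Example 1 can also be reproduced with a short Python sketch. It assumes Python's built-in "idna" codec, which applies nameprep (including compatibility normalization) before Punycode encoding; that codec is used here only for illustration and is not the ICU demo mentioned above.

```python
import unicodedata

decomposed = "a\u0308t.com"   # example 1a: "a" followed by a combining umlaut
precomposed = "\u00e4t.com"   # example 1b: precomposed a-umlaut

# Compatibility normalization (NFKC) maps both spellings to the same string.
same_after_nfkc = unicodedata.normalize("NFKC", decomposed) == precomposed
print(same_after_nfkc)

# Both spellings therefore produce the same internal IDNA form.
print(decomposed.encode("idna"))
print(precomposed.encode("idna"))
```

Both calls to `encode` yield the value shown in the IDNA column for examples 1a and 1b.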
However, there remain many cases where visual spoofing can still occur with international domain names.
Visually similar characters are not usually unified across scripts. Thus a Greek omicron is encoded as a different character from the Latin "o", even though it is usually identical in appearance. This means that there is a significant number of spoofing possibilities using characters from different scripts. For example, a domain name can be spoofed by using a Greek omicron instead of an 'o', as in example 2a.
|2a||tοp.com||0074 03BF 0070 002E 0063 006F 006D||xn--tp-jbc.com|
|2b||top.com||0074 006F 0070 002E 0063 006F 006D||top.com|
However, there are many legitimate uses of mixed scripts. It is quite common, for example, to use English words (with Latin characters) in the middle of other languages using other scripts. This often happens with product names, for example, such as "Sony".
Recommendation: The user should be alerted to these cases by displaying mixed scripts with some special formatting. For example, a different color and special boundary marks are used in Example 2c. A tool-tip can be displayed when the user moves the mouse over the address to give more information about the situation.
|2c||tοp.com||0074 03BF 0070 002E 0063 006F 006D||xn--tp-jbc.com|
The Unicode Standard supplies information that can be used for detecting mixed-script text: for more information, see UAX 24 Script Names.
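A minimal sketch of mixed-script detection in Python follows. Python's standard `unicodedata` module does not expose the script property described in UAX 24, so this sketch approximates a character's script by the first word of its character name (for example "LATIN" or "GREEK"). That approximation is an assumption made for illustration; a real implementation should use the script data from the Unicode Character Database. The function names are hypothetical.

```python
import unicodedata


def scripts_in_label(label):
    """Approximate the set of scripts used by the letters in a label.

    Assumption: the first word of the character name stands in for the
    script property. Non-letters (digits, hyphens, combining marks) are skipped.
    """
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def is_mixed_script(label):
    """Return True if the label's letters come from more than one script."""
    return len(scripts_in_label(label)) > 1


print(is_mixed_script("t\u03bfp"))  # example 2a, with a Greek omicron
print(is_mixed_script("top"))       # all Latin
```

A browser could use a check of this kind to decide when to apply the special formatting recommended above, while still allowing legitimate mixed-script names through.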
However, while compatibility normalization and mixed-script detection can handle the vast majority of cases, there are other visual confusables that could cause problems. With fonts increasingly able to handle international characters, and especially with smaller font sizes in the context of an address bar, these visual confusables could be used to spoof. Importantly, these problems can be illustrated with common, widely available fonts on widely available operating systems -- this is not pointing a finger at any one vendor.
Consider the following examples, all in the same script. In each numbered case, in commonly available browsers, the strings will look identical.
|3a||a‐b.com||0061 2010 0062 002E 0063 006F 006D||xn--ab-v1t.com|
|3b||a-b.com||0061 002D 0062 002E 0063 006F 006D||a-b.com|
|4a||so̷s.com||0073 006F 0337 0073 002E 0063 006F 006D||xn--sos-rjc.com|
|4b||søs.com||0073 00F8 0073 002E 0063 006F 006D||xn--ss-lka.com|
|5a||z̵o.com||007A 0335 006F 002E 0063 006F 006D||xn--zo-pyb.com|
|5b||ƶo.com||01B6 006F 002E 0063 006F 006D||xn--o-zra.com|
|6a||an͂o.com||0061 006E 0342 006F 002E 0063 006F 006D||xn--ano-0kc.com|
|6b||año.com||0061 00F1 006F 002E 0063 006F 006D||xn--ao-zja.com|
|7a||Đo.org||0110 006F 002E 006F 0072 0067||xn--o-kia.org|
|7b||Ɖo.org||0189 006F 002E 006F 0072 0067||xn--o-40a.org|
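The following Python sketch checks that compatibility normalization does not unify the pairs in examples 3 through 7, which is why these strings remain distinct domain names even though they can look identical. The table `PAIRS` and the helper `still_distinct` are hypothetical names; only the first label of each domain name is tested.

```python
import unicodedata

# First labels from examples 3 through 7, as (a, b) pairs.
PAIRS = {
    "3": ("a\u2010b", "a-b"),
    "4": ("so\u0337s", "s\u00f8s"),
    "5": ("z\u0335o", "\u01b6o"),
    "6": ("an\u0342o", "a\u00f1o"),
    "7": ("\u0110o", "\u0189o"),
}


def still_distinct(a, b):
    """Return True if two strings differ even after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) != unicodedata.normalize("NFKC", b)


for number, (a, b) in PAIRS.items():
    print(number, still_distinct(a, b))
```

Every pair remains distinct after normalization, so detecting these cases requires separate data on visually confusable sequences rather than normalization alone.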
Note to Reviewers: In addition to this proposed draft, it is proposed that one of the Unicode Technical Committees consider a project to gather data on sequences of characters that are visually confusable under common fonts, and get outside companies and organizations who would have an interest in this work to join in these efforts. This would provide data for browsers and other products to alert users as to potential problems, as above.
Feedback (positive or negative) on the usefulness of this project would be appreciated.
An additional problem arises when a font and/or rendering engine has inadequate support for certain sequences of characters. These are characters that should be visually distinguishable, but don't appear that way. In example 8a, the a-umlaut is followed by another umlaut. The Unicode Standard guidelines indicate that the second umlaut should be 'stacked' above the first, producing a distinct visual difference. But as this example shows, common fonts will simply superimpose the second umlaut; and if the positioning is close enough, the user will not see a difference between 8a and 8b.
|8a||ä̈t.com||00E4 0308 0074 002E 0063 006F 006D||xn--t-zfa85n.com|
|8b||ät.com||00E4 0074 002E 0063 006F 006D||xn--t-zfa.com|
|9a||eḷ.com||0065 006C 0323 002E 0063 006F 006D||xn--e-zom.com|
|9b||ẹl.com||0065 0323 006C 002E 0063 006F 006D||xn--l-ewm.com|
|9c||ẹl.com||1EB9 006C 002E 0063 006F 006D||xn--l-ewm.com|
In example 9, we have an even worse case. The underdot character in 9a is actually under the 'l', but with this font, it appears as under the 'e'! It is thus visually confusable with 9b (where the underdot is under the e) or the equivalent normalized form 9c.
Recommendation: Browsers and similar programs should follow the Unicode Standard guidelines to avoid spoofing problems. There is a technical note, UTN #2: Rendering Combining Marks, which provides information as to how this can be implemented even in the absence of font support.
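Independently of how the text is rendered, a program can expose the difference between examples 9a, 9b, and 9c by listing the underlying code points, for instance in a tool-tip. The Python sketch below does this before and after canonical composition (NFC); the helper names `code_points` and `describe` are hypothetical.

```python
import unicodedata


def code_points(text):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def describe(text):
    """Return the code points of text as given and after NFC normalization."""
    return code_points(text), code_points(unicodedata.normalize("NFC", text))


# First labels of examples 9a, 9b, and 9c.
for label in ("el\u0323", "e\u0323l", "\u1eb9l"):
    print(describe(label))
```

After normalization, 9b and 9c have identical code points, while 9a differs from both because its underdot belongs to the "l".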
To Do: add discussions of:
Other applications of visual spoofing, aside from the example of IDN. International domain names are actually in much better shape than many other areas, since the problem will be much more severe in any area where text is not normalized. So focus on those issues.
Bidirectional visual spoofs
Guidelines for registrars of identifiers subject to spoofing, and for displayers of identifiers. For IDNA, the latter two would be NICs and browsers.
Unicode properties. For example, more characters have numeric properties than developers might expect.
Use of Regular Expressions in validating data -- ensuring that the Regular Expression Engine follows the Unicode Guidelines, but also that use of regular expressions makes use of properties rather than fixed lists of characters.
Comparison and sorting
Discuss and/or point to other items:
Note to Reviewers: additional topics and links would be appreciated.
Steven Loomis and other people on the ICU team were very helpful in developing the original proposal for this technical report. Thanks also to the following people for their feedback or contributions to this document: Martin Dürst, Paul Hoffman.
To Do: comb through the text and convert the references to the standard form.
|[Feedback]||Reporting Errors and Requesting Information Online
|[Reports]||Unicode Technical Reports
For information on the status and development process for technical reports, and for a list of technical reports.
|[UCD]||Unicode Character Database.
http://www.unicode.org/ucd/
For an overview of the Unicode Character Database and a list of its associated files.
|[Unicode]||The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1.|
|[Versions]||Versions of the Unicode Standard
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.
The following summarizes modifications from the previous version of this document.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.