
Draft Unicode Technical Report #36

Security Considerations for the Implementation of Unicode and Related Technology

Authors Mark Davis (mark.davis@us.ibm.com)
Date 2005-02-20
This Version http://www.unicode.org/reports/tr36/tr36-2.html
Previous Version http://www.unicode.org/reports/tr36/tr36-1.html
Latest Version http://www.unicode.org/reports/tr36/
Revision 2

 

Summary

This document describes security considerations that are important to be aware of when working with Unicode, and provides specific recommendations for dealing with the issues that arise.

Status

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium.  This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Contents

Note to Reviewers: The original working title was "Unicode Security Considerations". Should the above title be changed back to that, or changed to something else, eg "Unicode Security Recommendations"? Feedback is welcome.


1. Introduction

Unicode represents a very significant advance over all previous methods of encoding characters. For the first time, all of the world's characters could be represented in a uniform manner, making it feasible for the vast majority of programs to be globalized: built to handle any language in the world.

In many ways, the use of Unicode makes programs much more robust and secure. When systems needed to use a hodge-podge of different charsets for representing characters, it was possible to take advantage of differences between those charsets, or in the way in which programs converted to and from them.

However, because Unicode contains such a large number of characters, and because it incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks. This document describes some of the security considerations that should be taken into account by programmers, system analysts, standards-developers, and users.

We anticipate that this document will grow over time, adding additional sections as needed. Initially, there are two areas that will be discussed: canonical representation and visual spoofing. For more information, see also the Unicode FAQ on Security Issues.

Each section below presents background information on the kinds of problems that can occur, then a list of specific recommendations for avoiding the problems.

Note to Reviewers: Some of the examples below use Unicode characters which some browsers will not show, or may not show in a way that illustrates the problem. For more information about improving the display, see [Display]. In the final version, we'll prepare GIFs for the characters where necessary.

2. Canonical Representation

A common practice is to have a 'gatekeeper' for a system. That gatekeeper checks over incoming data to ensure that it is safe, and passes only safe data through. Once in the system, the other components assume that the data is safe. A problem arises when a component treats two pieces of text as identical — typically by canonicalizing them to the same form — while the gatekeeper only detected that one of them was unsafe.

UTF-8 Exploit

There are three equivalent encoding forms for Unicode: UTF-8, UTF-16, and UTF-32. UTF-8 is commonly used in XML and HTML; UTF-16 is the most common in program APIs; and UTF-32 is the best for representing single characters. While these forms are all equivalent in terms of the ability to express Unicode, the original usage of UTF-8 was open to a canonicalization exploit.

Up to The Unicode Standard, Version 3.0, the generation of "non-shortest form" UTF-8 was forbidden, as was the interpretation of illegal sequences, but the interpretation of non-shortest forms was not forbidden. Where software does interpret the non-shortest forms, security issues can arise.

For example, the backslash character "\" can often be a dangerous character to let through a gatekeeper, since it can be used to access different directories. Thus a gatekeeper might specifically prevent it from getting through. The backslash is represented in UTF-8 as the byte sequence <5C>. However, as a non-shortest form, backslash could also be represented as the byte sequence <C1 9C>. When a gatekeeper doesn't catch that, but a component converts non-shortest forms, it can allow a real security breach. For more information, see http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx and http://www.ins.com/downloads/whitepapers/ins_white_paper_ms_iis_unicode_exploit_0801.pdf.

To address this issue, the Unicode Technical Committee modified the definition of UTF-8 in Unicode 3.1 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses.
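The gatekeeper failure described above can be sketched in Python. The `lenient_decode_2byte` helper is hypothetical: it exists only to imitate an older decoder that interprets non-shortest forms. Python's built-in UTF-8 codec is strict and rejects the same bytes.

```python
def lenient_decode_2byte(data: bytes) -> str:
    """Hypothetical lenient decoder: handles only 1- and 2-byte sequences
    and, unlike a conformant decoder, does not reject non-shortest forms."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte (ASCII) sequence.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            # Two-byte sequence 110xxxxx 10yyyyyy, with no shortest-form check.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence not handled by this sketch")
    return "".join(out)


def gatekeeper_allows(data: bytes) -> bool:
    """Naive gatekeeper that only looks for the raw backslash byte 5C."""
    return b"\x5c" not in data


def strict_decode_ok(data: bytes) -> bool:
    """True if Python's conformant UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# <C1 9C> is the non-shortest form of the backslash discussed above.
payload = b"..\xc1\x9c.."
```

With this payload, the naive gatekeeper sees no <5C> byte and lets the data through, the lenient decoder then produces a real backslash, and the strict decoder refuses the input altogether.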

Note to Reviewers: To Do:

Add information about other possible exploits in this area:

Recommendations

  1. Ensure that all implementations of UTF-8 used in a system are conformant to the latest version of Unicode.

3. Visual Spoofing

Visual spoofing is where a similarity in visual appearance fools a user, and causes him or her to take unsafe actions. This is not new to Unicode: it was possible to spoof simply with ASCII characters: "inteI.com", for example, uses a capital I instead of a lowercase L. The infamous example here is of course "paypaI.com":

... Not only was "Paypai.com" very convincing, but the scam artist even goes one step further. He or she is apparently emailing PayPal customers, saying they have a large payment waiting for them in their account.

The message then offers up a link, urging the recipient to claim the funds. But the URL that is displayed for the unwitting victim uses a capital "i" (I), which looks just like a lowercase "L" (l), in many computer fonts. ...

Beware the 'PaypaI' scam

And the spoofs nowadays are pretty clever. One is an email that looks like it comes from a trusted source, like your bank. It even has an explicit disclaimer to not trust links in email, and directs you to copy text to your address bar in your browser. The text looks ok to you, so you won't realize that you are going to a completely different site, which is then set up to simulate your bank well enough to get your password.

These spoofs depend on the use of visually confusable strings:

D1. Two different strings of Unicode characters are said to be visually confusable when their appearance in common fonts in small sizes at screen resolutions is sufficiently close that people easily mistake one for the other.

There are no hard-and-fast rules for visual confusability: it is of course possible to make any characters look like any others with a suitably faulty font. "Small sizes at screen resolutions" means fonts whose ascent + descent is from 9 to 12 pixels for most scripts, and somewhat larger for scripts where the font size users typically have is larger, such as Japanese. Of course, at sufficiently small sizes, such as 4px, a great many characters would become confusable. In some cases sequences of characters can be used to spoof: for example, "rn" ("r" followed by "n") in many sans-serif fonts is visually confusable with "m". Where two different strings are essentially identical in most fonts at all sizes, they are called homographs. However, spoofing is not dependent on just homographs; if the visual appearance is close enough at small sizes, that can be sufficient to cause problems.

Note that characters are not visually confusable if the positioning of the glyph is sufficiently different. For example, foo·com (using the hyphenation point instead of the period) should be distinguishable from foo.com by the positioning of the dot (except in faulty fonts).

To a certain extent, the new forms of visual spoofing available with Unicode are a matter of degree and not kind. However, because of the very large number of Unicode characters (over 94,000 in the current version), the number of opportunities for visual spoofing is significantly larger than with a restricted character set such as ASCII.

For examples of visually confusable characters, see [confusables].

International Domain Names

Spoofing is an especially important subject given the recent introduction of international domain names (IDN). There is a natural desire for people to see domain names in their own languages and writing systems; English speakers can understand this if they consider what it would be like if they always had to type web addresses with Russian characters! So IDN represents a very significant advance for most people in the world. The avoidance of spoofing vulnerability requires proper implementation in browsers and other programs, so as to minimize security risks without making the use of non-ASCII characters too onerous.

International domain names are, of course, not the only cases where visual spoofing can occur. For example, you might get a message asking you to allow the installation of software from "IBM", authenticated with the proper Verisign certificate, but the "M" character happens to be the Russian (Cyrillic) character that looks precisely like the English "M". Any place where strings are used as identifiers is subject to this kind of spoofing. For more information on identifiers, see UAX #31: Identifier and Pattern Syntax.

However, IDN provides a good starting point for a discussion of visual spoofing. The good news is that the design of IDN prevents a huge number of spoofing attacks. All conformant users of IDN are required to process domain names to convert compatibility-equivalent characters into a unique form; this processing eliminates most of the possibilities for visual spoofing by mapping away a large number of visually confusable characters and sequences. For example, Unicode contains the "ä" (a-umlaut) character, but also contains a free-standing umlaut ("¨") which can be used in combination with any character, including an "a". But the compatibility normalization will convert any sequence of "a" plus "¨" into the regular "ä".

Thus you can't spoof an a-umlaut with a + umlaut; it simply results in the same domain name. See example 1 below. The String column shows the actual characters; the UTF-16 shows the underlying encoding, while the IDNA column shows the IDNA format used to represent the string internally in International Domain Names.

Safe Domain Names
  String UTF-16 IDNA
1a ät.com 0061 0308 0074 002E 0063 006F 006D xn--t-zfa.com
1b ät.com 00E4 0074 002E 0063 006F 006D xn--t-zfa.com

Note: The ICU demo at http://ibm.com/software/globalization/icu/demo/domain/ can be used to demonstrate the results of processing different domain names. That demo was also used to get the IDNA values shown here.

The IDN processing also removes case distinctions by performing a case folding to reduce characters to a lowercase form. This is also useful for avoiding spoofing problems, since characters are generally more distinctive in their lowercase forms. That means that we can focus on just the lowercase characters.
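As a minimal sketch of the processing described above, the following Python fragment applies compatibility normalization and case folding to the two strings of example 1. The helper name `fold_for_comparison` is an assumption made for illustration; Python's built-in "idna" codec, which implements the original IDNA processing, is used to reproduce the IDNA column.

```python
import unicodedata


def fold_for_comparison(s: str) -> str:
    """NFKC-normalize, case-fold, then normalize again, since folding
    can produce text that is no longer in normalized form."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)


decomposed = "a\u0308t.com"    # example 1a: "a" + COMBINING DIAERESIS
precomposed = "\u00e4t.com"    # example 1b: precomposed a-umlaut

# The raw strings differ, but they compare equal after processing.
same_after_folding = (
    fold_for_comparison(decomposed) == fold_for_comparison(precomposed)
)

# Both forms yield the same ASCII-compatible domain name.
idna_1a = decomposed.encode("idna")
idna_1b = precomposed.encode("idna")
```

Both `idna_1a` and `idna_1b` come out as `xn--t-zfa.com`, matching the table: the two spellings cannot be registered as different domain names.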

For a list of allowable characters in IDN, see [idn-chars]. There are many misperceptions about which characters are allowed in IDN, so referencing this explicit list should be useful for dispelling some of them. The characters are those left after string processing has been performed, so case-folding and normalization have already been applied.

Although normalization and case-folding prevent many possible spoofing attacks, there remain many cases where visual spoofing can still occur with international domain names. Ideally, much of this would be handled on the registries' side instead of user-agents (browsers, emailers, and other programs that display and process URLs). The registry has the most data available, and can process it most efficiently at the time of registration, using policies to reduce visual spoofing. For example, given confusable mapping data, the registry can easily determine if a proposed registration conflicts with an existing one; that is much more difficult for user agents because of the sheer number of combinations that they would have to probe.
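The registry-side check can be sketched as follows. Everything here is hypothetical: the `SAMPLE_CONFUSABLES` table contains a handful of hand-picked entries for illustration and is not taken from the [confusables] data file, and the `Registry` class is an assumed in-memory stand-in for a real registration database.

```python
# Illustrative mapping only (an assumption for this sketch); a real
# registry would load maintained confusable-mapping data instead.
SAMPLE_CONFUSABLES = {
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def confusable_key(label: str) -> str:
    """Map each character to a representative of its confusable class."""
    return "".join(SAMPLE_CONFUSABLES.get(ch, ch) for ch in label)


class Registry:
    """Hypothetical registry that rejects labels differing from an
    existing registration only by confusable characters."""

    def __init__(self) -> None:
        self._by_key = {}

    def register(self, label: str) -> bool:
        key = confusable_key(label)
        existing = self._by_key.get(key)
        if existing is not None and existing != label:
            return False          # conflicts with an existing label
        self._by_key[key] = label
        return True


registry = Registry()
first = registry.register("scope")                              # Latin
second = registry.register("\u0455\u0441\u043e\u0440\u0435")    # Cyrillic look-alike
```

Because the registry stores one key per registration, the lookup is a single dictionary access; a user agent without the registration data would instead have to probe every confusable variant of a name.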

However:

So efforts need to be made on the part of user-agents as an additional line of defense.

Note: since the top-level domain names (TLD: .com, .ru, etc.) are currently always ASCII, all discussions below of the domain names pertain to all but the top level.

Cross-Script Spoofing

Visually confusable characters are not usually unified across scripts. Thus a Greek omicron is encoded as a different character from the Latin "o", even though it is usually identical or nearly identical in appearance. There are good reasons for this: often the characters were separate in legacy encodings, and preservation of those distinctions was necessary for existing data to be mapped to Unicode without loss. Moreover, the characters generally have very different behavior: two visually confusable characters may be different in casing behavior, in category (letter vs. number), or in numeric value. After all, ASCII doesn't unify lowercase L and digit 1, even though those are visually confusable. Encoding the Cyrillic character б (corresponding to the letter "b") by using the numeral 6, would clearly have been a mistake, even though they are visually confusable.

However, the existence of these cases means that there is a significant number of spoofing possibilities using characters from different scripts. For example, a domain name can be spoofed by using a Greek omicron instead of an 'o', as in example 2a.

Cross-Script Spoofing
  String UTF-16 IDNA
2a tοp.com 0074 03BF 0070 002E 0063 006F 006D xn--tp-jbc.com
2b tοp.com 0074 006F 0070 002E 0063 006F 006D top.com

There are many legitimate uses of mixed scripts. Because of the prevalence of Latin characters, it is quite common, for example, to use English words (with Latin characters) in the middle of other languages using other scripts. For example, one could have XML-документы.com (which would be a site for "XML documents" in Russian). Even in English, legitimate product or organization names may contain non-Latin characters, such as Ωmega, Teχ, Toys-Я-Us, or HλLF-LIFE. The lack of IDNs in the past has also led to the practice in some registries (such as the .ru TLD) of using Latin characters to create pseudo-Cyrillic names. For example, see http://caxap.ru/ (сахар means sugar in Russian).

The Unicode Standard supplies information that can be used for detecting mixed-script text: for more information, see UAX #24: Script Names.
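A rough sketch of mixed-script detection in Python follows. The standard library does not expose the Script property from UAX #24, so this sketch assumes that the first word of a letter's character name ("LATIN", "GREEK", "CYRILLIC", and so on) is an adequate stand-in. That is a heuristic for illustration only; a real implementation should use the script data from the Unicode Character Database.

```python
import unicodedata


def rough_scripts(text: str) -> set:
    """Approximate the set of scripts used by the letters in text.

    Assumption: the first word of the character name approximates the
    script. Non-letters (digits, hyphens, marks) are ignored, which
    roughly corresponds to treating them as Common or Inherited.
    """
    scripts = set()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(text: str) -> bool:
    return len(rough_scripts(text)) > 1


latin_only = "top"
with_omicron = "t\u03bfp"   # example 2a: GREEK SMALL LETTER OMICRON
```

Here `rough_scripts(with_omicron)` contains both "LATIN" and "GREEK", so the label from example 2a is flagged as mixed, while the all-Latin label is not.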

Cyrillic and Latin represent special challenges, since the number of common glyphs shared between them is so high, as can be seen from [idn-chars]. It may be possible to compose an entire domain name (except the TLD) in Cyrillic using letters that will be essentially always identical in form to Latin letters, such as "scope.com": with "scope" in Cyrillic looking just like "scope" in Latin. These are called whole-script confusables.

In-Script Spoofing

While compatibility normalization and mixed-script detection can handle the vast majority of cases, there are other visual confusables that could cause problems. With fonts increasingly able to handle international characters, and especially with smaller font sizes in the context of an address bar, these visual confusables could be used to spoof. Importantly, these problems can be illustrated with common, widely available fonts on widely available operating systems; this is not pointing a finger at any one vendor.

Consider the following examples, all in the same script. In each numbered case, in commonly available browsers, the strings will look identical.

Spoofed Domain Names
  String UTF-16 IDNA
3a a‐b.com 0061 2010 0062 002E 0063 006F 006D xn--ab-v1t.com
3b a-b.com 0061 002D 0062 002E 0063 006F 006D a-b.com
 
4a so̷s.com 0073 006F 0337 0073 002E 0063 006F 006D xn--sos-rjc.com
4b søs.com 0073 00F8 0073 002E 0063 006F 006D xn--ss-lka.com
 
5a z̵o.com 007A 0335 006F 002E 0063 006F 006D xn--zo-pyb.com
5b ƶo.com 01B6 006F 002E 0063 006F 006D xn--o-zra.com
 
6a an͂o.com 0061 006E 0342 006F 002E 0063 006F 006D xn--ano-0kc.com
6b año.com 0061 00F1 006F 002E 0063 006F 006D xn--ao-zja.com
 
7a Đo.org 0110 006F 002E 006F 0072 0067 xn--o-kia.org
7b Ɖo.org 0189 006F 002E 006F 0072 0067 xn--o-40a.org
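The pairs in this table survive string preprocessing, which can be checked with a short Python sketch. The `nfkc_fold` helper is an assumed simplification of the normalization and case-folding steps, not the full StringPrep profile.

```python
import unicodedata

# Code points taken from the UTF-16 column of examples 3 through 7.
PAIRS = {
    "3": ("a\u2010b", "a-b"),
    "4": ("so\u0337s", "s\u00f8s"),
    "5": ("z\u0335o", "\u01b6o"),
    "6": ("an\u0342o", "a\u00f1o"),
    "7": ("\u0110o", "\u0189o"),
}


def nfkc_fold(s: str) -> str:
    """Simplified preprocessing: NFKC normalization plus case folding."""
    return unicodedata.normalize("NFKC", s).casefold()


# For each example, record whether the two strings remain different
# after preprocessing.
still_distinct = {
    number: nfkc_fold(a) != nfkc_fold(b) for number, (a, b) in PAIRS.items()
}
```

Every entry of `still_distinct` is true: none of these pairs is unified by normalization or case folding, so each pair yields two different domain names that look alike.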

 

Inadequate Rendering Support

An additional problem arises when a font and/or rendering engine has inadequate support for certain sequences of characters. These are characters that should be visually distinguishable, but don't appear that way. In example 8a, the a-umlaut is followed by another umlaut. The Unicode Standard guidelines indicate that the second umlaut should be 'stacked' above the first, producing a distinct visual difference. But as this example shows, common fonts will simply superimpose the second umlaut; and if the positioning is close enough, the user will not see a difference between 8a and 8b.

Inadequate Rendering Support
  String UTF-16 IDNA
8a ä̈t.com 00E4 0308 0074 002E 0063 006F 006D xn--t-zfa85n.com
8b ät.com 00E4 0074 002E 0063 006F 006D xn--t-zfa.com
 
9a eḷ.com 0065 006C 0323 002E 0063 006F 006D xn--e-zom.com
9b ẹl.com 0065 0323 006C 002E 0063 006F 006D xn--l-ewm.com
9c ẹl.com 1EB9 006C 002E 0063 006F 006D xn--l-ewm.com

In example 9, we have an even worse case. The underdot character in 9a is actually under the 'l', but in many fonts, it appears as under the 'e'! It is thus visually confusable with 9b (where the underdot is under the e) or the equivalent normalized form 9c.
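The relationships among these examples can be verified in Python. The `has_repeated_mark` check is a hypothetical heuristic for spotting sequences like 8a, offered as one possible way to find strings that many fonts render badly; it is not a rule from the Unicode Standard.

```python
import unicodedata


def has_repeated_mark(text: str) -> bool:
    """True if the same nonspacing mark occurs twice in a row once the
    text is fully decomposed (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    for prev, cur in zip(decomposed, decomposed[1:]):
        if prev == cur and unicodedata.category(cur) == "Mn":
            return True
    return False


ex_8a = "\u00e4\u0308t"   # a-umlaut followed by a second umlaut
ex_8b = "\u00e4t"
ex_9a = "el\u0323"        # underdot on the "l"
ex_9b = "e\u0323l"        # underdot on the "e"
ex_9c = "\u1eb9l"         # precomposed e-with-underdot

# 9b and 9c are canonically equivalent; 9a is a different string.
nfc_9a = unicodedata.normalize("NFC", ex_9a)
nfc_9b = unicodedata.normalize("NFC", ex_9b)
```

After normalization 9b equals 9c, while 9a remains distinct, so any visual collision between 9a and the other two comes purely from rendering.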

Syntax Spoofing

Spoofing syntax characters can be even worse than regular characters. For example, U+2044 '⁄' FRACTION SLASH can look like '/' in many fonts (ideally the spacing and angle are sufficiently different as to be distinguishable, but this is not always maintained). This allows http://www.example.org/not.mydomain.com to pretend to be in the example.org domain, whereas it is actually the subzone www.example.org/not in the domain mydomain.com. Thus anything that is visually similar to '.', '/', or '#' is especially dangerous. Most of these cases, such as U+2024 (·) ONE DOT LEADER, are disallowed by StringPrep, but not all.
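A user agent could scan the host portion of a URL for such characters. The sketch below uses a small, deliberately incomplete table, `SYNTAX_LOOKALIKES`, which is an assumption made for illustration rather than an authoritative list.

```python
# Illustrative and incomplete (an assumption for this sketch): characters
# that can resemble URL syntax characters in many fonts.
SYNTAX_LOOKALIKES = {
    "\u2044": "/",  # FRACTION SLASH
    "\u2215": "/",  # DIVISION SLASH
    "\uff0f": "/",  # FULLWIDTH SOLIDUS
    "\u2024": ".",  # ONE DOT LEADER
}


def find_syntax_lookalikes(host: str) -> list:
    """Return (index, character, imitated syntax character) triples."""
    return [
        (i, ch, SYNTAX_LOOKALIKES[ch])
        for i, ch in enumerate(host)
        if ch in SYNTAX_LOOKALIKES
    ]


# The host from the example above, with FRACTION SLASH in place of "/".
spoofed_host = "www.example.org\u2044not.mydomain.com"
hits = find_syntax_lookalikes(spoofed_host)
```

A non-empty `hits` list would be grounds for an alert, since the displayed URL may not have the structure the user perceives.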

It is important also not to show a missing glyph or character with a simple "?", since that makes every such character be visually confusable with a real question mark. Instead, follow the Unicode guidelines for displaying missing glyphs using a rounded-rectangle, as described in Section 5.3 Unknown and Missing Characters of [Unicode].  For examples of this, see also [Charts].

Numeric Spoofs

Turning away from IDN for a moment, there is another area where visual spoofs can be used. Many scripts have sets of decimal digits that are different in shape than the typical European digits {0 1 2 3 4 5 6 7 8 9}. For example, Bengali has {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, while Oriya has {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}. While the sets taken as a whole are different in shape, individual digits may have the same shapes as digits from other scripts, even digits of different values. For example, the string ৪୨ is visually confusable with 89 (at small sizes), but actually has the numeric value 42! Where software simply interprets the numeric value of a string of digits, without detecting that the digits are from different scripts, it is possible to generate such spoofs.
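Python's own `int()` illustrates the problem, since it accepts decimal digits from any script. The sketch below takes as its spoof a Bengali digit four followed by an Oriya digit two, and detects mixed digit sets by computing the code point of each digit's zero. It relies on the assumption that every decimal-digit set is encoded as a contiguous run from 0 to 9; the function names are hypothetical.

```python
import unicodedata


def digit_zero_points(text: str) -> set:
    """Code points of the zero digit of each decimal-digit set used."""
    zeros = set()
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # Assumes digits 0-9 of a set are encoded contiguously.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return zeros


def parse_int_single_script(text: str) -> int:
    """Parse an integer, refusing strings that mix digit sets."""
    if len(digit_zero_points(text)) > 1:
        raise ValueError("digits from more than one script")
    return int(text)


# BENGALI DIGIT FOUR followed by ORIYA DIGIT TWO.
spoof = "\u09ea\u0b68"
naive_value = int(spoof)   # int() accepts any Unicode decimal digits
```

The naive parse silently yields 42, while `parse_int_single_script(spoof)` raises `ValueError` because two different digit sets are present.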

Recommendations

We are in the process of gathering data that would allow for a finer-grained approach, but until such time as that is more comprehensive, we'd recommend having a more conservative stance. It is always easier to widen restrictions than narrow them. We do expect these recommendations to be refined over time.

Some people have proposed prevention of spoofing by restricting domain names according to language. In practice, that is very problematic. It is very difficult to determine the intended language of many terms, especially product or company names, which are often constructed to be neutral regarding language. Moreover, languages tend to be quite fluid; foreign words are continually being adopted. Except for registries with very special policies (such as the blocking used by some East Asian registries, as described in RFC 3743), the language association doesn't make too much sense.

Instead, what is recommended is a combination of string preprocessing to remove basic equivalences, promotion of adequate rendering support, and restrictions according to script and confusable characters. While the ICANN guidelines say "top-level domain registries will (a) associate each registered internationalized domain name with one language or set of languages" (http://www.icann.org/general/idn-guidelines-20jun03.htm), that is better interpreted as limiting to script rather than language.

In the following, "appropriate alerts" are recommended. The form of such alerts could be minimal, such as special coloring or icons (perhaps with a tool-tip for more information), or more in-your-face, such as an alert dialog describing the issue and requiring user confirmation before continuing. The strength of the alert can be scaled according to the level of the potential problem. The user-agent could also remember when the user has accepted an alert (for, say, Ωmega.com) and permit future access without bothering the user again with an alert.

The term "Registry" is to be interpreted broadly. The .com operator can impose restrictions on the 2nd level domain label, but if someone registers foo.com, then it's up to them to decide what will be allowed at the 3rd level (e.g. bar.foo.com). So for that purpose, the owner of foo.com is treated as the "registry" for the 3rd level (the 'bar'). The term "Registrant" is used to refer to someone applying to a registry for a domain name.

Also see the security discussions in [IRI], [URI], and [StringPrep].
 

  1. General

    1. In checking equivalence of identifiers, preprocess both strings by applying NFKC and case folding. An example of this methodology is StringPrep (RFC 3454). Display all such identifiers to users in their post-processed form. That will, for example, always map characters to lowercase, where characters are generally more distinctive.
      • Although StringPrep itself is currently limited to Unicode 3.2, the same methodology can be applied by implementations that need to support more modern versions of Unicode.
    2. Disallow the use of non-spacing marks at the start of an identifier, since those will often be clipped in display.
    3. Assess the font support of the OS/platform according to draft UTR #32: Assessing Unicode Support (to be issued soon). See also the W3C [CharMod]. If it is inadequate, work with the OS/platform vendor to address those problems, or implement your own handling of problematic cases:
      1. Verify that accents do not appear to apply to the wrong character.
      2. Follow UTN #2: Rendering Combining Marks in providing layout of non-spacing marks that would otherwise collide. If this is not done, use the "Show Hidden" option of Section 5.13 Rendering Nonspacing Marks of [Unicode] for the display of non-spacing marks.
      3. Follow the Unicode guidelines for displaying missing glyphs using a rounded-rectangle, as described in Section 5.3 Unknown and Missing Characters of [Unicode].
    4. In all software that parses numbers, detect digits of mixed (or unexpected) scripts.
  2. Domain Names: User Agents
    1. Follow the General recommendations above.
    2. Always display the domain name in Stringprepped form. That will, for example, always map characters to lowercase, where characters are generally more distinctive.
    3. Display the URL with fonts that show as many distinctions as possible among the visually confusable characters. For example, serifed fonts are generally better than sans serif. Use a size that makes it easier to see the differences in characters. Disallow the use of font sizes that are so small as to cause even more characters to be visually confusable.
    4. Provide the following optional Security Levels according to the contents of the domain name (excluding the TLD). Any domain names only satisfying a higher level would generate appropriate alerts.
      1. No IDN
        • ASCII only
      2. Minimal (default -- short term)
        • any single script, or Han + Hiragana + Katakana, or Han + Bopomofo, or Han + Hangul,
        • plus Common, plus Inherited script characters
        • but:
        • exclude special-purpose characters (see [special])
        • (possible future extension) exclude any combining character sequences outside of NamedSequences.txt (Unicode 4.1)
      3. Moderate = Minimal +
        • allow Latin with other scripts except Cyrillic, Greek, Cherokee
      4. Expanded = Moderate +
        • allow mixtures of scripts, e.g. Ωmega.com, Teχ.org, HλLF-LIFE.com
        • allow more excluded characters (data — TBD)
      5. Unrestricted
        • any valid IDN, except those starting with combining marks, or using visually confusable syntax characters
    5. Once confusable data is more comprehensive, raise the default level to Level 3, but alert on domain names where there is some mapping of confusable characters that would result in a different domain name that would also be allowed at that security level. For example: "ѕсоре.com" (with "ѕсоре" in Cyrillic looking like "scope" in Latin) would cause an alert at Level 2, even though it is a single script (excluding the TLD).
  3. Domain Names: Registries
    1. In the short term, use Level 2 to restrict new domain names.
    2. Once the confusable data is more comprehensive, raise the allowable level to Level 5, but disallow registration of a new domain name that only differs by confusable characters from one already registered.
  4. Domain Names: Registrants
    1. Verify that the Registry follows appropriate guidelines for preventing spoofing.
    2. If the desired domain name has any whole-script confusables (such as "scope" in Latin and Cyrillic), register those as well.
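The first four Security Levels for user agents can be sketched as a classifier over a single domain label. This is a simplified illustration under stated assumptions: scripts are approximated by the first word of each letter's character name (with "CJK" treated as Han), the exclusion of special-purpose characters is not modeled, and Level 5 is omitted. The function names are hypothetical.

```python
import unicodedata

# Script combinations that Level 2 allows in addition to any single script.
HAN_COMBINATIONS = [
    {"HAN", "HIRAGANA", "KATAKANA"},
    {"HAN", "BOPOMOFO"},
    {"HAN", "HANGUL"},
]
# Scripts that Level 3 does not allow to be mixed with Latin.
NOT_MIXED_WITH_LATIN = {"CYRILLIC", "GREEK", "CHEROKEE"}


def rough_script(ch: str):
    """Approximate script of a letter; None for non-letters, which are
    treated here as Common or Inherited."""
    if not unicodedata.category(ch).startswith("L"):
        return None
    word = unicodedata.name(ch, "UNKNOWN").split()[0]
    return "HAN" if word == "CJK" else word


def security_level(label: str) -> int:
    """Lowest level (1 to 4) whose restrictions the label satisfies."""
    if label.isascii():
        return 1                                   # No IDN
    scripts = {s for s in map(rough_script, label) if s}
    if len(scripts) <= 1:
        return 2                                   # single script
    if any(scripts <= combo for combo in HAN_COMBINATIONS):
        return 2                                   # allowed Han mixtures
    if "LATIN" in scripts and not (scripts & NOT_MIXED_WITH_LATIN):
        return 3                                   # Latin + other script
    return 4                                       # arbitrary mixtures


def needs_alert(label: str, user_level: int) -> bool:
    """Alert when the label only satisfies a level above the user's."""
    return security_level(label) > user_level
```

For example, with the default Level 2, a label that mixes Latin letters with a Greek omicron is classified as Level 4 and triggers an alert, while an all-Cyrillic label does not. Catching whole-script confusables would additionally require confusable data.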

Note to Reviewers: To Do:

4. Data Files

There are three data files currently associated with this document.

[idn-chars] IDN Characters: idn-chars.html
Lists all the possible IDN chars, after StringPrep is performed. Contains all the possible characters, sorted by script, then whether atomic (non-decomposable) or decomposable, then according to UCA collation order. Scripts that are bicameral (have both upper and lower cases) are further divided into two sets, based on whether letters have both cases or not. If your browser supports tool-tips, hovering the mouse over any character will show its name and code point.
[confusables] Visually Confusable Characters: confusables.txt
The format and usage of the file are described in the file header.
Note: we are just starting the project of collecting this data, and examining the feasibility of different approaches, so we have just begun to gather data in this file.
[special] Special-Purpose Characters: special_purpose.txt
Characters that are not in common modern use.
Note: we are just starting the project of collecting this data, and examining the feasibility of different approaches, so we have just begun to gather data in this file.

Acknowledgements

Steven Loomis and other people on the ICU team were very helpful in developing the original proposal for this technical report. Thanks also to the following people for their feedback or contributions to this document or earlier versions of it: Martin Dürst, Paul Hoffman, Eric Muller, and especially Michel Suignard. This document also draws on examples or ideas suggested in email discussions from Alexander Savenkov, Eric van der Poel, and others.

References

To Do: comb through the text and convert the references to the standard form.

[CharMod] Character Model for the World Wide Web 1.0: Fundamentals
http://www.w3.org/TR/charmod/
[Charts] Unicode Charts
http://www.unicode.org/charts/
[Display] Display Problems?
http://www.unicode.org/help/display_problems.html
[IRI] RFC 3987 Internationalized Resource Identifiers (IRIs). M. Duerst, M. Suignard. January 2005.
http://ietf.org/rfc/rfc3987.txt
[Feedback] Reporting Errors and Requesting Information Online
http://www.unicode.org/reporting.html
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[StringPrep] RFC 3454 Preparation of Internationalized Strings ("stringprep"). P. Hoffman, M. Blanchet. December 2002.
http://ietf.org/rfc/rfc3454.txt
[UCD] Unicode Character Database.
http://www.unicode.org/ucd
For an overview of the Unicode Character Database and a list of its associated files
[Unicode] The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1.
[URI] RFC 3986 Uniform Resource Identifier (URI): Generic Syntax. T. Berners-Lee, R. Fielding, L. Masinter. January 2005.
http://ietf.org/rfc/rfc3986.txt
[Versions] Versions of the Unicode Standard
http://www.unicode.org/standard/versions
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Modifications

The following summarizes modifications from the previous revision of this document.

Revision 2:

Revision 1: