Unicode Frequently Asked Questions

Security Issues FAQ

Q: Does Unicode pose special security issues?

A common security issue is 'spoofing', the deliberate misspelling of a domain or user name to trick unaware users into interacting with a hostile site as if it were a trusted site. To be effective, spoofing only needs to be very approximate, for example, using the digit '1' instead of the letter 'l'. The Unicode Standard contains many "confusables," that is, characters whose glyphs, due to historical derivation or sheer coincidence, resemble each other more or less closely. Certain security-sensitive applications or systems may be vulnerable due to possible misinterpretation of these confusables by their users. [AF] & [DE]

Q: How serious is the problem of spoofing with Unicode characters?

It is important to recognize that the use of visually confusable characters in spoofing is often overstated. Confusable characters account for a small proportion of phishing problems: most instances of phishing involve social engineering or simple misleading domain names such as "secure-wellsfargo.com". For more information, see http://www.bortzmeyer.org/idn-et-phishing.html. (It is in French, but you can use Google Translate or other services to get the gist of the document if you don't read French.)

Q: Is spoofing a problem that is unique to Unicode?

No. Many legacy character sets, including ISO/IEC 8859-1, also contain confusables, albeit usually fewer of them. Unicode, however, contains many unfamiliar historical forms, as well as some scripts that share letter forms, such as Cyrillic and Latin, or Telugu and Kannada. Useful mitigation steps that mainly apply to Unicode include restricting each critical identifier to modern-use characters of a single script. [AF] & [DE]
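The single-script restriction can be illustrated with a rough sketch. Python's standard library does not expose the Unicode Script property, so this example assumes that the first word of a character's Unicode name (for example "LATIN" or "CYRILLIC") is an adequate stand-in. That is only a heuristic for illustration; a real implementation would use the script data and mechanisms described in UTS #39. The function names and sample strings are hypothetical.

```python
import unicodedata


def scripts_used(identifier: str) -> set:
    """Return rough script labels for the letters in an identifier.

    Assumption: the first word of the Unicode character name is used
    as a proxy for the Script property. This is a heuristic only.
    """
    labels = set()
    for ch in identifier:
        if ch.isalpha():  # ignore digits, punctuation, and symbols
            name = unicodedata.name(ch, "UNKNOWN")
            labels.add(name.split(" ", 1)[0])
    return labels


def is_single_script(identifier: str) -> bool:
    """True if all letters appear to come from at most one script."""
    return len(scripts_used(identifier)) <= 1


latin_only = "paypal"
# The second letter is U+0430 CYRILLIC SMALL LETTER A, not Latin 'a'.
mixed = "p\u0430ypal"

print(scripts_used(latin_only), is_single_script(latin_only))
print(scripts_used(mixed), is_single_script(mixed))
```

The two sample strings look alike in many fonts, but only the first passes the single-script check.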

Q: Why is it impossible to give all characters that use the same glyph a single code?

Unicode encodes characters, not glyphs. Unifying the encoding strictly on the basis of appearance would make many common text processing tasks convoluted or impossible. For example, Latin 'B' and Greek Beta 'Β' look the same in most fonts, but lowercase to two different letters, Latin 'b' and Greek beta 'β', which have very distinct appearances. [AF] & [DE]
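The lowercasing example can be checked directly. The following is a minimal sketch in Python using only built-in string methods; the variable names are chosen here for illustration.

```python
latin_b = "\u0042"     # LATIN CAPITAL LETTER B
greek_beta = "\u0392"  # GREEK CAPITAL LETTER BETA

# The two capitals look alike in most fonts but are distinct characters.
print(latin_b == greek_beta)

# Each lowercases to a different letter with a clearly different shape.
latin_lower = latin_b.lower()     # LATIN SMALL LETTER B
greek_lower = greek_beta.lower()  # GREEK SMALL LETTER BETA
print(latin_lower, greek_lower)
```

If both capitals shared one code point, a lowercasing function would have no way to know which of the two results was intended.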

Q: Are Latin characters commonly mixed with other scripts?

The widespread commercial use of English and other Latin-based languages means that it is quite common to have Latin-script characters (especially ASCII) in text that principally consists of other scripts, such as "خدمة RSS". This mixing is sometimes prohibited for identifiers. For scripts that don't share shapes that can be confused with ASCII, allowing such mixing in identifiers is not too problematic, but it is best avoided in other scripts that contain letters with similar shapes.

Q: Where can I find out more about security issues with Unicode and globalization software?

For general explanations of issues and recommended approaches, see UTR #36, Unicode Security Considerations. For recommended mechanisms (and data for implementing them) for handling certain security issues, see UTS #39, Unicode Security Mechanisms.

Q: Where can I find out about security issues connected with Internationalized Domain Names (IDNs)?

See the Internationalized Domain Names (IDN) FAQ and UTS #46, Unicode IDNA Compatibility Processing. For an in-depth discussion of issues for IDNs for commonly-used modern scripts, including mitigation steps that can be implemented on the registry level, look up the specifications and proposal documents for the DNS Root Zone Label Generation Rules. [AF]

Q: How can funny user names cause problems?

There is a very nicely written article at http://labs.spotify.com/2013/06/18/creative-usernames/ that shows how a failure to ensure idempotence allowed a security breach when certain characters were permitted. The article also mentions problems with mismatched versions of Unicode libraries, which may also cause security issues.
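To make the idempotence issue concrete, here is a small sketch of how a username canonicalization function can fail to be idempotent, meaning that applying it twice gives a different result than applying it once. This is not the code from the article; the function `naive_canonical`, the helper `is_idempotent_for`, and the sample username are hypothetical and chosen only for illustration. The sketch assumes a canonicalization that lowercases first and then applies NFKC normalization.

```python
import unicodedata


def naive_canonical(username: str) -> str:
    """Hypothetical canonicalization: lowercase, then NFKC-normalize."""
    return unicodedata.normalize("NFKC", username.lower())


def is_idempotent_for(func, value: str) -> bool:
    """Check that applying func a second time changes nothing."""
    once = func(value)
    return func(once) == once


# A username written with MODIFIER LETTER CAPITAL characters
# (U+1D2E B, U+1D35 I, U+1D33 G, U+1D3F R, U+1D30 D).
tricky = "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"

# These letters have no lowercase mapping, so lower() leaves them alone,
# but NFKC then replaces them with ordinary Latin capitals.
first_pass = naive_canonical(tricky)
second_pass = naive_canonical(first_pass)

print(first_pass, second_pass, is_idempotent_for(naive_canonical, tricky))
```

If one part of a system canonicalizes a name once and another part canonicalizes it again, the two parts can disagree about which account the name refers to. Checking candidate functions for idempotence over the permitted character repertoire is one way to catch this class of problem.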

Q: Where can I find some examples of coding exploits involving ill-formed strings?

See https://msrc-blog.microsoft.com/2009/05/18/more-information-about-the-iis-authentication-bypass.

Q: Where can I find out more about the Paypal domain name scam?

There's a Wiki page about the "PaypaI" scam: https://en.wikipedia.org/wiki/PayPaI.