Security Issues FAQ
Q: I've heard claims that Unicode poses
security issues. Is that right?
A: A common security issue is 'spoofing', the deliberate
misspelling of a domain or user name to trick unaware users into
interacting with a hostile site as if it were a trusted site.
Spoofing need not be exact to be effective; for example, the digit
'1' can stand in for the letter 'l'. The Unicode Standard contains many
"confusables," that is, characters whose glyphs, due to historical
derivation or sheer coincidence, resemble each other more or less
closely. Certain security-sensitive applications or systems may be
vulnerable due to possible misinterpretation of these confusables by
their users. [AF] & [DE]
Q: How serious is the problem of spoofing with Unicode characters?
A: It is important to recognize that the role of visually confusable characters in spoofing is often overstated. Confusable characters account for a small proportion of phishing problems: most instances of phishing involve social engineering or simple misleading domain names such as "secure-wellsfargo.com". For more information, see http://www.bortzmeyer.org/idn-et-phishing.html. (It is in French, but if you don't read French you can use Google Translate or a similar service to get the gist of the document.)
Q: Is this a problem that is unique to Unicode?
A: No, many legacy character sets, including ISO/IEC
8859-1, also contain confusables (albeit usually fewer of them) and
carry the same risks when it comes to spoofing. [AF] & [DE]
Q: Why is it impossible to give
all characters that use the same glyph a single code?
A: Unicode encodes characters, not glyphs. If characters were unified
strictly on the basis of appearance, many common text processing tasks
would become convoluted or impossible. For example, Latin 'B' and
Greek Beta 'Β' look the same in most fonts, but lowercase to two
different letters, Latin 'b' and Greek beta 'β', which have very
distinct appearances. [AF] & [DE]
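
The case-mapping point above can be checked directly. The following is a minimal Python sketch using only the standard library; the variable names and the describe helper are illustrative and not part of any Unicode API.

```python
import unicodedata

# Two capital letters that look alike in most fonts but are distinct characters.
latin_b = "\u0042"     # LATIN CAPITAL LETTER B
greek_beta = "\u0392"  # GREEK CAPITAL LETTER BETA


def describe(ch):
    """Hypothetical helper: show a character's code point and Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


lower_latin = latin_b.lower()
lower_greek = greek_beta.lower()

print(describe(latin_b), "->", describe(lower_latin))
print(describe(greek_beta), "->", describe(lower_greek))
```

If the two capitals shared one code, a lowercasing routine would have no way to know which of the two lowercase letters to produce.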
Q: Are Latin characters commonly mixed with other scripts?
A: Yes. Because of the widespread commercial use of English and other languages written in the Latin script, it is quite common to find Latin-script characters (especially ASCII) in text that principally consists of other scripts, such as "خدمة RSS". This mixing is sometimes prohibited for identifiers, but not always.
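
As a rough illustration of how such mixing might be noticed, the Python sketch below groups letters by the first word of their Unicode character name. This is only a heuristic and an assumption of this example: it does not use the real Unicode Script property, which the Python standard library does not expose, and the rough_scripts function is hypothetical. An identifier policy would more likely build on the script data and mixed-script detection described in UTS #39.

```python
import unicodedata


def rough_scripts(text):
    """Heuristic: collect the first word of each letter's Unicode name.

    This approximates the script of each letter; it is not the Script property.
    """
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


# The sample from the answer above: four Arabic letters, a space, then "RSS".
sample = "\u062e\u062f\u0645\u0629 RSS"
found = rough_scripts(sample)
print(sorted(found))
```

A result containing more than one entry only signals that the text is mixed; whether that is acceptable depends on the context, since ordinary running text mixes scripts legitimately.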
Q: Where can I find out more about security issues with Unicode and globalization software?
A: For general explanations of issues and recommended approaches, see UTR #36, Unicode Security Considerations. For recommended mechanisms (and data for implementing them) for handling certain security issues, see UTS #39, Unicode Security Mechanisms.
Q: Where can I find out about security issues connected with Internationalized Domain Names (IDNs)?
A: See the Internationalized Domain Names (IDN) FAQ and UTS #46, Unicode IDNA Compatibility Processing.
Q: How can funny user names cause problems?
A: There is a very nicely written article at http://labs.spotify.com/2013/06/18/creative-usernames/ that shows how a user name canonicalization that was not idempotent allowed a security breach when certain characters were allowed. The article also mentions problems with mismatched versions of Unicode libraries, which may also cause security issues.
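
To make the idempotence requirement concrete, here is a small Python sketch. It is not the code from the article: naive_canonical and stable_canonical are hypothetical functions written for this illustration, and the sample string is simply a sequence of modifier letters that compatibility normalization turns into ASCII capitals. A canonicalization f is idempotent when f(f(x)) equals f(x) for every input.

```python
import unicodedata


def naive_canonical(name):
    """Hypothetical canonicalization: lowercase first, then apply NFKC."""
    return unicodedata.normalize("NFKC", name.lower())


def stable_canonical(name):
    """Hypothetical alternative: NFKC, then case fold, then NFKC again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)


# Modifier capital letters B, I, G, B, I, R, D. They have no lowercase
# mapping, but NFKC replaces each one with an ordinary ASCII capital.
tricky = "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"

once = naive_canonical(tricky)
twice = naive_canonical(once)
print(once, twice, once == twice)

stable_once = stable_canonical(tricky)
stable_twice = stable_canonical(stable_once)
print(stable_once, stable_twice, stable_once == stable_twice)
```

With the naive version, a second pass changes the result, so two code paths that canonicalize a different number of times can disagree about which account a name refers to. The second function is stable for this sample; the same check would need to be run over the full set of characters a system actually permits.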
Q: Where can I find some examples of coding exploits involving ill-formed strings?
A: See http://blogs.technet.com/srd/archive/2009/05/18/more-information-about-the-iis-authentication-bypass.aspx.
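
One general defense against ill-formed input is to decode strictly and reject anything that fails, before any security check looks at the text. The Python sketch below illustrates this with an overlong UTF-8 sequence; it is not taken from the linked post, and decode_strict and the sample byte strings are assumptions made for this example.

```python
def decode_strict(raw):
    """Return the decoded text, or None if raw is not well-formed UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


well_formed = b"/admin"
# 0xC0 0xAF is an overlong (ill-formed) two-byte encoding of "/".
overlong = b"\xc0\xafadmin"

print(decode_strict(well_formed))
print(decode_strict(overlong))
```

A lenient decoder that accepted the second input as "/admin" would let it slip past any filter that compared raw bytes, which is why conformant decoders must treat such sequences as errors.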
Q: Where can I find out more about the Paypal domain name scam?
A: There's a Wikipedia page about the "PaypaI" scam: https://en.wikipedia.org/wiki/PayPaI.