Unicode Frequently Asked Questions

Internationalized Domain Names (IDN) FAQ

Q: What is an Internationalized Domain Name (IDN)?

A: Domain names, such as "macchiati.blogspot.com", were originally designed only to support ASCII characters. In 2003, a specification was released that allows most Unicode characters to be used in domain names. IDNs are supported by all modern browsers and email programs, so people can use links in their native languages, such as http://Bücher.de.

Q: Do IDNs change the Domain Name System (DNS)?

A: No. Internally, the non-ASCII Unicode characters are transformed into a special sequence of ASCII characters. So as far as the DNS system is concerned, all domain names are just ASCII.
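Internally, that special sequence is the Punycode encoding (RFC 3492) of each non-ASCII label, marked with the ASCII-Compatible Encoding prefix "xn--". A minimal sketch using Python's built-in idna codec (an IDNA2003 implementation); the codec names and example domain are just illustrations:

```python
# Sketch: how a non-ASCII domain name becomes plain ASCII for the DNS.
# Uses Python's built-in "idna" codec (an IDNA2003 implementation) and
# the underlying "punycode" codec.

# Each non-ASCII label is Punycode-encoded and prefixed with "xn--":
print("bücher.de".encode("idna"))       # b'xn--bcher-kva.de'

# The raw Punycode of one label, without the "xn--" prefix:
print("bücher".encode("punycode"))      # b'bcher-kva'

# Decoding recovers the Unicode form:
print(b"xn--bcher-kva.de".decode("idna"))  # bücher.de
```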

Q: When will IDNs be available?

A: IDNs have been defined and in use since 2003, under a system called "IDNA2003". There was a lot of news about Internationalized Domain Names being made available by ICANN in November of 2009, but many of the reports were misleading: it was only the top-level domains, like the "org" in "unicode.org", that could not previously use non-ASCII characters.

Q: What is IDNA2008?

A: It is a revision of IDNA2003, approved in 2010. For most Unicode characters it produces the same results as IDNA2003, but there are important classes of characters for which it is not backwards compatible with IDNA2003. See [RFC Numbers at http://www.unicode.org/reports/tr46/#IDNA2008].

Q: How does IDNA2008 differ from IDNA2003?

A: It disallows about eight thousand characters that used to be valid, including all uppercase characters, full/half-width variants, symbols, and punctuation. It also interprets four characters differently.

Q: Which four characters are interpreted differently?

A: Four characters can cause an IDNA2008 implementation to go to a different web page than an IDNA2003 implementation, given the same source, such as href="http://faß.de". These four characters include some that are quite common in languages such as German, Greek, Farsi, and Sinhala:

U+00DF ( ß ) LATIN SMALL LETTER SHARP S
U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA
U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER

For the purposes of discussion of differences between IDNA versions, these characters are called "deviations".
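The practical effect of a deviation character can be sketched in Python: the built-in idna codec implements IDNA2003 (which maps ß to "ss"), while the third-party idna package implements IDNA2008 (which keeps ß). The package import is guarded below because it may not be installed:

```python
# Sketch: the "deviation" problem for U+00DF. Under IDNA2003, ß is
# mapped to "ss" before lookup, so faß.de and fass.de are the same host.
print("faß.de".encode("idna"))  # b'fass.de' (built-in IDNA2003 codec)

# An IDNA2008 implementation keeps ß as a distinct character, so the
# same link can resolve to a different host. The third-party "idna"
# package implements IDNA2008 and UTS #46:
try:
    import idna  # not in the standard library; may not be installed
    print(idna.encode("faß.de"))  # b'xn--fa-hia.de' (ß preserved)
    # UTS #46 transitional processing reproduces the IDNA2003 result:
    print(idna.encode("faß.de", uts46=True, transitional=True))  # b'fass.de'
except ImportError:
    pass
```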

Q: What is UTS #46?

A: UTS #46, also sometimes referred to as "TR46", is a Unicode specification that allows implementations to handle domain names compatibly during the transition from IDNA2003 to IDNA2008. The title is "Unicode IDNA Compatibility Processing".

UTS #46 also provides a preprocessing specification for mapping that can be used with a standard IDNA2008 implementation.

Q. Is UTS #46 an IETF publication?

A: No, IDNA2008 is an IETF specification, while UTS #46 is a specification of the Unicode Consortium.

Q: Why is UTS #46 necessary?

A: Browsers and other client software need to support existing pages, which were constructed under the IDNA2003 interpretation of international domain names. They also need to continue to meet their users' expectations, such as being able to type IDNs with capital letters, or to use the ideographic period in Japanese or Chinese domain names. In particular, the four "deviation" characters can cause significant security and usability problems; they and symbols can be phased out over time, but need some transitional support.

UTS #46 provides a compatibility bridge that allows implementations to handle both IDNA2003 and IDNA2008 domain names. For the specification and more background information, see UTS #46.

Q: What are examples of IRIs where characters behave differently under IDNA2008?

A: Here is a table showing internationalized domain names in the context of IRIs, illustrating the differences in characters:

| URL | IDNA2003 | UTS #46 | IDNA2008 | Comments |
|---|---|---|---|---|
| href="http://öbb.at" | Valid | Valid | Valid | Simple characters |
| href="http://ÖBB.at" | Valid† | Valid† | Disallowed | Case mapping is not part of IDNA2008 |
| href="http://√.com" | Valid | Valid | Disallowed | Symbols are disallowed in IDNA2008 |
| href="http://faß.de" | Valid† | Valid† | Valid | Deviation (different resulting IP address in IDNA2008) |
| href="http://ԛәлп.com" | Valid‡ | Valid | Valid | IDNA2003 only allows Unicode 3.2 characters, excluding U+051B ( ԛ ) CYRILLIC SMALL LETTER QA |
| href="http://Ⱥbby.com" | Valid‡ | Valid† | Disallowed | IDNA2003 only allows Unicode 3.2 characters, excluding U+023A ( Ⱥ ) LATIN CAPITAL LETTER A WITH STROKE; case mapping is not part of IDNA2008 |

† Mapped to different characters, e.g. lowercased.
‡ Characters introduced after Unicode 3.2 were valid on lookup, but not for registration.

For a more detailed account of the similarities and differences, with character counts, see Section 8, IDNA Comparison in UTS #46. For a demonstration of differences between IDNA2003, IDNA2008, and the Unicode IDNA Compatibility Processing, see the IDNA demo.
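The IDNA2003 column of the table above can be checked with Python's built-in idna codec (an IDNA2003 implementation); the punycode codec shows the raw label encoding for the symbol row:

```python
# Sketch: IDNA2003 behavior for two rows of the comparison table,
# using Python's built-in "idna" and "punycode" codecs.

# Case mapping is part of IDNA2003, so ÖBB.at and öbb.at are the same:
print("ÖBB.at".encode("idna"))  # b'xn--bb-eka.at'
print("öbb.at".encode("idna"))  # b'xn--bb-eka.at'

# Symbols such as U+221A are permitted by IDNA2003 (but disallowed by
# IDNA2008); this is the raw Punycode of the label:
print("√".encode("punycode"))   # b'19g'  → the ASCII label xn--19g
```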

Q: What are the main advantages of IDNA2008?

A: The main advantages are:

  • Updates the repertoire of allowed characters from Unicode 3.2 to Unicode 5.2

  • Makes the process of updating to future Unicode versions (mostly) automatic

  • Allows needed sequences (combining marks at the end of a bidi label)

  • Improves BIDI restrictions (Arabic/Hebrew)

  • Clarifies that what people register is the unmapped form of a domain name

  • Makes it clear exactly what strings can be registered

Q: What are the disadvantages of IDNA2008?

A: If IDNA2003 had not existed, then there would be few disadvantages to IDNA2008. Given that IDNA2003 does exist, and is widely deployed, during the transition period the main disadvantages are:

  • Changes the interpretation of the 4 characters known as Deviations

  • Discontinues IDNA2003 case mappings and mappings for other variants

  • Excludes symbols and punctuation

  • Allows arbitrary 'local' mappings, which may result in the same IRI resolving to different IP addresses, depending on the mapping used

Q: How much of a problem would it actually be if support for symbols like √ were dropped immediately?

A: While http://√.com is valid in an IDNA2003 implementation, it would fail in an IDNA2008 implementation. This affects 3,254 characters, most of which are rarely used. A small percentage of those are security risks because of confusability. The vast majority are unproblematic: for example, having http://I♥NY.com doesn't cause security problems. IDNA2008 has additional tests that are based on the context in which characters are found, but they apply to few characters, and don't provide any appreciable increase in security.

Q: Doesn't the removal of symbols and punctuation in IDNA2008 help security?

A: Surprisingly, not really. The vast majority of security exploits are of the form "security-wellsfargo.com", where no special characters are involved. For more information, see Stéphane Bortzmeyer's blog entry, idn-et-phishing (in French), which cites several interesting studies (originally from Mike Beltzner of Mozilla).

Even among the fraction of exploits that do use confusable characters, IDNA2008 doesn't do anything about the most frequent sources of character-based spoofing: look-alike characters that are both letters, like "http://paypal.com" with a Cyrillic "a". If a symbol that can spoof a letter X is removed, but another letter that can spoof X is retained, there is no net benefit.

According to data from Google, the removal of symbols and punctuation in IDNA2008 reduces opportunities for spoofing by only about 0.000016%, weighted by frequency. In another study at Google of a billion web pages, the top 277 confusable URLs used confusable letters or numbers, not symbols or punctuation. The 278th page had a confusable URL with × (U+00D7 MULTIPLICATION SIGN, by far the most common confusable symbol); but that page could be even better spoofed with х (U+0445 CYRILLIC SMALL LETTER HA), which normally has precisely the same displayed shape as "x".

For a demo of confusable characters, and the effects of various restrictions, see the confusables demo.
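The point can be illustrated with Python's unicodedata module: the most effective look-alikes are letters from other scripts, which IDNA2008's removal of symbols does not touch.

```python
import unicodedata

# Sketch: why dropping symbols barely helps. Common spoofs use letters
# that look alike across scripts, which IDNA2008 still allows.
for ch in ("x", "х"):  # Latin x vs Cyrillic ha
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0078 LATIN SMALL LETTER X
# U+0445 CYRILLIC SMALL LETTER HA

# Both are letters (General_Category Ll), so neither is excluded by
# IDNA2008's removal of symbols and punctuation:
print(unicodedata.category("х"))  # Ll
print(unicodedata.category("×"))  # Sm (the multiplication sign IS excluded)
```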

Programmers still need to be aware of these issues, as detailed in UTR #36: Unicode Security Considerations [UTR36]; mechanisms for detecting potentially visually-confusable characters are found in the associated UTS #39: Unicode Security Mechanisms [UTS39].

Q: How does IDNA2008 improve handling of Arabic and Hebrew (BIDI)?

A: Arabic and Hebrew writing systems are known as bidi (bidirectional) because text runs from right-to-left and numbers (or embedded Latin characters) from left-to-right. IDNA2008 does a better job of restricting labels that lead to "bidi label hopping". This is where bidi reordering causes characters from one label to appear to be part of another label. For example, "B1.2d" in a right-to-left paragraph (where B stands for an Arabic or Hebrew letter) would display as "1.2dB". For more information, see the Unicode bidi demo.

While these new bidi rules go a long way towards reducing this problem, they do not completely eliminate it because they do not check for inter-label problems.
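The raw material of bidi label hopping is the mix of Bidi_Class values inside a label, which can be inspected with Python's unicodedata module. Here א (a right-to-left letter) stands in for the "B" of the example above:

```python
import unicodedata

# Sketch: Bidi_Class values for the characters in the "B1.2d" example,
# with the Hebrew letter alef standing in for "B".
for ch in "א1.2d":
    print(f"U+{ord(ch):04X}: {unicodedata.bidirectional(ch)}")
# U+05D0: R   (right-to-left letter)
# U+0031: EN  (European number)
# U+002E: CS  (common number separator)
# U+0032: EN
# U+0064: L   (left-to-right letter)
```

It is the adjacency of R, EN, and CS runs across a label boundary that lets the display order cross labels; IDNA2008's bidi rule constrains which of these classes may begin and end a label.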

Q: Are the local mappings in IDNA2008 just a UI issue?

A: No, not if what is meant is that they are only involved in interactions with the address bar.

Example:

  • Alice sees that a URL works in her browser (say http://faß.de or http://TÜRKIYE.com). She sends it to Bob in an email. Bob clicks on the link in his email, and doesn't find a site or goes to a wrong (and potentially malicious) site, because his browser maps to http://fass.de or http://türkiye.com while Alice's maps to http://faß.de or http://türkıye.com.

There are parallel examples with web pages, IM chats, Word documents, etc.

  • Alice creates a web page, using <a href="http://faß.de"> (or http://TÜRKIYE.com). Bob clicks on the link in the web page, and doesn't find a site or goes to a wrong (and potentially malicious) site.

  • Alice is in an IM chat with Bob. She copies in http://faß.de (or http://TÜRKIYE.com) and hits return. Bob clicks on the link he sees in his chat window, and doesn't find a site or goes to a wrong (and potentially malicious) site.

  • Alice sends a Word document to Bob with a link in it...

  • Alice creates a PDF document...

Q: Do the local-mapping exploits require unscrupulous registries?

A: No. The exploits do not require unscrupulous registries—they only require that registries fail to police every URL that they register for possible spoofing behavior.

The local mappings matter to security, because entering the same URL in two different browsers may go to two different IP addresses when the two browsers have different local mappings. The same thing could happen within an email program that parses for URLs and then opens a browser. Moreover, local mappings are even more problematic if they affect the interpretation of web pages, in cases like href="http://TÜRKIYE.com".

Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map eszett (ß) to "ss", and delete ZWJ/ZWNJ?

A: This decision about the mapping of these characters followed recommendations for case-insensitive matching in the Unicode Standard. These characters are anomalous: the uppercase of ς is Σ, the same as the uppercase of σ. Note that the text "ΒόλοΣ.com", which appears on http://Βόλος.com, illustrates this: the normal case mapping of Σ is to σ. If σ and ς were not treated as case variants in Unicode, there wouldn't be a match between ΒόλοΣ and Βόλος.

Similarly, the standard uppercase of ß is "SS", the same as the uppercase of "ss". Note, for example, that on the German language page for http://www.uni-giessen.de, "Gießen" is spelled with ß, but the logo for the university (see the top left corner of the page) is spelled with GIESSEN. The situation is even more complicated:

  • In Switzerland, "ss" is uniformly used instead of ß.

  • The recent spelling reform in Germany and Austria changed whether ß or ss is used in many words. For example, http://Schloß.de was the spelling before 1996, and http://Schloss.de is the spelling after.

  • In Unicode 5.1, an uppercase version of ß was added (ẞ), because it is attested in some (rare) cases. It is not now, however, the preferred uppercase of ß in German standards, nor is it known whether it will ever become the preferred uppercase. Unicode now treats all of these as a single equivalence class for case-insensitive matching: {ss, ß, SS, ẞ}. See also the Unicode FAQ.

  • Both the German and Austrian NICs (responsible for .de and .at, respectively) favored keeping the mapping from ß to "ss".

For full case insensitivity (with transitivity), {ss, ß, SS} and {σ, ς, Σ} need to be treated as equivalent, with one of each set chosen as the representative in the mapping. That is what is done in the Unicode Standard, which was followed by IDNA2003. While IDNA2003 did not have to have full case transitivity, that is water under the bridge.
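This full case folding is what Python's str.casefold() implements, so the equivalence classes can be verified directly:

```python
# Sketch: Unicode full case folding puts {ss, ß, SS, ẞ} and {σ, ς, Σ}
# into single equivalence classes, as the Unicode Standard (and hence
# IDNA2003) specifies. Python's str.casefold() implements this folding.
assert "ß".casefold() == "SS".casefold() == "ẞ".casefold() == "ss"
assert "ς".casefold() == "Σ".casefold() == "σ"

# This is what makes ΒόλοΣ and Βόλος match case-insensitively:
assert "ΒόλοΣ".casefold() == "Βόλος".casefold()
print("all equivalence classes confirmed")
```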

ZWJ and ZWNJ are normally invisible, which allows them to be used for a variety of spoofs. Invisible characters (like these and soft-hyphen) are allowed on input in IDNA2003, but are deleted so that they do not allow spoofs.

Q: Why allow ZWJ/ZWNJ at all?

A: During the development of Unicode, the ZWJ and ZWNJ were intended only for presentation—that is, they would make no difference in the semantics of a word. Thus the IDNA2003 mapping should and does delete them. That result, however, should never really be seen by users—it should be just a transient form used for comparison. Unfortunately, the way IDN works, this "comparison format" (with transformations of eszett, final sigma, and deleted ZWJ/ZWNJ) ends up being visible to the user, unless a display format is used that differs from the format used to transform for lookup.

For example, there are words such as the name of the country of Sri Lanka, which require preservation of these joiners (in this case, ZWJ) in order to appear correct to the end users in display after mapping.

Q: But aren't the deviation characters needed for the orthographies of some languages?

A: While these are full parts of the orthographies of the languages in question, neither IDNA2003 nor IDNA2008 ever claimed that all parts of every language's orthographies are representable in domain names. There are trivial examples even in English, like the word can't (vs cant) or Wendy's/Arby's Group, which use standard English orthography but cannot be represented faithfully in a domain name.

Q: Aren't the problems with eszett and final sigma just the same as with l, I, and 1?

A: No. The eszett and sigma are fundamentally different from I (capital i), l (lowercase L), and 1 (digit one). With the following (using a digit 1), all browsers will go to the same location, whether they are old or new:

http://goog1e.com

In the following hypothetical example using a top-level domain "xx", browsers that use IDNA2003 will go to a different location than browsers that use a strict version of IDNA2008, unless the registry for xx puts into place a bundling strategy.

http://gießen.xx

The same goes for Greek sigma, which is a more common character in Greek than the eszett is in German.

Q: Why doesn't IDNA2008 (or for that matter IDNA2003 or UTS #46) restrict allowed domains on the basis of language?

A: It is extremely difficult to restrict on the basis of language, because the letters used in a particular language are not well defined. The "core" letters typically are, but many others are typically accepted in loan words, and have perfectly legitimate commercial and social use.

It is a bit easier to maintain a clear distinction based on script differences between characters: every Unicode character has a defined script (or is Common/Inherited). Even there it is problematic to have that as a restriction. Some languages, such as Japanese, require multiple scripts. And in most cases, mixtures of scripts are harmless. One can have http://SONY日本.com with no problems at all—while there are many cases of "homographs" (visually confusable characters) within the same script that a restriction based on script doesn't deal with.

The rough consensus among the IETF IDNA working group is that script/language mixing restrictions are not appropriate for the lowest-level protocol. So in this respect, IDNA2008 is no different than IDNA2003. IDNA doesn't try to attack the homograph problem, because it is too difficult to maintain a clear distinction. Effective solutions depend on information or capabilities outside of the protocol's control, such as language restrictions appropriate for a particular registry, the language of the user looking at this URL, the ability of a UI to display suspicious URLs with special highlighting, and so on.

Responsible registries can apply such restrictions. For example, a country-level registry can decide on a restricted set of characters appropriate for that country's languages. Application software can also take certain precautions—MSIE, Safari, and Chrome all display domain names in Unicode only if the user's language(s) typically use the scripts in those domain names. For more information on the kinds of techniques that implementations can use, see UTR #36: Unicode Security Considerations [UTR36] on the Unicode web site.

Q: Are there differences in mapping between UTS #46 and IDNA2003?

A: No. There are, however, 56 characters that are valid or mapped under IDNA2003, but are disallowed by UTS #46. For a detailed table of differences between IDNA2003, UTS #46, and IDNA2008, see Section 8, IDNA Comparison in UTS #46.

In particular, there are collections of characters that would have changed mapping according to NFKC_Casefold after Unicode 3.2, unless they were specifically excluded. All of these characters are extremely rare, and do not require any special handling.

Case Pairs. These are characters that did not have corresponding lowercase characters in Unicode 3.2, but had lowercase characters added later.

U+04C0 ( Ӏ ) CYRILLIC LETTER PALOCHKA
U+10A0 ( Ⴀ ) GEORGIAN CAPITAL LETTER AN…U+10C5 ( Ⴥ ) GEORGIAN CAPITAL LETTER HOE
U+2132 ( Ⅎ ) TURNED CAPITAL F
U+2183 ( Ↄ ) ROMAN NUMERAL REVERSED ONE HUNDRED

After Unicode 3.2, the Unicode Consortium has stabilized case folding, so that further examples will not occur in the future. That is, case pairs will be assigned in the same version of Unicode—so any newly assigned character will either have a case folding in that version of Unicode, or it will never have a case folding in the future.
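The case pairs listed above now fold in current Unicode, which is why NFKC_Casefold would have changed for them after 3.2; this can be checked with Python's str.casefold() (the lowercase partners shown are the ones added after Unicode 3.2):

```python
# Sketch: case pairs whose lowercase partners were added after
# Unicode 3.2, so current case folding maps them where 3.2 did not.
pairs = {
    "\u04C0": "\u04CF",  # Ӏ CYRILLIC LETTER PALOCHKA → ӏ
    "\u10A0": "\u2D00",  # Ⴀ GEORGIAN CAPITAL LETTER AN → ⴀ
    "\u2132": "\u214E",  # Ⅎ TURNED CAPITAL F → ⅎ
    "\u2183": "\u2184",  # Ↄ ROMAN NUMERAL REVERSED ONE HUNDRED → ↄ
}
for upper, lower in pairs.items():
    assert upper.casefold() == lower
print("all case pairs fold as expected")
```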

Normalization Mappings. These are characters whose normalizations changed after Unicode 3.2 (all of them were in Unicode 4.0.0: see Corrigendum #4: Five Unihan Canonical Mapping Errors). As of Unicode 5.1, normalization is completely stabilized, so these are the only such characters.

U+2F868 ( 㛼 ) CJK COMPATIBILITY IDEOGRAPH-2F868 → U+2136A ( 𡍪 ) CJK UNIFIED IDEOGRAPH-2136A
U+2F874 ( 当 ) CJK COMPATIBILITY IDEOGRAPH-2F874 → U+5F33 ( 弳 ) CJK UNIFIED IDEOGRAPH-5F33
U+2F91F ( 𤎫 ) CJK COMPATIBILITY IDEOGRAPH-2F91F → U+43AB ( 䎫 ) CJK UNIFIED IDEOGRAPH-43AB
U+2F95F ( 竮 ) CJK COMPATIBILITY IDEOGRAPH-2F95F → U+7AAE ( 窮 ) CJK UNIFIED IDEOGRAPH-7AAE
U+2F9BF ( 䗗 ) CJK COMPATIBILITY IDEOGRAPH-2F9BF → U+4D57 ( 䵗 ) CJK UNIFIED IDEOGRAPH-4D57
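Modern Unicode libraries implement the corrected (post-Corrigendum #4) mappings, which is easy to confirm with Python's unicodedata module: U+2F868 now normalizes to U+36FC rather than to the erroneous Unicode 3.2 target U+2136A.

```python
import unicodedata

# Sketch: the corrected canonical mapping of one of the five
# Corrigendum #4 characters, as implemented by current Unicode data.
corrected = unicodedata.normalize("NFC", "\U0002F868")
print(f"U+{ord(corrected):04X}")  # U+36FC
```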

Q: How do current implementations handle normalization for IDNA2003?

A: There were two corrigenda to normalization issued after Unicode 3.2. Formally speaking, an implementation applying IDNA2003 would disregard these corrigenda, but browsers do not consistently implement this behavior. In practice this makes no difference, since the characters and character sequences involved are not found except in specially-devised test cases, so it is understandable that systems may not want to maintain the extra code necessary to duplicate the broken Unicode 3.2 behavior.

Corrigendum #4: Five Unihan Canonical Mapping Errors

Corrigendum #5: Normalization Idempotency

 

Example

  • 2F868 (㛼) = xn--g22n
    • 3.2 normalization → xn--j74i = 2136A (𡍪)
    • 5.2 normalization → xn--snl = 36FC (㛼)

Example Behavior

  • IE/Chrome/Safari - 3.2
  • FF - 5.2

 

Corrigendum #5 deals with a subtle algorithmic problem.

Example

  • 1100 0300 1161 0323 (ᄀ̀ᅡ̣) = xn--ksa4ez54cela
    • 3.2 normalization → xn--ksa4ez795d = AC00 0300 0323 (가̣̀)
      → xn--ksa3e0795d = AC00 0323 0300 (가̣̀)
    • 5.2 normalization → xn--ksa4ez54cela = 1100 0300 1161 0323 (ᄀ̀ᅡ̣)

Example Behavior

  • IE - 5.2
  • Chrome/Safari - 3.2
  • FF - 3.2 -- applied twice

As of Unicode 5.1, normalization was completely stabilized, so such changes will not happen in the future.

Q: What are possible strategies for preparing IDNs in a display form preferred by target sites?

A: Labels presented to a browser may or may not be in the display form preferred by a target site. For example, a site may have a preferred display form of “HumanEvents.com”, but an href tag in another site may display “HumaneVents.com”. Similarly, a user may type “Floß.com” in the browser’s address bar, and that would resolve to the site “floss.com”, though it is unclear whether the display form preferred by owners of that site is “Floss.com”, “floss.com”, “Floß.com”, or “floß.com”. There is no way currently for the browser to know whether the labels are in a preferred form or not.

It may be useful to develop mechanisms to allow browsers to determine the display form preferred by a target site, and then for browsers to display that form. One could foresee something being developed along the lines of the favicon approach. The mechanisms would need to have restrictions put into place to address misrepresentations. For example, the browser should verify that the site's preferred display form has the same lookup form: if the href is "http://βόλοσ.com", and the site's preferred display form is "http://Βόλος.com", then the preferred display form could be used; if the site's preferred display form is "http://Βόλλος.com", then it would not be used, because it doesn't have the same lookup form as the href. Other security checks should be made, such as to prevent display forms like "appIe.com" (with a capital I) for "appie.com" (with a lowercase i).

Q: How are label delimiters handled in implementations of IDNA?

A: The processing of UTS #46 matches what is commonly done with label delimiters by browsers: the full domain name is mapped and normalized to NFKC form before the labels are separated. This allows the domain name to be mapped in a single pass, rather than label by label. However, except for the four label separators provided by IDNA2003, all input characters that would map to a period are disallowed. For example, U+2488 ( ⒈ ) DIGIT ONE FULL STOP has a decomposition that maps to a period, and is thus disallowed. The exact list of characters can be seen with the Unicode utilities using a regular expression:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{toNFKC=/\./}
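The decomposition in question can be seen with Python's unicodedata module; the four IDNA2003 label separators are listed alongside for comparison:

```python
import unicodedata

# Sketch: why U+2488 is disallowed rather than treated as a label
# separator. Its compatibility decomposition contains a FULL STOP:
print(unicodedata.normalize("NFKC", "\u2488"))  # 1.

# The four label separators of IDNA2003, for comparison; these are
# mapped to U+002E before labels are split:
LABEL_SEPARATORS = {
    "\u002E",  # FULL STOP
    "\u3002",  # IDEOGRAPHIC FULL STOP
    "\uFF0E",  # FULLWIDTH FULL STOP
    "\uFF61",  # HALFWIDTH IDEOGRAPHIC FULL STOP
}
```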

The question also arises as to how to handle escaped periods (such as %2E). While escaping of periods is outside of the scope of this document, it is useful to see how both of these cases are handled in current browsers:

| Input | http://à%2Ecom | http://à⒈com |
|---|---|---|
| Internet Explorer | http://xn--0ca.com/ = "." | http://xn--1-rfa.com/ = "1." |
| Firefox | http://www.xn--.com-hta.com/ ≠ "." | http://xn--1-rfa.com/ = "1." |
| Safari / Chrome | http://xn--0ca.com/ = "." | http://xn--1.com-qqa/ ≠ "1." |

There are three possible behaviors for characters such as U+2488 ( ⒈ ) DIGIT ONE FULL STOP:

  1. The dot behaves like a label separator.

  2. The character is rejected.

  3. The dot is included in the label, as shown in the garbled punycode seen above in the ≠ cases.

The conclusion of the Unicode Technical Committee was that the best behavior for UTS #46 was #2, to forbid all characters (other than the 4 label separators) that contained a FULL STOP in their compatibility decompositions. This is the same behavior as IDNA2003. Although this policy is not the current policy of the majority of browser implementations, the browser vendors agreed that the change is desirable.

Q: For IDNA2008, what is the derivation of valid characters in terms of Unicode properties?

A: Using formal set notation, the following describes the set of allowed characters defined by IDNA2008. This set corresponds to the union of the PVALID, CONTEXTJ, and CONTEXTO characters defined by the Tables document of IDNA2008.

Formal Sets Descriptions
[ \P{Changes_When_NFKC_Casefolded}

Start with characters that are unchanged by NFKC casefolding (as in IDNA2003)

- \p{c} - \p{z}

Remove Control Characters and Whitespace (as in IDNA2003)

- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}

Remove Symbols, Punctuation, non-decimal Numbers, and Enclosing Marks

- \p{HST=L} - \p{HST=V} - \p{HST=T}

Remove characters used for archaic Hangul (Korean)

- \p{block=Combining_Diacritical_Marks_For_Symbols}
- \p{block=Musical_Symbols}
- \p{block=Ancient_Greek_Musical_Notation}

Remove three blocks of technical or archaic symbols.

- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]

Remove certain exceptions:
U+0640 ( ‎ـ‎ ) ARABIC TATWEEL
U+07FA ( ‎ߺ‎ ) NKO LAJANYALAN
U+302E ( 〮 ) HANGUL SINGLE DOT TONE MARK
U+302F ( 〯 ) HANGUL DOUBLE DOT TONE MARK
U+3031 ( 〱 ) VERTICAL KANA REPEAT MARK
U+3032 ( 〲 ) VERTICAL KANA REPEAT WITH VOICED SOUND MARK
..
U+3035 ( 〵 ) VERTICAL KANA REPEAT MARK LOWER HALF
U+303B ( 〻 ) VERTICAL IDEOGRAPHIC ITERATION MARK

+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]
+ [\u002D \u06FD \u06FE \u0F0B \u3007]

Add certain exceptions:
U+00B7 ( · ) MIDDLE DOT
U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN
U+05F3 ( ‎׳‎ ) HEBREW PUNCTUATION GERESH
U+05F4 ( ‎״‎ ) HEBREW PUNCTUATION GERSHAYIM
U+30FB ( ・ ) KATAKANA MIDDLE DOT
plus
U+002D ( - ) HYPHEN-MINUS
U+06FD ( ‎۽‎ ) ARABIC SIGN SINDHI AMPERSAND
U+06FE ( ‎۾‎ ) ARABIC SIGN SINDHI POSTPOSITION MEN
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
U+3007 ( 〇 ) IDEOGRAPHIC NUMBER ZERO

+ [\u00DF \u03C2]
+ \p{JoinControl}]
Add special exceptions (Deviations):
U+00DF ( ß ) LATIN SMALL LETTER SHARP S
U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA
U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER
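As a rough, non-conformant sketch, the derivation above can be approximated with Python's unicodedata module. Assumptions: NFKC_Casefold is emulated as NFKC + casefold + NFKC, and the Hangul_Syllable_Type and block-based removals are omitted for brevity, so this is illustrative only, not an IDNA2008 table implementation.

```python
import unicodedata

# Removed exceptions (tatweel, lajanyalan, Hangul tone marks,
# vertical kana repeat marks, ideographic iteration mark):
REMOVED_EXCEPTIONS = set("\u0640\u07FA\u302E\u302F\u303B") | {
    chr(c) for c in range(0x3031, 0x3036)}
# Added exceptions (middle dot, geresh, tsheg, etc.):
ADDED_EXCEPTIONS = set("\u00B7\u0375\u05F3\u05F4\u30FB"
                       "\u002D\u06FD\u06FE\u0F0B\u3007")
# Deviations plus the join controls:
DEVIATIONS = set("\u00DF\u03C2\u200C\u200D")

def idna2008_allowed(ch: str) -> bool:
    """Approximate the IDNA2008 derivation for a single character."""
    if ch in DEVIATIONS or ch in ADDED_EXCEPTIONS:
        return True
    if ch in REMOVED_EXCEPTIONS:
        return False
    # Start with characters unchanged by NFKC_Casefold (approximated):
    nfkc_cf = unicodedata.normalize(
        "NFKC", unicodedata.normalize("NFKC", ch).casefold())
    if nfkc_cf != ch:
        return False
    cat = unicodedata.category(ch)
    # Remove Control/Whitespace (C*, Z*), Symbols, Punctuation (S*, P*),
    # non-decimal Numbers, and Enclosing Marks (Nl, No, Me):
    if cat[0] in "CZSP" or cat in ("Nl", "No", "Me"):
        return False
    return True

print(idna2008_allowed("a"))   # True
print(idna2008_allowed("A"))   # False (changes under case folding)
print(idna2008_allowed("√"))   # False (symbol)
print(idna2008_allowed("ß"))   # True  (deviation)
```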