IDNAbis Preprocessing Draft
M. Davis, 2008-02-05; updated 2008-08-05
This document proposes the development of a specification for an internationalized domain name preprocessing step that
is intended for use with IDNAbis, the projected update for Internationalized Domain Names, and includes a rough draft of what such a specification might look like. The proposed specification would be compatible with IDNA2003 (the current version of Internationalized Domain Names), and consistently extend that mechanism for
characters introduced in any later Unicode version.
- IDNA Preprocessing Table
- Exclusion Table
- UCD Representation
Until 2003, domain names could only contain ASCII letters. The Internationalized Domain Name specifications adopted by the IETF in 2003 allow for Unicode characters in domain names. For example, one can now type "http://bücher.de" into the address bar of any modern browser, and it will go to the corresponding site, even though the "ü" is not an ASCII character. Internally, this is handled by transforming the string into a case-folded and normalized (NFKC) form, then mapping it to a sequence of ASCII characters using a transformation known as Punycode. For this case, the internal value is actually "http://xn--bcher-kva.de". The specifications for this are called the IDNA2003 specifications, which include: the IDNA base specification [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep [RFC3454].
Because of the transformation step in IDNA2003, not only can we type "Bücher.de", but we can also type any of "http://Bücher.de", "HTTP://BÜCHER.DE", or "HTTP://BU¨CHER.DE" (where the ¨ represents a U+0308 ( ̈ ) COMBINING DIAERESIS), or many other variations.
There is a projected update of IDNA2003 which is called either IDNAbis or IDNA2008 (the latter by those who think it could be adopted this year). There are quite a number of changes between these two versions: the one relevant to this document is that IDNAbis no longer has a casefolding and normalizing step. Instead, IDNAbis disallows any string that is not already case-folded and normalized. This means that strict adherence to IDNAbis, without any other action, would cause any of the above variant strings to fail. Thus typing "http://Bücher.de" would fail. (Strings containing all ASCII characters, such as "Bucher.de", would continue to work even with case variations.)
However, when using the IDNAbis specification, many user agents such as browsers will need to interoperate with the prior IDNA2003 specification and/or operate in environments that require lenient typing or parsing of internationalized domain names. In particular, there are many cases where an internationalized domain name is not formally allowed by the relevant specifications, but is in widespread de facto use. For example, here is a chart showing the behavior of some major browsers with links containing IDNs.
Browser Interpretation of IDNs

|||%C3%BC is the UTF-8 version; this is being implemented in upcoming versions of browsers|
|5|<a href="http://Bücher．com">|works|works|The dot is a U+FF0E ( ． ) FULLWIDTH FULL STOP|
|6|||The "1." is actually the single U+2488 ( ⒈ ) DIGIT ONE FULL STOP, so this maps to http://bücher1.de|
Note that #6 is not formally provided for in IDNA2003, because the transformation is handled there on a label-by-label basis. However, this form is commonly supported (probably because it is just simpler to apply the transformation to the whole host name rather than to each label individually).
Firefox and IE7
both accept all of these forms (except #4, which is coming), and interpret them as equivalent. Because of that, and because of the substantial number of existing (and future) web pages that contain these formats, implementations will have little choice but to support a preprocessing step that allows for all of the forms, into the indefinite future. Note that this is not simply a typing or UI issue; these are in existing HTML pages. These forms are also parsed out of plain text; for example, most email clients parse for URLs (or IRIs, the internationalized version) and add links to them. So an occurrence of "http://Bücher.de" in the middle of plain text will often be transformed as well.
IDNAbis (current draft) allows for preprocessing (called local mapping), and even allows the mappings to differ according to locale or program. But we certainly don't want different programs (browsers, email clients, etc.) to map these characters differently. That would cause a huge interoperability problem. Look, for example, at what could happen to different strings in the following table under locale-specific mappings.
Possible Local Mapping Variations

|1||legal as is|
|2||legal as is|
|3|http://schäffer.de|legal as is|
|4||always legal, matches #1|
|5||Could fail, or map to #1 for English, or #2 for German, etc.|
|6|http://Schæffer.de|Could map to #4 for English, or other languages without æ|
|7|http://Schäﬀer.de|Could fail, or map to #1. (The "ﬀ" is U+FB00 ( ﬀ ) LATIN SMALL LIGATURE FF.)|
|8||Could fail, or map to #1. (The ➀ here represents the normally invisible U+00AD SOFT HYPHEN.)|
|9|http://➀➁Schäffer➂.de|Could fail, or map to #1 or #2. (The ➀, ➁, and ➂ here represent the normally invisible U+E0065 TAG LATIN SMALL LETTER E, U+E006E TAG LATIN SMALL LETTER N, and U+E007F CANCEL TAG.)|
An IDNA2008-conformant implementation could remap any of items #4 to #9 in the Link Text column using a local mapping -- or not, in which case they would fail. It could remove the illegal characters in #8 and #9, or not remove them and have the lookup fail. It could map the ligature ﬀ to ff, or not. It could even decide, based on locale-specific linguistic mappings (using the UI language of the client, the language of the email, or the default system language), to map #5 and #6 to different valid domain names, different from what IDNA2003 produces. That means that from the same page, a browser might go to different places depending on what the user's language was.
With IDNA2003, in contrast, the mappings for all of these are fully determined (with all but the first being allowed, and the last being disallowed). Instead of a free-for-all of local mappings, what we need is a common mapping that maintains compatibility with IDNA2003, accommodates the newer Unicode characters, provides stability over time, and thus allows for interoperability among all programs that use it. In discussions with IETF people at the Ireland meeting, there appeared to be general consensus that the specification of such a mapping does not belong in the IETF.
This document proposes that the Unicode Consortium provide such a specification, along the lines sketched in this document.
Note that lower-level protocols, such as the SMTP envelope, should require the strict use of already-transformed IDNs, and thus not use the preprocessing specified here. Language-specific modifications to the preprocessing specified in this document are outside the scope of this document; they are, however, very strongly discouraged because of the problems they pose for interoperability.
- The exact formulation for IDNAbis is not final yet, and we would not want to release a mapping specification until it is final.
- We may want to also have a recommended postprocessing step, to deal with final sigma.
- We could add the assigned Unicode 5.1 Default-Ignorable characters (except the joiners) to the Removals (Section 3.3.1). (See below.)
- The mapping should be as compatible as possible with IDNA2003. To try out IDNA2003, see http://demo.icu-project.org/icu-bin/idnbrowser .
- Where the user has a reasonable expectation that giving the character "X" as input will be treated as equivalent to "X'", and it's possible to determine this unambiguously, this is what should happen.
- Where the preprocessing results in an "abort with error", the input is not interpreted as valid. In a user-interface, this should be indicated with a warning; in other processing (such as search-engine parsing of a web page), lookup would fail.
The input to the preprocessing is a domain_name string, which is a sequence of labels with dot separators, such as "Bücher.de". (For more about the parts of a URL, including the domain name, see http://tools.ietf.org/html/rfc3987 ). The preprocessing consists of the following steps, performed in order.

1. Convert the input domain_name string to Unicode.
   - Abort with error if there is any conversion problem.
2. Map the domain_name string according to the IDNA Preprocessing Table (see below).
3. Normalize the domain_name string to Unicode Normalization Form C:
   - domain_name = toNFC(domain_name)
   - Note: because of the construction of the table, characters are limited to those already allowed by NFKC, so this is equivalent to toNFKC().
4. Parse the domain_name string into labels, using U+002E ( . ) FULL STOP as the label delimiter.
   - Note that the dot may have resulted from a mapping from other characters, such as U+2488 ( ⒈ ) DIGIT ONE FULL STOP or U+FF0E ( ． ) FULLWIDTH FULL STOP. See below.
5. Verify that each label in the domain_name complies with IDNAbis.
   - Abort with error if it does not comply.
   - Each label that contains only characters [\-a-zA-Z0-9] is an ASCII label. Each other label must conform to the IDNAbis specification.
6. Return the string resulting from the successive application of the above steps.

Note that this processing matches what is commonly done with label delimiters, whereby characters whose decompositions contain periods are NFKC'd before labels are separated. These characters can be seen with the Unicode utilities using a regular expression.
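The steps above can be sketched in Python. The table excerpt and the label check below are illustrative placeholders, not the normative data; Python strings are already Unicode, so step 1 is a no-op here.

```python
import unicodedata

# Hypothetical excerpt of the IDNA Preprocessing Table (code point -> replacement).
# The real table is derived as described in Section 3; these entries are illustrative.
TABLE = {
    0x0042: "b",       # 'B' case-folds to 'b'
    0x00DC: "\u00FC",  # 'Ü' case-folds to 'ü'
    0x00AD: "",        # SOFT HYPHEN is removed (Section 3.3.1)
    0x3002: ".",       # IDEOGRAPHIC FULL STOP is remapped (Section 3.3.2)
}

def is_valid_label(label: str) -> bool:
    # Placeholder for the IDNAbis per-label check; a real implementation
    # would consult the IDNAbis tables. Here we only reject empty labels.
    return bool(label)

def preprocess(domain_name: str) -> str:
    # Step 2: map each code point; unlisted code points map to themselves.
    mapped = "".join(TABLE.get(ord(ch), ch) for ch in domain_name)
    # Step 3: normalize to NFC (equivalent to NFKC, given the table's construction).
    normalized = unicodedata.normalize("NFC", mapped)
    # Steps 4-5: split on U+002E FULL STOP and verify each label.
    for label in normalized.split("."):
        if not is_valid_label(label):
            raise ValueError(f"label {label!r} does not comply with IDNAbis")
    # Step 6: return the preprocessed string.
    return normalized

print(preprocess("B\u00DCcher\u3002de"))  # bücher.de
```

A failing lookup (step 5) surfaces as an exception here, corresponding to "abort with error" in the text.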
Note also that some browsers allow characters like "_" in domain names. Any such treatment is outside of the scope of this document.
3. IDNA Preprocessing Table
This mapping table provides a combined case folding and NFKC normalization, with some small modifications for IDNA2003 compatibility. The values in the table will remain stable for all future versions of
Unicode; that is, no mappings will be changed, and any new mappings
will only be added for newly assigned characters. There are more details in each section below. Each version of Unicode would contain an updated version of this table: implementations never need to run the algorithm for generating the table themselves; they can simply pick up the data and use it in the preprocessing algorithm.
Because of the way the IDNA Preprocessing Table is constructed, it is sufficient to apply toNFC(output) to ensure isNFKC(output): the extra changes that are in NFKC but not in NFC are already built into the table. It is also necessary to apply at least toNFC(output), since otherwise the text may contain unordered combining marks and/or uncomposed character sequences.
The IDNA Preprocessing Table consists of a set of mappings from single code points to a sequence of zero or more other code points, also referred to as a 'table'. All code points that are not specifically entered into the table are
mapped to themselves.
To use the table to map a string, walk through the string, one code point at a time. If there is a mapping entry for that code point, replace that code point with the result of the mapping entry. Otherwise retain the code point as is.
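This per-code-point walk can be sketched as follows (the two table entries are illustrative only):

```python
def map_string(s: str, table: dict) -> str:
    # Replace each code point that has a table entry; all other
    # code points are retained as-is (they map to themselves).
    return "".join(table.get(ord(ch), ch) for ch in s)

# Illustrative entries: 'Ü' maps to 'ü'; SOFT HYPHEN maps to the empty sequence.
table = {0x00DC: "\u00FC", 0x00AD: ""}
print(map_string("\u00DC\u00ADx", table))  # üx
```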
The IDNA Preprocessing Table is constructed as specified in this section, for each version of Unicode. Post Unicode 5.0, case folding and normalization are always backwards compatible. The only issue for any new release of Unicode is whether any newly assigned characters need to be added to the exception table.
Informally, the table construction is done by mapping each Unicode character by applying casefolding and then normalization to Unicode Normalization Form KC (NFKC). There are some exceptional mappings that provide for compatibility with IDNA2003 and allow for special handling of future assigned characters; those are listed in Section 3.3. Note that unassigned (reserved) code points never need an entry in the IDNA Preprocessing Table; their presence in the input will cause an error in the preprocessing anyway.
Formally, the construction of the IDNA Preprocessing Table is specified as:
For each code point X:
- If X is in the IDNA Preprocessing Exceptions, add the mapping found in that table
- Else add a mapping from X to toNFKC(toCaseFold(toNFKC(X)))
- toCaseFold and isCaseFolded are defined in the Unicode
Standard 5.0, Section 3.13 Default Case Algorithms, page 125, rule R4
and definition D127
- toNFKC and isNFKC are defined in Unicode Standard 5.0, UAX#15, Section X2 Notation, page 1339.
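Under the definitions above, generating the table can be sketched as follows. Python's str.casefold() implements Unicode full case folding and unicodedata.normalize("NFKC", ...) implements toNFKC; the EXCEPTIONS dict is a small stand-in for the full Section 3.3 list, not the normative data.

```python
import sys
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

# Stand-in for the Section 3.3 exception entries (the real list is longer).
EXCEPTIONS = {
    0x00AD: "",        # SOFT HYPHEN: removed (3.3.1)
    0x3002: ".",       # IDEOGRAPHIC FULL STOP: remapped (3.3.2)
    0x04C0: "\u04C0",  # PALOCHKA: mapping suppressed (3.3.3)
}

def build_table() -> dict:
    table = {}
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not characters
        if cp in EXCEPTIONS:
            mapped = EXCEPTIONS[cp]
        else:
            mapped = nfkc(nfkc(chr(cp)).casefold())  # toNFKC(toCaseFold(toNFKC(X)))
        if mapped != chr(cp):
            table[cp] = mapped  # identity mappings need not be stored
    return table

table = build_table()
print(table[0x0041], table[0x00AD] == "", 0x0061 in table)  # a True False
```

Note that the result depends on the Unicode version of the runtime's character database, whereas the normative table would be frozen per Unicode version as described above.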
3.3 Exception Table
The following is an exhaustive list of the items in the Exception Table. The notation [:xxx:] means a Unicode property value. A mapping is
expressed as X => Y, where X is a single code point, and Y is a
sequence of zero or more other code points.
3.3.1. Removed (X => "")
These are specific mappings as part of IDNA2003.
U+00AD ( ) SOFT HYPHEN
U+034F ( ) COMBINING GRAPHEME JOINER
U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN
U+200B ( ) ZERO WIDTH SPACE
U+2060 ( ) WORD JOINER
U+FEFF ( ) ZERO WIDTH NO-BREAK SPACE
and Variation Selectors
In UnicodeSet notation: [\u00AD\u034F\u1806\u200B-\u200D\u2060\uFEFF [:variation_selector:]]
Note: the following characters were mapped to nothing (deleted) in IDNA2003. They are allowed in IDNAbis in limited contexts and otherwise deleted.
( ) ZERO WIDTH NON-JOINER
( ) ZERO WIDTH JOINER
In UnicodeSet notation: [\u200C \u200D]
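A sketch of applying the removals with a regular expression. The variation-selector ranges (U+180B..180D, U+FE00..FE0F, U+E0100..E01EF) are written out explicitly, since Python's re module has no [:variation_selector:] class; ZWNJ and ZWJ are removed unconditionally here, ignoring the limited contexts in which IDNAbis allows them.

```python
import re

# The removal set, including MONGOLIAN TODO SOFT HYPHEN and the
# variation selectors, as a character class.
REMOVALS = re.compile(
    "[\u00AD\u034F\u1806\u200B-\u200D\u2060\uFEFF"
    "\u180B-\u180D\uFE00-\uFE0F\U000E0100-\U000E01EF]"
)

def remove_ignorables(s: str) -> str:
    # Map each matched code point to the empty sequence (X => "").
    return REMOVALS.sub("", s)

print(remove_ignorables("Sch\u00AD\u200Daffer"))  # Schaffer
```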
3.3.2. Remapped Full Stops (X => Y)
These are specific mappings as part of IDNA2003, having to do with label separators.
Map U+3002 ( 。 ) IDEOGRAPHIC FULL STOP (and anything mapped to it by toNFKC) to U+002E ( . ) FULL STOP. That is:

U+3002 ( 。 ) IDEOGRAPHIC FULL STOP => U+002E ( . ) FULL STOP
U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP => U+002E ( . ) FULL STOP

Note: like IDNA2003, this set is quite limited. We map only those characters that are treated as full stops in CJK character sets. This does not include all characters that function like full stops, nor do we map characters that merely look like full stops. Note that because the preprocessing is applied to the entire domain_name string, in some cases a dot may result from the decomposition of a character like U+2488 ( ⒈ ) DIGIT ONE FULL STOP.
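The difference can be checked directly: NFKC already decomposes U+2488 into "1" plus a FULL STOP, while the CJK full stops have no such decomposition and therefore need the exception entries above.

```python
import unicodedata

# U+2488 DIGIT ONE FULL STOP compatibility-decomposes to "1."
print(unicodedata.normalize("NFKC", "\u2488"))              # 1.
# U+3002 IDEOGRAPHIC FULL STOP is unchanged by NFKC; only the
# exception entry in this section maps it to U+002E FULL STOP.
print(unicodedata.normalize("NFKC", "\u3002") == "\u3002")  # True
# U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP NFKC-maps to U+3002,
# which is why it is covered by "anything mapped to it by toNFKC".
print(unicodedata.normalize("NFKC", "\uFF61") == "\u3002")  # True
```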
3.3.3. Mapping Suppressed (X => X)
These are characters that did not have lowercase counterparts in Unicode 3.2, but had lowercase characters added later. Unicode has since stabilized case folding, so this will not happen in the future. That is, case pairs will be assigned in the same version of Unicode -- so any newly assigned character will either have a casefolding in that version of Unicode, or it will never have a casefolding in the future.

U+04C0 ( Ӏ ) CYRILLIC LETTER PALOCHKA
U+10A0 ( Ⴀ ) GEORGIAN CAPITAL LETTER AN
U+10C5 ( Ⴥ ) GEORGIAN CAPITAL LETTER HOE
U+2132 ( Ⅎ ) TURNED CAPITAL F
U+2183 ( Ↄ ) ROMAN NUMERAL REVERSED ONE HUNDRED
3.3.4. Retained Corrigendum #4: Five Unihan Canonical Mapping Errors (X => Y)
These are characters whose normalizations changed after Unicode 3.2 (all of them were in Unicode 4.0.0). While the set of characters that are normalized to different values has been stable in Unicode, the results have not been. As of Unicode 5.1, normalization is completely stabilized, so these would be the first and last such characters.
U+2F868 CJK COMPATIBILITY IDEOGRAPH-2F868 => U+2136A CJK UNIFIED IDEOGRAPH-2136A
U+2F874 CJK COMPATIBILITY IDEOGRAPH-2F874 => U+5F33 CJK UNIFIED IDEOGRAPH-5F33
U+2F91F CJK COMPATIBILITY IDEOGRAPH-2F91F => U+43AB CJK UNIFIED IDEOGRAPH-43AB
U+2F95F CJK COMPATIBILITY IDEOGRAPH-2F95F => U+7AAE CJK UNIFIED IDEOGRAPH-7AAE
U+2F9BF CJK COMPATIBILITY IDEOGRAPH-2F9BF => U+4D57 CJK UNIFIED IDEOGRAPH-4D57
There was also an algorithmic correction to normalization, but it is so extremely unlikely to affect any strings in practice that we do not think it is worth capturing. It is an open issue whether even the above five mappings are worth special-casing.
4. UCD Representation
The IDNA Preprocessing Table is represented in the Unicode Character Database via two properties.
- A contributory property, Other_Idna_Mapping, which contains the exceptional values.
- A derived property, Idna_Mapping (IM), which contains the full mapping table.
5. References (informal, at this point)

30 Jul 2008 draft-ietf-idnabis-bidi
28 Jul 2008 draft-ietf-idnabis-protocol
14 Jul 2008 draft-ietf-idnabis-rationale
14 Jul 2008 draft-ietf-idnabis-tables
Thanks to Harald, Erik, and Patrik for many useful comments.