|Version|1 (draft 1)|
|Authors|Mark Davis (email@example.com), Michel Suignard|
This document provides a specification for an internationalized domain name preprocessing step that is intended for use with IDNAbis, the projected update to Internationalized Domain Names. The proposed specification is compatible with IDNA2003 (the current version of Internationalized Domain Names), and consistently extends that mechanism to characters introduced in any later Unicode version.
At this point, IDNAbis is still in development, so this draft is based on the current draft of IDNAbis, and may change substantially as that draft changes.
This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
Because of the transformation step in IDNA2003, not only can we type
"Bücher.de", but we can also type any of "http://Bücher.de", "HTTP://BÜCHER.DE",
or "HTTP://BU¨CHER.DE" (where the ¨ represents a
U+0308 ( ̈ ) COMBINING DIAERESIS), or many other variations.
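The effect of the IDNA2003 mapping step on these variants can be illustrated with a short sketch (an approximation using Python's `casefold()` and NFKC normalization in place of the exact Nameprep tables):

```python
import unicodedata

def idna2003_map(label: str) -> str:
    # Approximation of the IDNA2003 (Nameprep) mapping step:
    # case folding followed by compatibility normalization (NFKC).
    return unicodedata.normalize("NFKC", label.casefold())

# "Bücher", "BÜCHER", and "BÜCHER" written with U+0308 COMBINING DIAERESIS
# all map to the same form, so all three variants resolve alike.
variants = ["B\u00FCcher", "B\u00DCCHER", "BU\u0308CHER"]
print({idna2003_map(v) for v in variants})  # -> {'bücher'}
```

This is why any of the typed or linked forms above reach the same domain under IDNA2003.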
There is a projected update of IDNA2003 which is called either IDNAbis or IDNA2008 (the latter by those who think it could be adopted this year). There are quite a number of changes between these two versions: the one relevant to this document is that IDNAbis no longer has a casefolding and normalizing step. Instead, IDNAbis disallows any string that is not already case-folded and normalized. This means that strict adherence to IDNAbis, without any other action, would cause any of the above variant strings to fail. Thus typing "http://Bücher.de" would fail. (Domain names containing all ASCII characters, such as "Bucher.de", would continue to work even with case variations.)
However, when using the IDNAbis specification, many user agents such as browsers will have a requirement to interoperate compatibly with the prior IDNA2003 specification and/or operate in an environment that needs to allow lenient typing or parsing of internationalized domain names. In particular, there are many cases where an internationalized domain name is not formally allowed according to the relevant specifications, but there is widespread de facto use. For example, here is a chart showing the behavior of some major browsers with links containing IDNs.
The ü here is the decomposed form:
|4|<a href="http://B%C3%BCcher.de">|works|doesn't|%C3%BC is the UTF-8 version; this is being implemented in upcoming versions of browsers...|
The dot is a character other than U+002E ( . ) FULL STOP.
The "1." are actually the single
Note that #6 is not formally provided for in IDNA2003, because the transformation is handled there on a label-by-label basis. However, this form is commonly supported (probably because it is just simpler to apply the transformation to the whole host name rather than to each label individually).
Firefox and IE7 both accept all of these forms (except #4, which is coming), and interpret them as equivalent. Because of that, and because of the substantial number of existing (and future) web pages that contain these formats, implementations will have little choice but to support, into the indefinite future, a preprocessing step that allows for all of these forms. Note that this is not simply a typing or UI issue; these forms occur in existing HTML pages. They are also parsed out of plain text; for example, most email clients scan text for URLs (or IRIs, the internationalized version) and add links to them. So an occurrence of "http://Bücher.de" in the middle of plain text will often be transformed as well.
IDNAbis (current draft) allows for preprocessing (called local mapping), and even allows these mappings to differ according to locale or application program. But we certainly don't want different programs (browsers, email clients, etc.) to map these characters differently; that would cause a huge interoperability problem. Look, for example, at what could happen to the strings in the following table under locale-specific mappings.
|1|http://schaffer.de|legal as is|
|2|http://schaeffer.de|legal as is|
|3|http://schäffer.de|legal as is|
|4|http://Schaffer.de|always legal, matches #1|
|5|http://Schäffer.de|Could fail, or map to #1 for English, or #2 for German, etc.|
|6|http://Schæffer.de|Could map to #4 for English, or other languages without æ|
|7|http://Schäﬀer.de|Could fail, or map. (The "ﬀ" is the single character U+FB00 ( ﬀ ) LATIN SMALL LIGATURE FF.)|
|8|http://Schäf➀fer.de|Could fail, or map. (The ➀ here represents a normally invisible character.)|
|9|http://➀➁Schäffer➂.de|Could fail, or map to #1 or #2. (The ➀, ➁, and ➂ here represent normally invisible characters.)|
An IDNA2008-conformant implementation could remap any of the items #4 to #9 in the Link Text column using a local mapping, or not, in which case they would fail. It could remove the illegal characters in #8 and #9, or not remove them and have the lookup fail. It could map the ligature ﬀ to ff, or not. It could even decide, for example, based on locale linguistic mappings (using the UI language of the client, or the language of the email, or the default system language), to map #5 and #6 to valid domain names different from what IDNA2003 produces. That means that on the same page, a browser might go to different places depending on what the user's language was.
With IDNA2003, in contrast, the mappings for all of these are completely determinate (with all but the first being allowed, and the last being disallowed). Instead of a free-for-all of local mappings, what we need is a common mapping that maintains compatibility with IDNA2003, accommodates the newer Unicode characters, provides stability over time, and thus allows for interoperability among all programs that use it. In discussion with IETF people at the Ireland meeting, there appears to be general consensus that the specification of such a mapping does not belong in the IETF.
Note that lower-level protocols, such as the SMTP envelope, should require the strict use of already-transformed IDNs, and thus not use the preprocessing specified here. Language-specific modifications to the preprocessing specified in this document are outside the scope of this document; they are, however, very strongly discouraged because of the problems they pose for interoperability.
The input to the preprocessing is a domain_name string, which is a sequence of labels with dot separators, such as "Bücher.de". (For more about the parts of a URL, including the domain name, see http://tools.ietf.org/html/rfc3987.) The preprocessing consists of the following steps, performed in order:

1. Map each code point in the domain_name string according to the IDNA Preprocessing Table (Section 3).
2. Normalize the result to Unicode Normalization Form C (NFC).
3. Break the resulting string into labels, using U+002E ( . ) FULL STOP as the label delimiter.
Note that the processing matches what is commonly done with label delimiters (e.g. by browsers), whereby characters whose compatibility decompositions contain periods are NFKC'd before labels are separated. These characters can be seen with the Unicode utilities using a regular expression:
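The set matched by such a regular expression can be approximated by scanning for characters whose NFKC form contains a FULL STOP (a sketch using Python's `unicodedata`, limited to the Basic Multilingual Plane for brevity):

```python
import unicodedata

# Non-ASCII BMP characters whose NFKC form contains U+002E FULL STOP.
# These are the characters that can introduce a label separator when the
# whole domain_name string is normalized before labels are split.
# Surrogate code points are skipped, since they are not characters.
dotted = [cp for cp in range(0x80, 0x10000)
          if not (0xD800 <= cp <= 0xDFFF)
          and "." in unicodedata.normalize("NFKC", chr(cp))]

print(len(dotted), [f"U+{cp:04X}" for cp in dotted[:3]])
```

For example, U+2488 ( ⒈ ) DIGIT ONE FULL STOP and U+FF0E ( ． ) FULLWIDTH FULL STOP are both found this way; U+3002 ( 。 ) IDEOGRAPHIC FULL STOP is not, because it is handled by the explicit remapping in Section 3.3.2 rather than by normalization.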
Note also that some browsers allow characters like "_" in domain names. Any such treatment is outside of the scope of this document.
Each version of Unicode will contain an updated version of this table. Implementations never need to run the table-generation algorithm themselves; they can simply pick up the data and use it in the Preprocessing algorithm.
Note that, because of the way the IDNA Preprocessing Table is constructed, performing toNFC(output) is sufficient to ensure isNFKC(output): the changes that are in NFKC but not in NFC are already built into the table. It is also necessary to perform at least toNFC(output), since otherwise the text may contain unordered combining marks and/or uncomposed character sequences.
To use the table to map a string, walk through the string, one code point at a time. If there is a mapping entry for that code point, replace that code point with the result of the mapping entry. Otherwise retain the code point as is.
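For illustration, here is a sketch of this lookup in Python; the handful of table entries shown are hypothetical stand-ins for the full IDNA Preprocessing Table:

```python
import unicodedata

# A tiny, hypothetical excerpt of the IDNA Preprocessing Table, keyed by
# code point; a real implementation would load the complete data.
PREPROCESSING_TABLE = {
    0x0041: "a",       # 'A' -> 'a' (casefolding)
    0x00DF: "ss",      # LATIN SMALL LETTER SHARP S -> "ss" (IDNA2003)
    0x00AD: "",        # SOFT HYPHEN -> removed (maps to the empty string)
    0x3002: ".",       # IDEOGRAPHIC FULL STOP -> FULL STOP (label separator)
}

def apply_table(domain_name: str) -> str:
    """Walk the string one code point at a time, applying the table,
    then normalize to NFC as the preprocessing requires."""
    out = []
    for ch in domain_name:
        entry = PREPROCESSING_TABLE.get(ord(ch))
        # If there is a mapping entry, use it; otherwise retain the code point.
        out.append(ch if entry is None else entry)
    return unicodedata.normalize("NFC", "".join(out))

print(apply_table("strA\u00DFe\u3002de"))  # -> strasse.de
```

Note that removal (Section 3.3.1) falls out naturally: a removed character is simply an entry that maps to the empty string.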
Informally, the table is constructed by mapping each Unicode character with casefolding followed by normalization to Unicode Normalization Form KD (NFKD). A few exceptional mappings provide compatibility with IDNA2003 and allow for special handling of characters assigned in the future; those are listed in Section 3.3. Note that unassigned (reserved) code points never need an entry in the IDNA Preprocessing Table: their presence will cause an error in the preprocessing anyway.
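The informal construction described above can be sketched as follows (this omits the exceptional mappings of Section 3.3, and uses Python's `casefold()` as an approximation of the exact IDNA2003 casefolding tables):

```python
import unicodedata

def table_entry(cp: int):
    """Candidate IDNA Preprocessing Table entry for one code point:
    casefold, then normalize to NFKD. Returns None when the character
    maps to itself, since such characters need no table entry."""
    ch = chr(cp)
    mapped = unicodedata.normalize("NFKD", ch.casefold())
    return mapped if mapped != ch else None

print(table_entry(0xFB00))  # ligature ﬀ -> 'ff'
print(table_entry(0x0041))  # 'A' -> 'a'
print(table_entry(0x0061))  # 'a' maps to itself -> None
```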
Formally, the construction of the IDNA Preprocessing Table is specified as:
3.3.1. Removed (X => "")
These are specific mappings that were part of IDNA2003, plus natural extensions of the relevant properties to characters assigned after Unicode 3.2.
U+00AD( ) SOFT HYPHEN
U+034F( ) COMBINING GRAPHEME JOINER
U+1806( ᠆ ) MONGOLIAN TODO SOFT HYPHEN
U+200B( ) ZERO WIDTH SPACE
U+2060( ) WORD JOINER
U+FEFF( ) ZERO WIDTH NO-BREAK SPACE
This set is stabilized. That is, characters will only be added to the set, never removed -- and the only characters that will be added are those that are newly assigned.
3.3.2. Remapped (X => Y)
The following are specific mappings that are part of IDNA2003, having to do with label separators.

Note: like IDNA2003, this set is quite limited. We map only those characters that are treated as full stops in CJK character sets. This does not include all characters that function like full stops, nor do we map characters that look like full stops but aren't. Note that because the preprocessing is done to the entire domain_name string, in some cases a dot may result from the decomposition of a character like U+2488 ( ⒈ ) DIGIT ONE FULL STOP.
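The DIGIT ONE FULL STOP case can be seen directly (a sketch using Python's `unicodedata`):

```python
import unicodedata

# U+2488 ( ⒈ ) DIGIT ONE FULL STOP has the compatibility decomposition
# "1" + ".", so normalizing the entire domain_name string can introduce
# a label separator that was not present before.
before = "a\u2488b.de"                   # two labels around one dot
after = unicodedata.normalize("NFKD", before)
print(after)                             # -> a1.b.de
print(after.split("."))                  # -> ['a1', 'b', 'de']
```

This is why the preprocessing operates on the whole domain_name string rather than label by label: splitting on dots before normalizing would miss this new boundary.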
The IDNA Preprocessing Table is represented in the Unicode Character Database via two properties.
U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER
U+04C0( Ӏ ) CYRILLIC LETTER PALOCHKA
U+10A0( Ⴀ ) GEORGIAN CAPITAL LETTER AN
U+10C5( Ⴥ ) GEORGIAN CAPITAL LETTER HOE
U+2132( Ⅎ ) TURNED CAPITAL F
U+2183( Ↄ ) ROMAN NUMERAL REVERSED ONE HUNDRED
These are characters whose normalizations changed after Unicode 3.2 (all of them were in Unicode 4.0.0). See Corrigendum #4: Five Unihan Canonical Mapping Errors. While the set of characters that are normalized to different values has been stable in Unicode, the results have not been. As of Unicode 5.1, normalization is completely stabilized, so these would be the first and last such characters.
[TBD Use full names, flesh out]. Thanks to Ken, Harald, Erik, and Patrik for many useful comments on previous drafts.
(informal, at this point)
The following summarizes modifications from the previous revisions of this document.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.