IDNAbis Preprocessing Draft
M. Davis, 2007-01-08
(live document at: http://docs.google.com/Doc?id=dfqr8rd5_51c3nrskcx)
TBD: boilerplate, wordsmithing, references, fleshing out for clarity,...
This document provides a rough draft for a specification of an internationalized domain name preprocessing step that
is fully compatible with IDNA2003, and extends consistently to
characters introduced in any later Unicode version.
- IDNA Preprocessing Table
The IDNA2003 specifications (the IDNA base specification [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep) use a preprocessing step, which performs Unicode normalization, lowercasing, and some other mappings. The IDNAbis specification does not provide such a preprocessing step, and only specifies what is "on the wire".
When using the IDNAbis specification, some user agents such as browsers may have a requirement to interoperate compatibly with the prior IDNA2003 specification and/or operate in an environment that needs to allow
lenient parsing of internationalized domain names. In the latter case, it may be that an internationalized domain name is not formally allowed according to the relevant specifications, but that there is widespread de facto
use. For example, here is a chart showing the behavior of some major browsers:
||But being implemented in browsers...
U+2488 ( ⒈ ) DIGIT ONE FULL STOP
Because Firefox and IE7 both accept these forms, and because of the substantial number of existing (and future) web pages that contain these formats, implementations will have no choice but to support a preprocessing step that allows for all of the forms, into the indefinite future. Note that this is not a UI issue; these are in an HTML page. The more of the web and net's infrastructure that accepts these variations, the more that other programs need to accommodate them, so that they interwork with one another. For that we need a uniform specification that allows implementations to get the same results as IDNA2003, and also accommodate the newer Unicode characters.
To promote interoperability among user agents, the specification for such preprocessing is provided in this document.
Open Issue: we may also want to have a recommended postprocessing step, to deal with final sigma.
Lower-level protocols, such as the SMTP envelope, should require the strict use of U-labels and thus not use the preprocessing specified here. Language-specific modifications to the preprocessing specified in this document are outside of the scope of this document; they are, however, discouraged because of the problems they pose for interoperability.
- Any characters legal in IDNAbis, if present in the input, are also present in the output (with the exception of characters that are composed via normalization).
- Where the user has a reasonable expectation that giving the character "X" as input will be treated as equivalent to "X'", and it's possible to determine this unambiguously, this is what should happen.
- Where the preprocessing results in an "abort with error", the input is not interpreted as convertible to a U-Label. In a user-interface, this should be indicated with a warning; in other processing (such as search-engine parsing of a web page), lookup would fail.
Three particular cases of compatibility with IDNA2003 are worth calling attention to.
U+00DF ( ß ) LATIN SMALL LETTER SHARP S -- IDNA2003 converts this to "ss", to allow for case-insensitivity between words like "STRASSE" and "straße".
U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA and
U+03A3 ( Σ ) GREEK CAPITAL LETTER SIGMA -- IDNA2003 converts both of these to
U+03C3 ( σ ) GREEK SMALL LETTER SIGMA, to allow for case-insensitivity between words like "χρήσης" and "ΧΡΉΣΗΣ" (without context-dependent case conversion). Note that if IDNA2003 had allowed for context-dependent case mapping, then Σ could have mapped to ς if not followed by a letter, preserving the distinction between σ and ς in most cases. IDNAbis is still under development, and changes there may require alterations here.
U+0130 ( İ ) LATIN CAPITAL LETTER I WITH DOT ABOVE and
U+0131 ( ı ) LATIN SMALL LETTER DOTLESS I have special mappings. (Casefolding them as in Turkic languages would be incompatible with the regular case folding of
U+0049 ( I ) LATIN CAPITAL LETTER I and
U+0069 ( i ) LATIN SMALL LETTER I for all other Latin-based languages.)
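The three cases above can be observed with default Unicode case folding. The following is a minimal sketch, assuming Python's built-in str.casefold() (full, non-Turkic case folding from whatever Unicode version the interpreter ships) as a stand-in for the IDNA2003 mappings. The helper name fold is hypothetical and not part of any specification.

```python
def fold(s: str) -> str:
    """Apply default (non-Turkic) Unicode full case folding."""
    return s.casefold()


# Sharp s folds to "ss", so both spellings compare equal.
print(fold("stra\u00dfe") == fold("STRASSE"))

# Final sigma (U+03C2) and capital sigma (U+03A3) both fold to
# small sigma (U+03C3), without any context-dependent rule.
print(fold("\u03c2") == fold("\u03a3") == "\u03c3")

# Default folding keeps I/i paired. Dotless i (U+0131) is left alone,
# and capital I with dot above (U+0130) folds to "i" + U+0307.
print(fold("I") == "i", fold("\u0131") == "\u0131", fold("\u0130") == "i\u0307")
```

Each print shows whether the stated equivalence holds under the interpreter's own case folding data.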
To try out IDNA2003, see http://demo.icu-project.org/icu-bin/idnbrowser
The preprocessing consists of the following steps, performed in order. The input is a string that is intended to be interpreted as containing an IDN.
Parse the input to get the host_name string.
Convert the host_name string to Unicode.
- Abort with error if there is any conversion problem.
- Abort with error if any are malformed (such as "\u123G").
- Note: this is only relevant for cases such as an IRI, in contexts such as HTML.
Convert any %-escapes in the host_name string according to IRI (eg, %2e becomes U+002E ( . ) FULL STOP).
- Abort with error if malformed (eg, "%2" or the bytes are not allowed in UTF-8).
- Note: this is only relevant for cases such as an IRI, in contexts such as HTML.
Map the host_name string according to the IDNA Preprocessing Table (see below).
Normalize the host_name to Unicode Normalization Form C:
- host_name = toNFC(host_name)
- Note: because of the construction of the table, characters are limited to those in NFKC, so this is equivalent to toNFKC().
Parse the host_name string into labels, using U+002E ( . ) FULL STOP as the label delimiter.
Each label that contains only characters [\-a-zA-Z0-9] is an ASCII label. Each other label is processed according to the IDNAbis specification to convert to ASCII. That is:
Convert the label to ASCII according to the Punycode specification.
Verify that the label complies with IDNAbis.
- Abort with error if invalid
- Note: ToASCII is only needed when preparing the IDN for an IDNA-ignorant slot (such as the DNS protocol). In other cases (such as the EAI UTF8SMTP), it's not needed.
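The core of these steps can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a conforming implementation: the input is assumed to be an already-extracted host_name in Unicode, the table argument is a hypothetical dict from code point to replacement string, the names preprocess, to_ascii, and PreprocessingError are invented for this sketch, and the IDNAbis validity check is deliberately left out. The "xn--" prefix is the ACE prefix from RFC 3490.

```python
import re
import unicodedata
from typing import Dict, List
from urllib.parse import unquote

# A "%" that is not followed by two hex digits is a malformed escape.
MALFORMED_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")
# Labels made only of these characters are treated as ASCII labels.
ASCII_LABEL = re.compile(r"[\-a-zA-Z0-9]*")


class PreprocessingError(ValueError):
    """Raised wherever the steps say "abort with error"."""


def preprocess(host_name: str, table: Dict[int, str]) -> List[str]:
    """Return the labels of host_name after %-unescaping, mapping, and NFC."""
    if MALFORMED_ESCAPE.search(host_name):
        raise PreprocessingError("malformed %-escape")
    try:
        host_name = unquote(host_name, encoding="utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise PreprocessingError("escaped bytes are not valid UTF-8") from exc
    # Map each code point through the table; unmapped code points are kept.
    host_name = "".join(table.get(ord(ch), ch) for ch in host_name)
    host_name = unicodedata.normalize("NFC", host_name)
    # Split into labels on U+002E FULL STOP.
    return host_name.split("\u002E")


def to_ascii(label: str) -> str:
    """Punycode-encode a non-ASCII label; IDNAbis validation is NOT done here."""
    if ASCII_LABEL.fullmatch(label):
        return label
    return "xn--" + label.encode("punycode").decode("ascii")
```

For example, preprocess("B%C3%BCcher%2eexample", {ord("B"): "b"}) yields the labels "bücher" and "example", and to_ascii is only needed when the result is destined for an IDNA-ignorant slot.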
Note that the processing matches what is commonly done with label delimiters, whereby characters containing periods are NFKC'd before
labels are separated. These characters can be seen with the Unicode utilities using a regular expression:
2. IDNA Preprocessing Table
This table provides a combined case folding and NFKC normalization, with some small modifications for IDNA2003 compatibility. This table will remain stable for all future versions of
Unicode; that is, no mappings will be changed, and any new mappings
will only be added for new assigned characters. There are more details
in each section below.
Note that, because of the way the IDNA Preprocessing Table is constructed, applying toNFC(output) is sufficient to ensure isNFKC(output). That is, the extra changes that are in NFKC but not in NFC are already in the table. It is also necessary to do at least toNFC(output), since otherwise the text may have unordered combining marks and/or uncomposed characters.
2.1 IDNA Preprocessing Table Usage
The IDNA Preprocessing Table, once constructed, consists of a set of mappings. Each mapping entry has a single code point as a source, and maps that code point to a result sequence of zero or more other code points.
To use the table to map a string, walk through the string, one code point at a time. If there is a mapping entry for that code point, replace that code point by the result of the mapping entry. Otherwise retain the code point as is.
Most implementations will never need to know the algorithm for generating the tables -- they can just pick up generated tables and use them in the Preprocessing algorithm.
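A minimal sketch of this table lookup, assuming the table is held as a Python dict from source code point to result string (the name map_string and the three sample entries are illustrative only, not a generated table):

```python
from typing import Dict

# Illustrative entries only: one expansion, one removal, one replacement.
SAMPLE_TABLE: Dict[int, str] = {
    0x00DF: "ss",   # LATIN SMALL LETTER SHARP S
    0x00AD: "",     # SOFT HYPHEN maps to the empty sequence
    0x3002: ".",    # IDEOGRAPHIC FULL STOP
}


def map_string(text: str, table: Dict[int, str]) -> str:
    """Replace each code point that has a mapping entry; keep all others."""
    return "".join(table.get(ord(ch), ch) for ch in text)
```

Iterating over a Python str visits one code point at a time, so supplementary characters need no special handling here.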
2.2 IDNA Preprocessing Table Construction
The IDNA Preprocessing Table is constructed as specified in this section.
Initially, the table is constructed based on Unicode 5.1. But a table for any version of Unicode subsequent to Unicode 5.1 can be constructed with exactly the same rules.
Informally, the table construction is done by mapping each Unicode character by applying casefolding and then normalization to Unicode Normalization Form KC (NFKC). However, there are some exceptional mappings and exclusions required for compatibility with IDNA2003. The exceptional mappings constitute a small list of characters that map to nothing in IDNA2003, plus full stops and a few normalization corrections requiring special handling. Those are listed completely in the IDNA Preprocessing Exceptions section.
The exclusions constitute another small list of characters which map to themselves under IDNA2003 rules, but which do not map to themselves if casefolded and normalized by the Unicode 5.1 specification. These are listed completely in the IDNA Preprocessing Exclusions section.
Note that unassigned (reserved) code points never get an entry in the IDNA Preprocessing Table.
Formally, the construction of the IDNA Preprocessing Table is specified as:
For each code point X:
- Exceptions. If X is in the IDNA Preprocessing Exceptions, use the mapping in that table, and continue with the next code point.
- Exclusions. If X is in the IDNA Preprocessing Exclusions, add no mapping, and continue with the next code point.
- Normalization and Casefolding.
  - Z := X
  - repeat
    a. Y := Z
    b. Z := toNFKC(toCaseFold(Y))
    until (Y == Z) // the maximum iterations required are two
  - If X != Y
    then add the mapping X => Y
    else continue to the next code point without adding a mapping for X
- toCaseFold and isCaseFolded are defined in the Unicode Standard 5.0, Section 3.13 Default Case Algorithm, page 124, rule R4 and definition D12.
- toNFKC and isNFKC are defined in Unicode Standard 5.0, UAX#15, Section X2 Notation, page 1339.
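The construction loop can be sketched in Python as below. This is an approximation under several assumptions: str.casefold() and unicodedata.normalize() stand in for toCaseFold and toNFKC, they use the Unicode version bundled with the interpreter rather than Unicode 5.1, and the exceptions and exclusions arguments are hypothetical containers supplied by the caller. Surrogate code points are skipped as a practical choice of this sketch.

```python
import unicodedata
from typing import Dict, Iterable, Set


def build_table(code_points: Iterable[int],
                exceptions: Dict[int, str],
                exclusions: Set[int]) -> Dict[int, str]:
    """Build mapping entries for the given code points."""
    table: Dict[int, str] = {}
    for cp in code_points:
        x = chr(cp)
        # Unassigned code points never get an entry (surrogates skipped too).
        if unicodedata.category(x) in ("Cn", "Cs"):
            continue
        if cp in exceptions:
            table[cp] = exceptions[cp]
            continue
        if cp in exclusions:
            continue  # excluded characters map to themselves: no entry
        # Repeat casefold + NFKC until the result stops changing.
        z = x
        while True:
            y = z
            z = unicodedata.normalize("NFKC", y.casefold())
            if y == z:
                break
        if x != y:
            table[cp] = y
    return table
```

Calling build_table(range(0x110000), exceptions, exclusions) would cover the whole code space; a smaller range is enough to inspect individual entries.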
2.3 IDNA Preprocessing Exclusions
Exclude the following characters from mapping (in particular, casefolding), for compatibility with IDNA2003. That is, these characters will not be changed by the preprocessing.
U+04C0 ( Ӏ ) CYRILLIC LETTER PALOCHKA
U+10A0 ( Ⴀ ) GEORGIAN CAPITAL LETTER AN
U+10C5 ( Ⴥ ) GEORGIAN CAPITAL LETTER HOE
U+2132 ( Ⅎ ) TURNED CAPITAL F
U+2183 ( Ↄ ) ROMAN NUMERAL REVERSED ONE HUNDRED
These are characters that didn't have lowercases in Unicode 3.2, but
had lowercase characters added later. Unicode has since stabilized case
folding, so that this won't happen in the future. That is, case pairs
will be assigned in the same version of Unicode -- so any newly
assigned character will either have a casefolding in that version of
Unicode, or it will never have a casefolding in the future.
Open Issue: If we want a "cleaner" preprocessing of these characters, and are willing to break compatibility with IDNA2003 for them, we can remove them from this list, thus allowing them to be casefolded.
2.3 IDNA Preprocessing Exceptions
For compatibility with IDNA2003, include the following mappings. The notation [:xxx:] means a Unicode property value. A mapping is expressed as X => Y, where X is a single code point, and Y is a sequence of zero or more other code points.
2.3.1. Remove (map to an empty sequence) the following characters
These are specific mappings as part of IDNA2003.
( ) SOFT HYPHEN
( ) COMBINING GRAPHEME JOINER
( ᠆ ) MONGOLIAN TODO SOFT HYPHEN
( ) ZERO WIDTH SPACE
( ) WORD JOINER
( ) ZERO WIDTH NO-BREAK SPACE
and Variation Selectors
In UnicodeSet notation: [\u034F\u200B-\u200D\u2060\uFEFF\u00AD [:variation_selector:]]
Note: the following characters were ignored in IDNA2003. They are allowed in IDNAbis in limited contexts and otherwise ignored.
( ) ZERO WIDTH NON-JOINER
( ) ZERO WIDTH JOINER
In UnicodeSet notation: [\u200C \u200D]
2.3.2. Full Stops
These are specific mappings as part of IDNA2003, having to do with label separators.
Map U+3002 ( 。 ) IDEOGRAPHIC FULL STOP (and anything mapped to it by toNFKC) to U+002E ( . ) FULL STOP. That is:
U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
=> U+002E ( . ) FULL STOP
U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP
=> U+002E ( . ) FULL STOP
Note: like IDNA2003, this set is quite limited. We are only mapping those characters that are treated as full-stops in CJK character sets. This does not include all characters that function like full stops, nor do we map characters that look like full stops but aren't.
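As an illustration of the full stop mappings, the sketch below applies just the two entries listed above and then splits on U+002E. The names FULL_STOP_EXCEPTIONS and split_labels are hypothetical, and a real implementation would apply the complete IDNA Preprocessing Table rather than this two-entry subset.

```python
import unicodedata
from typing import Dict, List

# The two full stop exceptions listed in this section.
FULL_STOP_EXCEPTIONS: Dict[int, str] = {
    0x3002: "\u002E",  # IDEOGRAPHIC FULL STOP
    0xFF61: "\u002E",  # HALFWIDTH IDEOGRAPHIC FULL STOP
}


def split_labels(host_name: str) -> List[str]:
    """Map the CJK full stops to U+002E, then split into labels."""
    mapped = "".join(FULL_STOP_EXCEPTIONS.get(ord(ch), ch) for ch in host_name)
    return mapped.split("\u002E")


# The halfwidth form is covered by "anything mapped to it by toNFKC":
# its NFKC normalization is U+3002 itself.
print(unicodedata.normalize("NFKC", "\uFF61") == "\u3002")
```

With these entries, a host name written with either ideographic full stop splits into the same labels as one written with U+002E.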
2.3.3. Retain Corrigendum #4: Five Unihan Canonical Mapping Errors
These are characters whose normalizations changed after Unicode 3.2 (all of them were in Unicode 4.0.0). While the set of characters that are normalized to different values has been stable in Unicode, the results have not been. We anticipate that as of Unicode 5.1, normalization will be completely stabilized, so these would be the first and last such characters.
U+2F868 ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F868
=> U+2136A ( ? ) CJK UNIFIED IDEOGRAPH-2136A
U+2F874 ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F874
=> U+5F33 ( ? ) CJK UNIFIED IDEOGRAPH-5F33
U+2F91F ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F91F
=> U+43AB ( ? ) CJK UNIFIED IDEOGRAPH-43AB
U+2F95F ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F95F
=> U+7AAE ( ? ) CJK UNIFIED IDEOGRAPH-7AAE
U+2F9BF ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F9BF
=> U+4D57 ( ? ) CJK UNIFIED IDEOGRAPH-4D57
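To see why these five entries must be listed as exceptions rather than derived, the sketch below compares each retained mapping with what a current normalization library produces. It assumes Python's unicodedata module, which applies the corrected (post-Corrigendum #4) data; the names RETAINED_MAPPINGS and differs_from_current are hypothetical.

```python
import unicodedata
from typing import Dict

# Source code point => retained (Unicode 3.2 era) target, as listed above.
RETAINED_MAPPINGS: Dict[int, int] = {
    0x2F868: 0x2136A,
    0x2F874: 0x5F33,
    0x2F91F: 0x43AB,
    0x2F95F: 0x7AAE,
    0x2F9BF: 0x4D57,
}


def differs_from_current(source: int) -> bool:
    """True if the library's NFKC result is not the retained target."""
    current = unicodedata.normalize("NFKC", chr(source))
    return current != chr(RETAINED_MAPPINGS[source])


for source, target in RETAINED_MAPPINGS.items():
    current = unicodedata.normalize("NFKC", chr(source))
    print(f"U+{source:05X}: retained U+{target:04X}, library U+{ord(current):04X}")
```

Because the library result differs for each of the five, a table built purely from current normalization data would not reproduce the IDNA2003 behavior for these characters.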
References (informal, at this point)
Thanks to Harald for many useful comments.