IDNAbis Preprocessing Draft

M. Davis, 2007-01-08
(live document at: http://docs.google.com/Doc?id=dfqr8rd5_51c3nrskcx)
TBD: boilerplate, wordsmithing, references, fleshing out for clarity,...

Summary

This document provides a rough draft for a specification of an internationalized domain name preprocessing step that is fully compatible with IDNA2003, and extends consistently to characters introduced in any later Unicode version.

Preprocessing
IDNA Preprocessing Table

The IDNA2003 specifications (the IDNA base specification [RFC3490], Nameprep [RFC3491], Punycode [RFC3492], and Stringprep [RFC3454]) use a preprocessing step, which performs Unicode normalization, lowercasing, and some other mappings. The IDNAbis specification does not provide such a preprocessing step, and only specifies what is "on the wire".

When using the IDNAbis specification, some user agents such as browsers may have a requirement to interoperate compatibly with the prior IDNA2003 specification and/or operate in an environment that needs to allow lenient parsing of internationalized domain names. In the latter case, it may be that an internationalized domain name is not formally allowed according to the relevant specifications, but that there is widespread de facto use. For example, here is a chart showing the behavior of some major browsers:

	Link	Firefox	IE7	IDNA2003 U-Labels	Comments
1	<a href="http://xn--bcher-kva.de">	works	works	yes
2	<a href="http://bücher.de">	works	works	no
3	<a href="http://Bücher.de">	works	works	no
4	<a href="http://B%C3%BCcher.de">	doesn't	doesn't	no	But being implemented in browsers...
5	<a href="http://Bücher⒈com">	works	works	no	Uses `U+2488` ( ⒈ ) DIGIT ONE FULL STOP

Because Firefox and IE7 both accept these forms, and because of the substantial number of existing (and future) web pages that contain these formats, implementations will have no choice but to support a preprocessing step that allows for all of the forms, into the indefinite future. Note that this is not a UI issue; these are in an HTML page. The more of the web and net's infrastructure that accepts these variations, the more that other programs need to accommodate them, so that they interwork with one another. For that we need a uniform specification that allows implementations to get the same results as IDNA2003, and also accommodate the newer Unicode characters.

To promote interoperability among user agents, the specification for such preprocessing is provided in this document.

Open Issue: we may also want to have a recommended postprocessing step, to deal with final sigma.

Lower-level protocols, such as the SMTP envelope, should require the strict use of U-labels and thus not use the preprocessing specified here. Language-specific modifications to the preprocessing specified in this document are outside of the scope of this document; they are, however, discouraged because of the problems they pose for interoperability.

Requirements

Any characters legal in IDNAbis, if present in the input, are also present in the output (with the exception of characters that are composed via normalization).
Where the user has a reasonable expectation that giving the character "X" as input will be treated as equivalent to "X'", and it's possible to determine this unambiguously, this is what should happen.
Where the preprocessing results in an "abort with error", the input is not interpreted as convertible to a U-Label. In a user-interface, this should be indicated with a warning; in other processing (such as search-engine parsing of a web page), lookup would fail.

Three particular cases of compatibility with IDNA2003 are worth calling attention to.

U+00DF ( ß ) LATIN SMALL LETTER SHARP S -- IDNA2003 converts this to "ss", to allow for case-insensitivity between words like "STRASSE" and "straße".
U+03C2 ( ς ) GREEK SMALL LETTER FINAL SIGMA and U+03A3 ( Σ ) GREEK CAPITAL LETTER SIGMA -- IDNA2003 converts both of these to U+03C3 ( σ ) GREEK SMALL LETTER SIGMA, to allow for case-insensitivity between words like "χρήσης" and "ΧΡΉΣΗΣ" (without context-dependent case conversion). Note that if IDNA2003 had allowed for context-dependent case mapping, then Σ could have mapped to ς if not followed by a letter, preserving the distinction between σ and ς in most cases. IDNAbis is still under development, and changes there may require alterations here.
U+0130 ( İ ) LATIN CAPITAL LETTER I WITH DOT ABOVE and U+0131 ( ı ) LATIN SMALL LETTER DOTLESS I have special mappings. (Casefolding them as in Turkic languages would be incompatible with the regular case folding of U+0049 ( I ) LATIN CAPITAL LETTER I and U+0069 ( i ) LATIN SMALL LETTER I for all other Latin-based languages.)
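As a rough illustration of these three cases, the sketch below uses Python's built-in str.casefold(), which applies Unicode default full case folding for whatever Unicode version the interpreter ships with. That is an assumption of convenience: it is not the IDNA2003 (Unicode 3.2) table itself, and the helper name same_after_folding is hypothetical.

```python
def same_after_folding(a: str, b: str) -> bool:
    """Compare two strings under Unicode default full case folding."""
    return a.casefold() == b.casefold()

# U+00DF folds to "ss", so "STRASSE" and "stra<U+00DF>e" compare equal.
sharp_s = same_after_folding("STRASSE", "stra\u00DFe")

# U+03C2 (final sigma) and U+03A3 (capital sigma) both fold to U+03C3.
greek_lower = "\u03C7\u03C1\u03AE\u03C3\u03B7\u03C2"
greek_upper = "\u03A7\u03A1\u0389\u03A3\u0397\u03A3"
sigma = same_after_folding(greek_lower, greek_upper)

# Default (non-Turkic) folding: "I" folds to "i", U+0131 is left alone,
# and U+0130 folds to "i" followed by U+0307 COMBINING DOT ABOVE.
dotless_unchanged = "\u0131".casefold() == "\u0131"
dotted_capital = "\u0130".casefold()
```

The comparison is context-free, which is why the distinction between the two lowercase sigmas is lost.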

To try out IDNA2003, see http://demo.icu-project.org/icu-bin/idnbrowser .

1. Preprocessing

The preprocessing consists of the following steps, performed in order. The input is a string that is intended to be interpreted as containing an IDN.

Parse the input to get the host_name string.

Abort with error if not found.
Note: this is only relevant for cases such as IRI.

Convert the host_name string to Unicode.

Abort with error if there is any conversion problem.

Convert any escapes in the host_name string to Unicode code points as necessary, depending on context (eg, HTML NCRs like &#x5341; or JavaScript escapes like \u5341).

Abort with error if any are malformed (such as "\u123G").
Note: this is only relevant for cases such as an IRI, in contexts such as HTML.

Convert any %-escapes in the host_name string according to IRI (eg, %2e becomes U+002E ( . ) FULL STOP)

Abort with error if malformed (eg, "%2" or the bytes are not allowed in UTF-8).
Note: this is only relevant for cases such as an IRI, in contexts such as HTML.

Map the host_name string according to the IDNA Preprocessing Table (see below).
Normalize the host_name to Unicode Normalization Form C:

host_name = toNFC(host_name)
Note: because of the construction of the table, characters are limited to those in NFKC, so this is equivalent to toNFKC().

Parse the host_name string into labels, using U+002E ( . ) FULL STOP as the label delimiter.
Each label that contains only characters [\-a-zA-Z0-9] is an ASCII label. Each other label is processed according to the IDNAbis specification to convert to ASCII. That is:

Verify that the label complies with IDNAbis.

Abort with error if not.

Convert the label to ASCII according to the Punycode specification.

label = ToASCII(label).

Abort with error if invalid
Note: ToASCII is only needed when preparing the IDN for an IDNA-ignorant slot (such as the DNS protocol). In other cases (such as the EAI UTF8SMTP), it's not needed.
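The later steps (from %-escape conversion onward) can be sketched roughly as follows. This is only an illustration under stated assumptions: stand_in_map is a hypothetical substitute for the real IDNA Preprocessing Table (it approximates it with Python's casefold and NFKC plus three sample entries), the IDNAbis validity check is omitted, rejecting empty labels is a choice made for this sketch, and Python's built-in "punycode" codec plus an "xn--" prefix stands in for ToASCII. Host parsing, charset conversion, and context-specific escapes such as NCRs are not shown.

```python
import re
import unicodedata
from urllib.parse import unquote

ASCII_LABEL = re.compile(r"[-a-zA-Z0-9]+")
BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def stand_in_map(ch: str) -> str:
    """Hypothetical stand-in for one IDNA Preprocessing Table lookup."""
    if ch in ("\u3002", "\uFF61"):   # ideographic full stops become "."
        return "."
    if ch == "\u00AD":               # SOFT HYPHEN maps to nothing
        return ""
    return unicodedata.normalize("NFKC", ch.casefold())

def preprocess(host_name: str) -> str:
    """Sketch of the %-escape, mapping, NFC, and per-label steps."""
    if BAD_PERCENT.search(host_name):
        raise ValueError("malformed %-escape")
    # errors="strict" raises UnicodeDecodeError (a ValueError) on bad UTF-8.
    host_name = unquote(host_name, encoding="utf-8", errors="strict")
    mapped = "".join(stand_in_map(ch) for ch in host_name)
    mapped = unicodedata.normalize("NFC", mapped)
    out = []
    for label in mapped.split("."):
        if ASCII_LABEL.fullmatch(label):
            out.append(label)            # ASCII label, passed through
        elif not label or label.isascii():
            raise ValueError(f"unusable label: {label!r}")
        else:
            # The IDNAbis validity check would go here; it is omitted.
            encoded = label.encode("punycode").decode("ascii")
            out.append("xn--" + encoded)
    return ".".join(out)
```

With these assumptions, the link forms from rows 1 through 4 of the browser chart all come out as the same ASCII host name.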

Note that the processing matches what is commonly done with label delimiters, whereby characters containing periods are NFKC'd before labels are separated. These characters can be seen with the Unicode utilities using a regular expression:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:toNFKC=/\./:]
Note that some of these characters would be effectively forbidden, because they would result in a sequence of two periods, and thus an empty label.
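For readers without access to the online utility, a similar list can be approximated locally. The sketch below assumes Python's unicodedata module, whose Unicode version depends on the interpreter and so may not match the version used for the table; the function names are hypothetical.

```python
import sys
import unicodedata

def nfkc_contains_period(ch: str) -> bool:
    """True if ch is not FULL STOP itself but its NFKC form contains one."""
    return ch != "." and "." in unicodedata.normalize("NFKC", ch)

def yields_empty_label(ch: str) -> bool:
    """True if the NFKC form of ch contains two adjacent periods."""
    return ".." in unicodedata.normalize("NFKC", ch)

def find_period_characters() -> list:
    """Scan all code points (skipping surrogates) for period producers."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        if nfkc_contains_period(ch):
            found.append(ch)
    return found
```

Characters such as U+2025 TWO DOT LEADER fall into the second group, since their compatibility decomposition is two consecutive full stops.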

2. IDNA Preprocessing Table

This table provides a combined case folding and NFKC normalization, with some small modifications for IDNA2003 compatibility. This table will remain stable for all future versions of Unicode; that is, no mappings will be changed, and new mappings will only be added for newly assigned characters. There are more details in each section below.

Note that because of the way the IDNA Preprocessing Table is constructed, applying toNFC(output) is sufficient to ensure isNFKC(output). That is, the extra changes that NFKC makes beyond NFC are already in the table. It is also necessary to do at least toNFC(output), since otherwise the text may have unordered combining marks and/or uncomposed characters.

2.1 IDNA Preprocessing Table Usage

The IDNA Preprocessing Table, once constructed, consists of a set of mappings. Each mapping entry has a single code point as a source, and maps that code point to a result sequence of zero or more other code points.

To use the table to map a string, walk through the string, one code point at a time. If there is a mapping entry for that code point, replace that code point by the result of the mapping entry. Otherwise retain the code point as is.
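A minimal sketch of this lookup, assuming the table is held as a Python dict from a single-character string to its replacement string. SAMPLE_TABLE is a hypothetical handful of entries chosen for illustration, not the generated table.

```python
# Hypothetical sample entries; the real table is generated separately.
SAMPLE_TABLE = {
    "S": "s",
    "D": "d",
    "E": "e",
    "\u00DF": "ss",   # LATIN SMALL LETTER SHARP S
    "\u00AD": "",     # SOFT HYPHEN maps to an empty sequence
    "\u3002": ".",    # IDEOGRAPHIC FULL STOP
}

def apply_table(text: str, table: dict) -> str:
    """Replace each code point that has an entry; keep the rest as is."""
    # Iterating over a Python str yields one code point at a time.
    return "".join(table.get(ch, ch) for ch in text)
```

Because each source is a single code point, no lookahead or context is needed; normalization to NFC is a separate, later step.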

Most implementations will never need to know the algorithm for generating the tables -- they can just pick up generated tables and use them in the Preprocessing algorithm.

2.2 IDNA Preprocessing Table Construction

The IDNA Preprocessing Table is constructed as specified in this section.

Initially, the table is constructed based on Unicode 5.1. But a table for any version of Unicode subsequent to Unicode 5.1 can be constructed with exactly the same rules.

Informally, the table construction is done by mapping each Unicode character by applying casefolding and then normalization to Unicode Normalization Form KC (NFKC). However, there are some exceptional mappings and exclusions required for compatibility with IDNA2003. The exceptional mappings constitute a small list of characters that map to nothing in IDNA2003, plus full stops and a few normalization corrections requiring special handling. Those are listed completely in the section IDNA Preprocessing Exceptions.

The exclusions constitute another small list of characters which map to themselves under IDNA2003 rules, but which do not map to themselves if casefolded and normalized by the Unicode 5.1 specification. These are listed completely in the section IDNA Preprocessing Exclusions.

Note that unassigned (reserved) code points never get an entry in the IDNA Preprocessing Table.

Formally, the construction of the IDNA Preprocessing Table is specified as:

For each code point X:

Exceptions. If X is in the IDNA Preprocessing Exceptions, use the mapping in that table, and continue with the next code point.
Exclusions. If X is in the IDNA Preprocessing Exclusions, add no mapping for X, and continue with the next code point.
Normalization and Casefolding.

Z := X
Do
   a. Y := Z
   b. Z := toNFKC(toCaseFold(Y))
until (Y == Z)                  // the maximum iterations required are two
If X != Y
then add the mapping X => Y
else continue to the next code point without adding a mapping for X

Note:

toCaseFold and isCaseFolded are defined in the Unicode Standard 5.0, Section 3.13 Default Case Algorithm, page 124, rule R4 and definition D12

(also http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G34078)

toNFKC and isNFKC are defined in Unicode Standard 5.0, UAX#15, Section X2 Notation, page 1339.

(also http://www.unicode.org/reports/tr15/#Notation)
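The construction rules above can be sketched in Python as follows. This is an approximation under stated assumptions: str.casefold() and unicodedata.normalize() stand in for toCaseFold and toNFKC, they follow the interpreter's Unicode version rather than Unicode 5.1, and the function name and parameter shapes (a dict of exceptions, a set of exclusions) are hypothetical.

```python
import unicodedata

def build_table(code_points, exceptions: dict, exclusions: set) -> dict:
    """Build a mapping table from the exception, exclusion, and fold rules."""
    table = {}
    for cp in code_points:
        x = chr(cp)
        if x in exceptions:              # Exceptions: take the listed mapping
            table[x] = exceptions[x]
            continue
        if x in exclusions:              # Exclusions: never mapped
            continue
        if unicodedata.category(x) == "Cn":
            continue                     # unassigned code points get no entry
        z = x
        while True:                      # repeat until a fixed point is reached
            y = z
            z = unicodedata.normalize("NFKC", y.casefold())
            if y == z:
                break
        if x != y:
            table[x] = y
    return table
```

U+2121 TELEPHONE SIGN shows why the loop is needed: NFKC first expands it to "TEL", and only a second pass casefolds that to "tel".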

2.3 IDNA Preprocessing Exclusions

Exclude the following characters from mapping (in particular, casefolding), for compatibility with IDNA2003. That is, these characters will not be changed by the preprocessing.

U+04C0 ( Ӏ ) CYRILLIC LETTER PALOCHKA
U+10A0 ( Ⴀ ) GEORGIAN CAPITAL LETTER AN
…{36}…U+10C5 ( Ⴥ ) GEORGIAN CAPITAL LETTER HOE
U+2132 ( Ⅎ ) TURNED CAPITAL F
U+2183 ( Ↄ ) ROMAN NUMERAL REVERSED ONE HUNDRED

These are characters that didn't have lowercases in Unicode 3.2, but had lowercase characters added later. Unicode has since stabilized case folding, so that this won't happen in the future. That is, case pairs will be assigned in the same version of Unicode -- so any newly assigned character will either have a casefolding in that version of Unicode, or it will never have a casefolding in the future.
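To see why the exclusion list is needed, the sketch below checks that each listed character is changed by case folding in a current Unicode version. It assumes Python's str.casefold() and assumes that the elided entries in the list above are the contiguous Georgian range from U+10A0 through U+10C5; the names are hypothetical.

```python
# Assumption: the elided list entries are the contiguous range U+10A0..U+10C5.
EXCLUSIONS = {0x04C0, 0x2132, 0x2183} | set(range(0x10A0, 0x10C6))

def changed_by_casefold(cp: int) -> bool:
    """True if current default case folding would alter this code point."""
    ch = chr(cp)
    return ch.casefold() != ch

# Every excluded character would be remapped if it were not excluded.
all_would_change = all(changed_by_casefold(cp) for cp in EXCLUSIONS)
```

For example, U+04C0 now folds to U+04CF, a lowercase form that did not exist in Unicode 3.2.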

Open Issue: If we want a "cleaner" preprocessing of these characters, and are willing to break compatibility with IDNA2003 for them, we can remove them from this list, thus allowing them to be casefolded.

2.3 IDNA Preprocessing Exceptions

For compatibility with IDNA2003, include the following mappings. The notation [:xxx:] means a Unicode property value. A mapping is expressed as X => Y, where X is a single code point, and Y is a sequence of zero or more other code points.

2.3.1. Remove (map to an empty sequence) the following characters

These are specific mappings as part of IDNA2003.

U+00AD ( ) SOFT HYPHEN
U+034F ( ) COMBINING GRAPHEME JOINER
U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN
U+200B ( ) ZERO WIDTH SPACE
U+2060 ( ) WORD JOINER
U+FEFF ( ) ZERO WIDTH NO-BREAK SPACE
and Variation Selectors

In UnicodeSet notation: [\u00AD\u034F\u1806\u200B-\u200D\u2060\uFEFF [:variation_selector:]]

Note: the following characters were ignored in IDNA2003. They are allowed in IDNAbis in limited contexts and otherwise ignored.

U+200C ( ) ZERO WIDTH NON-JOINER
U+200D ( ) ZERO WIDTH JOINER

In UnicodeSet notation: [\u200C \u200D]

2.3.2. Full Stops

These are specific mappings as part of IDNA2003, having to do with label separators.

Map U+3002 ( 。 ) IDEOGRAPHIC FULL STOP (and anything mapped to it by toNFKC) to U+002E ( . ) FULL STOP. That is:

U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
=> U+002E ( . ) FULL STOP

U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP
=> U+002E ( . ) FULL STOP

Note: like IDNA2003, this set is quite limited. We are only mapping those characters that are treated as full-stops in CJK character sets. This does not include all characters that function like full stops, nor do we map characters that look like full stops but aren't.
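The reason these two entries must be explicit exceptions can be seen in a short sketch. It assumes Python's unicodedata module, uses NFKC as a rough stand-in for the rest of the table, and uses the hypothetical names FULL_STOP_EXCEPTIONS and split_labels.

```python
import unicodedata

# The two explicit full-stop exceptions listed above.
FULL_STOP_EXCEPTIONS = {"\u3002": ".", "\uFF61": "."}

def split_labels(host_name: str) -> list:
    """Apply the full-stop exceptions, normalize, then split into labels."""
    mapped = "".join(FULL_STOP_EXCEPTIONS.get(ch, ch) for ch in host_name)
    # NFKC stands in for the remaining table mappings in this sketch.
    return unicodedata.normalize("NFKC", mapped).split(".")

# NFKC alone turns U+FF61 into U+3002, not into U+002E.
halfwidth_nfkc = unicodedata.normalize("NFKC", "\uFF61")
```

U+FF0E FULLWIDTH FULL STOP needs no exception, since NFKC already maps it to U+002E.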

2.3.3. Retain Corrigendum #4: Five Unihan Canonical Mapping Errors

These are characters whose normalizations changed after Unicode 3.2 (all of them were in Unicode 4.0.0). While the set of characters that are normalized to different values has been stable in Unicode, the results have not been. We anticipate that as of Unicode 5.1, normalization will be completely stabilized, so these would be the first and last such characters.

U+2F868 ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F868
=> U+2136A ( ? ) CJK UNIFIED IDEOGRAPH-2136A

U+2F874 ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F874
=> U+5F33 ( ? ) CJK UNIFIED IDEOGRAPH-5F33

U+2F91F ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F91F
=> U+43AB ( ? ) CJK UNIFIED IDEOGRAPH-43AB

U+2F95F ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F95F
=> U+7AAE ( ? ) CJK UNIFIED IDEOGRAPH-7AAE

U+2F9BF ( ? ) CJK COMPATIBILITY IDEOGRAPH-2F9BF
=> U+4D57 ( ? ) CJK UNIFIED IDEOGRAPH-4D57
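As a sanity check on why these five entries have to be listed explicitly, the sketch below compares each retained target with what a current normalizer produces. It assumes Python's unicodedata module (which implements the corrected, post-Corrigendum #4 data); the names RETAINED and differs_from_current_nfkc are hypothetical.

```python
import unicodedata

# Exceptional mappings listed above: source -> retained IDNA2003-era target.
RETAINED = {
    0x2F868: 0x2136A,
    0x2F874: 0x5F33,
    0x2F91F: 0x43AB,
    0x2F95F: 0x7AAE,
    0x2F9BF: 0x4D57,
}

def differs_from_current_nfkc(source: int, retained_target: int) -> bool:
    """True if today's NFKC result is not the retained target."""
    current = unicodedata.normalize("NFKC", chr(source))
    return current != chr(retained_target)

changed = {
    f"U+{src:05X}": differs_from_current_nfkc(src, dst)
    for src, dst in RETAINED.items()
}
```

If any entry in changed were False, the corresponding exception would be redundant with plain NFKC.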

References (informal, at this point)

Thanks to Harald for many useful comments.

L2/08-100