L2/07-073

Network Working Group	M. Suignard, Ed.
Internet-Draft	Microsoft Corporation
Intended status: Standards Track	M. Davis
Expires: August 11, 2007	Google
	A. Freytag
	ASMUS Inc.
	February 7, 2007

Preparation of Internationalized Domain Names (idnaprep)
draft-suignard-idnaprep-00

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on August 11, 2007.

Copyright Notice

Abstract

This document describes how to prepare internationalized domain name (IDN) labels in order to increase the likelihood that name input and name comparison work in ways that make sense for typical users throughout the world.

This document may replace the functionality described in RFC3491 (nameprep), but unlike nameprep it does not rely on a framework such as stringprep.

1. Introduction
    1.1. Terminology
    1.2. Using idnaprep in protocols
2. Preparation Overview
3. Idnaprep character repertoire
4. Mapping of Joiner and Non Joiner characters
5. Normalization
6. Combining Marks
7. Bidirectional Characters
8. Unassigned Code Points in idnaprep
    8.1. idnaprep profiles
    8.2. Usage scenarios
9. Security Considerations
    9.1. Idnaprep-specific security considerations
    9.2. Generic Unicode security considerations
10. IANA Considerations
11. Acknowledgements
Appendix A. Character properties specification
Appendix B. Idnaprep Unicode 5.0 profile
12. References
    12.1. Normative References
    12.2. Informative References
§ Authors' Addresses
§ Intellectual Property and Copyright Statements

1. Introduction

This document specifies processing rules that will allow users to enter internationalized domain names (IDNs) into applications and have the highest chance of getting the content of the strings correct. These processing rules are only intended for internationalized domain names, not for arbitrary text.

The processing rules include the following steps:

Usage of a selected character repertoire
Mapping of Zero Width Joiner and Non Joiner characters
Unicode normalization NFC
Check combining marks
Check bidirectional handling

Idnaprep converts a single string of input characters (input-label) to a string of output characters (U-label), or returns an error if the output string would contain prohibited output (because of the repertoire restriction or a failure in one of the checking steps). In many cases, the input characters are unchanged and the process is a simple validation according to the rules specified by this document. Idnaprep cannot both emit a string and return an error.

Idnaprep cannot account for all of the variations that might occur or that a user might expect. In particular, it will not be able to account for choice of spellings in all languages for all scripts because the number of alternative spellings of words and phrases is immense. Users would probably expect all spelling equivalents to be made equivalent, or none of them to be. Examples of spelling equivalents include "theater" vs. "theatre", and "hemoglobin" vs. "hU+00E6moglobin" in American vs. British English. Other examples are simplified Chinese spellings of names (for example,"<U+7EDF, U+4E00, U+7801>") vs. the equivalent traditional Chinese spelling (for example, "<U+7D71, U+4E00, U+78BC>"). Language-specific equivalences such as "Aepfel" vs. "U+00C4pfel", which are sometimes considered equivalent in German, may not be considered equivalent in other languages.

This document is an input for the planned update of IDN processing rules. Its status as a formal update of one or more existing RFCs is not determined at this point. It covers processing rules similar to what is described in nameprep [RFC3491] (Hoffman, P. and M. Blanchet, “Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN),” March 2003.) which is itself a profile of stringprep [RFC3454] (Hoffman, P. and M. Blanchet, “Preparation of Internationalized Strings ("stringprep"),” December 2002.). It addresses issues that were raised in the context of Internationalized Domain Names in Applications [RFC3490] (Faltstrom, P., Hoffman, P., and A. Costello, “Internationalizing Domain Names in Applications (IDNA),” March 2003.). Some of these issues are about bidirectional strings [IDNABidi] (Alvestrand, H. and C. Karp, “An IDNA problem in right-to-left scripts,” October 2006.), others about repertoire [IDNARepertoire] (Falstrom, P., “The Unicode Codepoints and IDN,” October 2006.), others in all aspects of stringprep [IDNABis] (Klensin, J., “Proposed Issues and Changes for IDNA - An Overview,” October 2006.).

Although this document makes heavy use of the concept of a restricted character repertoire, its current draft does not cover that concept in detail. The proposal concerning that aspect is made in [IDNARepertoire] (Falstrom, P., “The Unicode Codepoints and IDN,” October 2006.). Instead, this document concentrates on the simplified label preparation still required for IDN string processing.

Using a restricted repertoire allows the validation rules and their implementation to be much simplified compared to earlier IDN string preparation schemes.

This document uses Unicode character properties to group characters into classes, instead of enumerated lists of characters. Because an increasing number of key Unicode properties are guaranteed to be stable and thus provide backward compatibility, using character properties in this manner will extend the applicability of this document when new Unicode versions are created.

This document uses IANA registered profiles to accommodate the evolution of the Unicode Standard. The creation of new profiles does not require a new version of idnaprep. Appendix B contains the profile defined for Unicode 5.0. Future profiles can be directly registered with IANA without inclusion in the idnaprep specification itself.

This document borrows heavily from concepts and principles that were introduced in stringprep and nameprep. Please see the acknowledgement section for details.

1.1. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC 2119 [RFC2119] (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.).

Note: A glossary of terms used in the Unicode Standard [Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.) and ISO/IEC 10646 [ISO10646] (International Organization for Standardization, “Information Technology - Universal Multiple-Octet Coded Character Set (UCS),” 2003.) can be found in [Glossary] (The Unicode Consortium, “Unicode Glossary,” September 2006.). Information on the 10646/Unicode character encoding model can be found in [CharModel] (Whistler, K., Davis, M., and A. Freytag, “Character Encoding Model.,” September 2004.). The character repertoires of the Unicode Standard and ISO/IEC 10646, and many other features such as the Bidirectional Algorithm and Normalizations are synchronized. Further references to the common set will be done using the Unicode versions. Only features that are unique to either standard will be referenced as such.

Code points not assigned to characters are called "unassigned" code points and are specified in appendix A.

Editor's note: The concept of accepting unassigned code points in some contexts is currently being evaluated and may be removed from this specification in the future.

Character names in this document use the notation for code points and names from the Unicode Standard. For example, the letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER A". In lists of characters, the "U+" may be left off to make the lists easier to read. Sequences of characters may be represented using the UCS Sequence Identifiers or USI specified in ISO/IEC 10646 [ISO10646] (International Organization for Standardization, “Information Technology - Universal Multiple-Octet Coded Character Set (UCS),” 2003.). A USI has the form:

   <UID1, UID2,...UIDn>

where each UIDi represents a short identifier for the code point -- most commonly the U+ notation mentioned above. Comments for character ranges are shown in square brackets (such as "[CONTROL CHARACTERS]") and do not come from the standards.

1.2. Using idnaprep in protocols

The idnaprep protocol does not stand on its own; it has to be used by other protocols at precisely-defined places in those other protocols. For example, a protocol that has strings that come from the entire Unicode [Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.) character repertoire might specify that only strings that have been processed with idnaprep are legal when processing domain names.

Section 3 lists the subset of the Unicode repertoire that can be used in the context of internationalized domain names. Although this document references Unicode version 5.0, it is designed to accommodate further evolution of the Unicode Standard. Because it uses a set of stable Unicode properties and algorithms, this document will not need to be revised when new Unicode versions are created. This is a very important design goal of idnaprep.

However, for each new version of Unicode used by idnaprep, the allowed IDN character repertoire is likely to grow. Although solely defined by references to the Unicode Standard, these sets of references will need to be registered with IANA to allow the growth of stored strings complying with idnaprep. Each of these sets of references constitutes a profile of idnaprep. Section 8 of this document covers these points.

If the protocol using idnaprep specifies that unassigned code points can be used, idnaprep accepts as input both the restricted repertoire and the code points that are unassigned in the implemented profile.

Unlike other string preparations, idnaprep does not include case folding. Protocols and applications using idnaprep may, however, consider case folding before calling idnaprep.

Editor's note: This is, however, an open issue; an update to this document may introduce a case folding step if deemed preferable.

2. Preparation Overview

The steps for preparing strings are:

1. Check repertoire -- For each character in the input, check that it is part of the allowed repertoire, if not return an error. This is described in section 3. The allowed repertoire may include unassigned code points.
2. Map ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER to either themselves or nothing. This is described in section 4.
3. Normalize -- Normalize the result of step 2 using Unicode normalization form NFC. This is described in section 5.
4. Check combining marks -- Check for any starting combining marks. If such an occurrence is found, return an error. This is described in section 6.
5. Check bidi -- Check for right-to-left characters, and if any are found, make sure that the whole string satisfies the requirements for bidirectional strings. If the string does not satisfy the requirements for bidirectional strings, return an error. This is described in section 7.

The above steps MUST be performed in the order given to comply with this specification.

The mappings described in section 4, and the Unicode normalization described in section 5, can be one-to-none, one-to-one, one-to-many, many-to-one, or many-to-many. That is, some characters might be eliminated or replaced by more than one character, and the output of this step might be shorter or longer than the input. Because of this, the system using idnaprep MUST be prepared to receive a longer or shorter string than the one input to the idnaprep algorithm.
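The ordering of the steps can be illustrated with a minimal Python sketch. It is not a conforming implementation: the names `prepare`, `IdnaprepError`, `is_permitted`, `map_joiners`, and `bidi_ok` are hypothetical, the repertoire test is supplied by the caller because this draft does not yet define it, the joiner mapping and bidi check default to placeholders, and Python's `unicodedata` module reflects the Unicode version bundled with the interpreter rather than Unicode 5.0.

```python
import unicodedata
from typing import Callable


class IdnaprepError(ValueError):
    """Raised instead of returning a U-label; a call never does both."""


def prepare(
    label: str,
    is_permitted: Callable[[str], bool],              # hypothetical repertoire test
    map_joiners: Callable[[str], str] = lambda s: s,  # placeholder for section 4
    bidi_ok: Callable[[str], bool] = lambda s: True,  # placeholder for section 7
) -> str:
    # Step 1: every input character must be in the allowed repertoire.
    for ch in label:
        if not is_permitted(ch):
            raise IdnaprepError(f"U+{ord(ch):04X} is outside the repertoire")
    # Step 2: map ZWJ and ZWNJ to themselves or to nothing.
    mapped = map_joiners(label)
    # Step 3: normalize to NFC; the result may be shorter or longer than the input.
    normalized = unicodedata.normalize("NFC", mapped)
    # Step 4: the string must not start with a combining mark (Mn or Mc).
    if normalized and unicodedata.category(normalized[0]) in {"Mn", "Mc"}:
        raise IdnaprepError("label starts with a combining mark")
    # Step 5: strings containing right-to-left characters must satisfy the bidi rules.
    if not bidi_ok(normalized):
        raise IdnaprepError("bidirectional requirements not met")
    return normalized
```

Raising an exception on failure, rather than returning a partial result, mirrors the rule that idnaprep cannot both emit a string and return an error.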

Editor's note: Other steps may be required to accommodate additional contextual validation. For example U+00B7 may be allowed after U+006C LATIN SMALL LETTER L.

Note that the successful preparation of a U-label from an input-label does not guarantee that the U-label is valid from a DNS point of view. The U-label needs to be converted to Punycode [RFC3492] (Costello, A., “Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA),” March 2003.) to verify that the resulting label has no more than 63 ASCII code points.
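As an illustration of this length check, the following Python sketch uses the interpreter's built-in "punycode" codec. The function names are hypothetical, and the "xn--" ACE prefix is taken from RFC 3490 rather than from this document; the sketch assumes that the prefix counts toward the 63-code-point limit.

```python
ACE_PREFIX = "xn--"    # defined by RFC 3490, not by this document
MAX_LABEL_LENGTH = 63  # limit on ASCII code points in a DNS label


def ace_length(u_label: str) -> int:
    """Length of the label as it would appear in the DNS."""
    if u_label.isascii():
        # A pure-ASCII label is used as is.
        return len(u_label)
    # Python's built-in "punycode" codec implements the RFC 3492 encoding.
    return len(ACE_PREFIX) + len(u_label.encode("punycode"))


def fits_in_dns_label(u_label: str) -> bool:
    return ace_length(u_label) <= MAX_LABEL_LENGTH
```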

3. Idnaprep character repertoire

The character repertoire usable in the context of Internationalized Domain Names is a subset of the Unicode repertoire. Registered idnaprep profiles are linked to Unicode versions and MUST specify the restricted repertoire they use. An error is returned by idnaprep if characters outside that repertoire are used. Appendix B specifies the profile for Unicode 5.0 [Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.).

Editor's note: the current draft of this document does not yet specify the idnaprep Unicode repertoire in full detail in appendix A. It is expected that further developments made in [IDNARepertoire] (Falstrom, P., “The Unicode Codepoints and IDN,” October 2006.) or similar initiatives such as in http://www.unicode.org/~whistler/IDNPermitted.txt will result in a new Unicode character property which will determine the character set suitable for idnaprep. When available, a future version of this document will reference that new property. However, this document assumes that the repertoire is mostly restricted to letters (including syllables and ideographs) and digits belonging to modern-use scripts, and that it contains few (if any) symbol or punctuation characters besides U+002D HYPHEN-MINUS. This document refers to that property as 'IDN_permitted'.

Editor's note: The repertoire is built by using the list of all characters that meet the following various criteria:

GeneralCategory(cp) in {Ll, Lo, Lm, Mn, Mc, Nd}
cp == NFKC(cp)
not one of the primarily historic scripts
excluding combining marks associated with symbols
adding back some symbols.

However, none of these rules need to be exposed in this document, only the result of having a character being part of the restricted repertoire or not.
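For illustration only, the first two criteria of the editor's note can be approximated with Python's standard `unicodedata` module. The function name is hypothetical; the script-based exclusions and the symbols that are added back are not covered, because the standard library carries no script data and this draft does not enumerate them. Results reflect the Unicode version bundled with the interpreter, not necessarily Unicode 5.0.

```python
import unicodedata

# General categories named in the editor's note.
CANDIDATE_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}


def meets_basic_criteria(ch: str) -> bool:
    """True if ch passes the general-category and NFKC-stability criteria."""
    return (
        unicodedata.category(ch) in CANDIDATE_CATEGORIES
        and unicodedata.normalize("NFKC", ch) == ch
    )
```

Under these two criteria alone, U+0041 LATIN CAPITAL LETTER A fails the category test, U+FB01 LATIN SMALL LIGATURE FI fails the NFKC test, and U+002D HYPHEN-MINUS is rejected until the "adding back some symbols" rule is applied.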

Note that some characters may be added to the repertoire of characters with the IDN_Permitted property in the future, as additional characters are added to the Unicode Standard. This would be the case, for example, when additional minority scripts are added to the standard. However, the maintenance of the IDN_Permitted property is bound by the stability guarantee that once a character is assigned that property, the property can never be removed from the character. In other words, the inclusion table may grow, but once a character is in the table, it can never be removed.

4. Mapping of Joiner and Non Joiner characters

Among the character repertoire allowed for idnaprep, two characters have a special status. These are U+200C ZERO WIDTH NON-JOINER (ZWNJ) and U+200D ZERO WIDTH JOINER (ZWJ). Although they are allowed in the restricted repertoire, idnaprep removes them from the output in most cases because they carry no meaning in these cases.

However, because in certain languages these two characters may carry meaning, there is a special rule preserving the ZWNJ or ZWJ in the following contexts:

ZWNJ breaking a cursive connection in Arabic--

An "Arabic" "Right-Joining" character, followed by zero or more "Transparent" characters, followed by a ZWNJ, followed by zero or more "Transparent" characters, followed by a "Left-Joining" character.

ZWNJ used in a conjunct context--

A "Letter", followed by zero or more "Combining Marks", followed by a "Virama", followed by zero or more "Combining Marks", followed by a ZWNJ, followed by zero or more "Combining Marks", followed by a "Letter".

ZWJ used in a conjunct context--

A "Letter", followed by zero or more "Combining Marks", followed by a "Virama", followed by zero or more "Combining Marks", followed by a ZWJ.

These contexts imply that a single "script" is used within the expression, excluding the "Combining Marks" and "Viramas" which may use "Common" or "Inherited" script values.

Editor's note: The Unicode Consortium is currently investigating whether these special circumstances for ZWJ and ZWNJ can be narrowed further. See http://www.unicode.org/review/pr-96.html.

The character properties: "Arabic", "Right-Joining", "Transparent", "Left-Joining", "Combining Marks", "Letter", "Virama", and "script" are defined in appendix A.
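The two conjunct contexts can be sketched in Python as follows. This is a partial illustration under stated assumptions: the Arabic cursive context is not implemented because the joining types from "ArabicShaping.txt" are not available in Python's standard library (so a ZWNJ in that context would be wrongly removed here), the single-script condition is not checked, and every "Virama" is assumed to also be a "Combining Mark". All function names are hypothetical.

```python
import unicodedata

ZWNJ = "\u200c"
ZWJ = "\u200d"


def is_letter(ch: str) -> bool:
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_mark(ch: str) -> bool:
    return unicodedata.category(ch) in {"Mn", "Mc"}


def is_virama(ch: str) -> bool:
    return unicodedata.combining(ch) == 9


def conjunct_precedes(s: str, i: int) -> bool:
    """True if s[:i] ends with Letter, marks*, Virama, marks*."""
    j = i - 1
    seen_virama = False
    while j >= 0 and is_mark(s[j]):
        seen_virama = seen_virama or is_virama(s[j])
        j -= 1
    return seen_virama and j >= 0 and is_letter(s[j])


def letter_follows(s: str, i: int) -> bool:
    """True if s[i+1:] starts with marks*, Letter."""
    j = i + 1
    while j < len(s) and is_mark(s[j]):
        j += 1
    return j < len(s) and is_letter(s[j])


def map_conjunct_joiners(s: str) -> str:
    """Keep ZWJ/ZWNJ only in the conjunct contexts; remove them elsewhere."""
    out = []
    for i, ch in enumerate(s):
        if ch == ZWJ:
            keep = conjunct_precedes(s, i)
        elif ch == ZWNJ:
            keep = conjunct_precedes(s, i) and letter_follows(s, i)
        else:
            keep = True
        if keep:
            out.append(ch)
    return "".join(out)
```

For example, in <U+0915, U+094D, U+200D, U+0937> the ZWJ follows a Devanagari letter and virama and is preserved, whereas in <U+0061, U+200D, U+0062> it is removed.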

5. Normalization

The output of the mapping step is normalized using the Unicode normalization with form C, as described in [UAX15] (Davis, M. and M. Duerst, “Unicode Normalization Forms,” October 2006.). This step is always successful, in other words it never returns an error.

Note that although the restricted character repertoire is stable through normalization, the normalization step is still necessary. There are three reasons for this requirement:

combining sequences that are made of elements of the restricted repertoire are normalized into composite characters (for example, <U+0061, U+0301> (LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT) becoming U+00E1 LATIN SMALL LETTER A WITH ACUTE),
combining mark re-ordering,
Hangul Jamos/syllables composition.
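The three effects listed above can be observed with Python's standard `unicodedata` module. This is only a demonstration; the module uses the Unicode version bundled with the interpreter, and the helper name `nfc` is an assumption of this sketch.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


# Composition: <U+0061, U+0301> becomes U+00E1.
composed = nfc("a\u0301")

# Re-ordering: U+0323 (combining class 220) is placed before U+0301 (class 230).
# The base letter "q" has no precomposed forms, so only the order changes.
reordered = nfc("q\u0301\u0323")

# Hangul: the jamos <U+1100, U+1161> compose into the syllable U+AC00.
syllable = nfc("\u1100\u1161")

assert composed == "\u00e1"
assert reordered == "q\u0323\u0301"
assert syllable == "\uac00"
```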

The restricted repertoire is designed in a way that all versions of the Unicode normalization form C starting from Unicode 5.0 will provide the same result for that repertoire.

Note that the Unicode normalization of any unassigned code point is always invariant. This is consistent with the requirement that the restricted repertoire of an idnaprep profile contain no character that changes value when normalized to normalization form C by itself. Additions to the restricted repertoire in idnaprep profiles for future Unicode versions MUST NOT include any character that changes value when normalized using NFC.

Idnaprep itself references the version of the normalization form C defined by Unicode 5.0 [UAX15] (Davis, M. and M. Duerst, “Unicode Normalization Forms,” October 2006.). Accordingly, idnaprep profiles MUST NOT reference NFC themselves.

Note that other string preparations use the Unicode normalization form KC (NFKC), which maps many "compatibility characters" to their equivalent character. However, because the restricted repertoire is stable through normalization (i.e. NFKC(cp)=cp), in other words because it excludes compatibility characters that could be mapped by NFKC, it is unnecessary to use NFKC for the normalization of the string in the context of idnaprep. This choice of NFC is also consistent with the recommendation for Internationalized Resource Identifiers (IRIs) [RFC3987] (Duerst, M. and M. Suignard, “Internationalized Resource Identifiers (IRIs),” January 2005.). However, an application may still use NFKC to filter user input before applying idnaprep.

6. Combining Marks

Combining marks constitute a special character class: a combining mark typically combines with the first preceding non-combining character, together with any combining marks between them. See the Unicode Standard [Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.) for further details.

It is undesirable to have a combining mark appear as the first character of a string because it may combine with a character preceding the string, and which is therefore out of context.

A string MUST NOT start with a character having the "Combining Mark" property (see appendix A). An error is returned by idnaprep if this requirement is not satisfied.
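A minimal Python sketch of this check, using the "Combining Mark" definition from appendix A (General_Category "Mc" or "Mn"); the function name is hypothetical, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def starts_with_combining_mark(label: str) -> bool:
    """True if the first character has General_Category Mn or Mc."""
    return bool(label) and unicodedata.category(label[0]) in {"Mn", "Mc"}
```

An implementation would return an error whenever this function yields True for the normalized string.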

7. Bidirectional Characters

Most characters are displayed from left to right, but some are displayed from right to left. This feature of Unicode is called "bidirectional text", or "bidi" for short. The Unicode Standard has an extensive discussion of how to reorder glyphs for display when dealing with bidirectional text such as Arabic or Hebrew. See the Unicode Bidirectional Algorithm [UAX9] (Davis, M., “The Bidirectional Algorithm,” September 2006.) for more information. In particular, all Unicode text is stored in logical order.

[Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.) defines several bidirectional categories; each character has one bidirectional category assigned to it. For the purposes of the requirements below, three categories are used:

RCat character

Characters belonging to right-to-left scripts such as Hebrew, Arabic, Thaana, etc.

LCat character

Characters belonging to left-to-right scripts such as Latin, Greek, Cyrillic, etc.

NSMCat

Combining marks.

The character properties: "RCat", "LCat", and "NSMCat" are defined in appendix A.

The Unicode Bidirectional Algorithm [UAX9] (Davis, M., “The Bidirectional Algorithm,” September 2006.) can result in various rearrangements of characters according to their direction. To prevent characters from rearranging across field boundaries, the following three requirements MUST be met if a string contains any RCat characters. An error is returned by idnaprep if these requirements are not satisfied.

1. The string MUST NOT contain any "LCat" character,
2. The string MUST start with an "RCat" character,
3. The string MUST either end with an "RCat" character, or end with an "RCat" character followed by a sequence of "NSMCat" characters.

Note that requirement 3 prohibits strings such as <U+0627, U+0031> ("aleph 1") but allows strings such as <U+0627, U+0031, U+0628> ("aleph 1 beh"), and <U+078B, U+07A8, U+0788, U+07AC, U+0780, U+07A8> ("Divehi" in Thaana script, ending with an "NSMCat" character). [UAX9] (Davis, M., “The Bidirectional Algorithm,” September 2006.) goes into great detail about the display order of strings that contain particular categories of characters in particular sequences.
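The three requirements can be sketched in Python using the Bidi_Class values from appendix A. The function name `bidi_ok` is hypothetical, and the bidirectional classes come from the Unicode version bundled with the interpreter.

```python
import unicodedata

RCAT = {"R", "AL"}  # Bidi_Class values that make a character "RCat"


def bidi_ok(label: str) -> bool:
    """True if the label satisfies the three bidirectional requirements."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RCAT for c in classes):
        return True  # the requirements apply only when an RCat character is present
    if "L" in classes:
        return False  # requirement 1: no LCat character
    if classes[0] not in RCAT:
        return False  # requirement 2: must start with an RCat character
    # Requirement 3: skip trailing NSMCat characters, then expect an RCat character.
    i = len(classes) - 1
    while i >= 0 and classes[i] == "NSM":
        i -= 1
    return i >= 0 and classes[i] in RCAT
```

With this sketch, the "aleph 1" example is rejected because it ends with a European digit, while "aleph 1 beh" and the Thaana example are accepted.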

8. Unassigned Code Points in idnaprep

This section describes two different types of strings in typical protocols where internationalized strings are used: "stored strings" and "queries". In general, "stored strings" are strings that are used in named entities, such as DNS domain name parts. "Queries" are strings that are used to match against strings that are stored identifiers, such as user-entered names for DNS lookups.

All code points not assigned in the Unicode version establishing an idnaprep character repertoire are called "unassigned" code points. Stored strings using idnaprep MUST NOT contain any unassigned code points. Queries for matching strings MAY contain unassigned code points. Note that this is the only part of this document where the requirements for queries differ from the requirements for stored strings.

Editor's note: The concept of accepting unassigned code points for matching strings is currently being evaluated and may be removed from this specification in the future.

The goal of the requirements in this section is to allow an older profile of idnaprep to match a stored string created with a newer profile, while disallowing any profile to create a stored string using unassigned code points.

8.1. idnaprep profiles

When a new version of the Unicode repertoire appears, idnaprep SHOULD NOT be updated. Instead a new profile MAY be registered with IANA to describe the new associated restricted repertoire and the set of Unicode data files to be used with that restricted repertoire when performing idnaprep. If the restricted repertoire is enlarged, a new profile MUST be created. The restricted repertoire MUST NOT be reduced by a new profile.

Each code point in a repertoire named by a profile of idnaprep can be categorized by how it acts in the process described in earlier sections of this document:

D -- Code points that cannot be in the output because they are disallowed in the repertoire checking step
MN -- Code points that cannot be in the output because they never appear as output from mapping or normalization
AO -- Code points that can be in the output
U -- Unassigned code points

A subsequent profile of idnaprep that references a newer version of a Unicode repertoire with new code points will inherently have some code points move from category U to D, MN, or AO. For backward compatibility, the following rules are provided for a subsequent profile of idnaprep concerning the other code points:

AO and MN code points MUST NOT move to another category.
D code points SHOULD NOT move to another category.

Stored strings MUST NOT contain any code points outside of AO for any profile of idnaprep. That is, they are forbidden to contain code points from the MN, D, or U categories.

Applications creating queries MUST treat U code points as if they were AO when preparing the query to be entered in the process described by a profile of idnaprep.
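The category rules above can be summarized in a short Python sketch. The enum, the function names, and the verdict strings are hypothetical, and the treatment of queries is one reading of the preceding paragraph (U handled like AO), not a normative statement.

```python
from enum import Enum


class Category(Enum):
    D = "disallowed by the repertoire check"
    MN = "never output by mapping or normalization"
    AO = "can be in the output"
    U = "unassigned"


def transition_verdict(old: Category, new: Category) -> str:
    """Classify a category change between an older and a newer profile."""
    if old is new:
        return "ok"
    if old is Category.U:
        return "ok"  # newly assigned code points move to D, MN, or AO
    if old in (Category.AO, Category.MN):
        return "violates MUST NOT"
    return "violates SHOULD NOT"  # old is D


def allowed_in_stored_string(cat: Category) -> bool:
    return cat is Category.AO


def acceptable_in_prepared_query(cat: Category) -> bool:
    # Queries treat U code points as if they were AO.
    return cat in (Category.AO, Category.U)
```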

Any other change not related to code point reassignment, such as a change to the combining mark check or to the bidirectional check, may introduce backward compatibility issues that could require a new version of idnaprep, not just a new profile. As such, these changes should be avoided as much as possible.

8.2. Usage scenarios

Typically, applications do not invoke idnaprep directly; they call a software library which contains an idnaprep implementation associated with an idnaprep profile. The exact profile used by the library may be unknown to the application.

The following diagram shows an example of data flow for both the creation and the query of a string.

input --> application --> library --> idnaprep profile --> resolver

Let us assume two profiles associated respectively with Unicode 5.0 and a hypothetical future version 5.x, named P50 and P5X, and two libraries, the first one implementing P50, the second one implementing P5X. The following scenarios are possible, assuming that the input content can be made of any assigned character in that future Unicode 5.x:

The same application can create stored content based on P50 or P5X libraries depending on which library it invokes.
An application querying with P50 library against stored P5X names may succeed in matching as long as the input data belongs to the P5X AO set.
An application querying with P5X library against stored P50 names may succeed in matching as long as the input data belongs to the P50 AO set.
An application querying with P50 library against stored P5X names will fail if the input data is not valid according to either P50 (unassigned) or P5X (assigned).
An application querying with P5X library against stored P50 names will fail if the input data is not valid according to P5X (unassigned) or P50 (assigned).

From this, it can be seen that querying applications will mostly succeed independently of the profile being used, as long as the input repertoire is unfiltered by the application and is valid according to the latest definition of the profile. On the other hand, applications creating stored names should always use the latest profile to maximize their success rate.

It is always preferable for an application to use a library implementing the latest idnaprep profile; however, the success rate of a query will be vastly improved by ensuring the following:

Proper placement of combining marks,
Valid bidirectional sequence,
Lower case only characters for bicameral scripts.

There is, however, one exceptional case: a code point X that is unassigned in an older profile and assigned as a combining mark in a newer one. The order of combining marks is normalized, so if another newer combining mark Y has a lower combining class than X, then XY will be put in the canonical order YX. (Unassigned code points are never reordered, so this does not happen in an older profile.) If the query contains YX, the query will get a positive match with the stored string. However, no string can be stored with XY, so a query with XY will get a negative answer to the test for matching.

9. Security Considerations

Idnaprep is used with Unicode characters. There are security considerations that are specific to idnaprep, and others that are generic to using Unicode.

9.1. Idnaprep-specific security considerations

The Unicode repertoire has many characters that look similar. In many cases, users of security protocols might do visual matching, such as when comparing the names of trusted third parties. Because it is impossible to map similar-looking characters without a great deal of context such as knowing the fonts used, idnaprep does nothing to map similar-looking characters together nor to prohibit some characters because they look like others. User applications can help disambiguate some similar-looking characters by showing the user when the script changes within a string.

Note, however, that the repertoire restriction significantly reduces the risk created by similar-looking characters.

9.2. Generic Unicode security considerations

Protocols that use idnaprep usually also use encodings of Unicode, such as UTF-8 or UTF-16. Some applications using those encodings have been known to not check for ill-formed sequences in the encodings, and thereby have not detected sequences of octets that would have been detected if they used just ASCII. For example, in UTF-8 the octet sequence "0xC0 0xAB" is an ill-formed sequence for U+002B (plus sign). All programs MUST reject any string that is an ill-formed octet sequence for the encoding being used.
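As an illustration, Python's built-in UTF-8 decoder rejects the ill-formed sequence mentioned above when strict error handling is used. The wrapper name is hypothetical; the point of the sketch is that decoding must fail rather than silently produce U+002B.

```python
def decode_utf8_strict(octets: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on any ill-formed sequence."""
    return octets.decode("utf-8", errors="strict")


try:
    decode_utf8_strict(b"\xc0\xab")  # overlong form of U+002B
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Error handlers such as "ignore" or "replace" would accept the input in altered form and are therefore unsuitable for this purpose.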

Both Unicode normalization and conversion between Unicode encodings can cause strings to grow or shrink. Programs that use fixed-size buffers, or that make assumptions that buffers will always be greater than or less than particular sizes, are likely to fail in insecure fashions when using Unicode normalization or encoding conversions.

Covering an extensive list of security threats and considerations on the use of current and future versions of Unicode is outside of the scope of this document. Additional considerations are available in [UTR36] (Davis, M. and M. Suignard, “Unicode Security Considerations,” August 2006.) and [UTS39] (Davis, M. and M. Suignard, “Unicode Security Mechanisms,” August 2006.).

10. IANA Considerations

Idnaprep versions MUST have IETF consensus as described in [RFC2434] (Narten, T. and H. Alvestrand, “Guidelines for Writing an IANA Considerations Section in RFCs,” October 1998.). Each of its profiles MUST be reviewed by the IESG before it is registered. The IESG MAY change a profile before registration.

IANA has set up a registry of idnaprep profiles. This registry is a single text file that lists the known profiles. Each entry in the registry has three fields:

Profile name
RFC in which the profile is defined
Indicator whether or not this is the newest version of the profile

Each profile will remain listed in the registry forever. That is, if a newer profile of idnaprep is created, both profiles will continue to be listed in the registry, but the current version indicator will be turned off for the earlier profile and turned on for the newer profile.

To improve the stability of IDN, new profiles SHOULD be created sparingly, only when Unicode repertoire additions justify character addition in the idnaprep restricted repertoire.

11. Acknowledgements

Above all, this document uses large fragments of text from stringprep and nameprep, which were authored by Marc Blanchet and Paul Hoffman. This has vastly simplified the authoring of the terminology and the creation of the overall structure of this document. Many ideas and principles from these documents were also preserved. As such they deserve a large part of the credit for this new document.

The structure and principles in this document have also benefited from detailed discussions and feedback from Harald Tveit Alvestrand, Patrik Faltstrom, Cary Karp, John C Klensin, and Ken Whistler.

Appendix A. Character properties specification

This appendix specifies the Unicode character properties used by this document. All subsequent references to ".txt" file names in this appendix imply that these files belong to the Unicode Character Database [UCD] (The Unicode Consortium, “Unicode Character Database,” July 2006.).

The following table describes all Unicode character properties referenced by this document:

Property name	Description
script	The "script" property is a string value associated with each character and is determined by "Scripts.txt"
Arabic	Character with "Arabic" "script" value
Right-Joining	Character with "R" Joining Type as specified by "ArabicShaping.txt"
Transparent	Character with "T" Joining Type as specified by "ArabicShaping.txt"
Left-Joining	Character with "L" Joining Type as specified by "ArabicShaping.txt"
Letter	Character with General_Category value of "Lu", "Ll", "Lt", "Lm", or "Lo" as specified in "UnicodeData.txt"
Combining Mark	Character with General_Category value of "Mc" or "Mn" as specified in "UnicodeData.txt"
Virama	Character with Canonical_Combining_Class value equal to "9" as specified in "UnicodeData.txt"
RCat	Character with Bidi_Class value of "R" or "AL" as specified in "UnicodeData.txt"
LCat	Character with Bidi_Class value of "L" as specified in "UnicodeData.txt"
NSMCat	Character with Bidi_Class value of "NSM" as specified in "UnicodeData.txt"
unassigned	Code point with General_Category value of "Cn" in "UnicodeData.txt"
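
Several of the properties in the table above can be approximated with Python's standard "unicodedata" module, as in the sketch below. The function names are illustrative assumptions. The module does not expose the "script" property or the Joining Type, so those rows are omitted here and would require parsing "Scripts.txt" and "ArabicShaping.txt" directly. The module also reports data for the Unicode version bundled with the interpreter (see "unicodedata.unidata_version"), which may differ from the version a given profile requires.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def is_letter(ch):
    return unicodedata.category(ch) in LETTER_CATEGORIES

def is_combining_mark(ch):
    return unicodedata.category(ch) in {"Mc", "Mn"}

def is_virama(ch):
    # Canonical_Combining_Class equal to 9
    return unicodedata.combining(ch) == 9

def is_rcat(ch):
    return unicodedata.bidirectional(ch) in {"R", "AL"}

def is_lcat(ch):
    return unicodedata.bidirectional(ch) == "L"

def is_nsmcat(ch):
    return unicodedata.bidirectional(ch) == "NSM"

def is_unassigned(ch):
    return unicodedata.category(ch) == "Cn"

# Sample characters chosen for illustration only.
print(is_letter("a"), is_lcat("a"))                       # Latin small letter a
print(is_virama("\u094d"), is_nsmcat("\u094d"))           # Devanagari sign virama
print(is_rcat("\u05d0"), is_rcat("\u0627"))               # Hebrew alef, Arabic alef
print(is_combining_mark("\u0301"))                        # combining acute accent
```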

Editor's note: If a character property is created to reference the restricted repertoire, it could be added to the table above as "IDN_Permitted" and specified as: Character included in "IDNPermitted.txt".

Appendix B. Idnaprep Unicode 5.0 profile

The Idnaprep Unicode 5.0 profile is specified by using the Unicode 5.0 version of the Unicode Character Database [UCD] (The Unicode Consortium, “Unicode Character Database,” July 2006.).

12. References

12.1. Normative References

[RFC2119]	Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997.
[UAX15]	Davis, M. and M. Duerst, “Unicode Normalization Forms,” Unicode Standard Annex #15, October 2006.
[UAX9]	Davis, M., “The Bidirectional Algorithm,” Unicode Standard Annex #9, September 2006.
[UCD]	The Unicode Consortium, “Unicode Character Database,” July 2006.
[Unicode]	The Unicode Consortium, “The Unicode Standard Version 5.0,” Addison-Wesley, Reading, MA, October 2006.

12.2. Informative References

[CharModel]	Whistler, K., Davis, M., and A. Freytag, “Character Encoding Model,” Unicode Technical Report #17, September 2004.
[Glossary]	The Unicode Consortium, “Unicode Glossary,” September 2006.
[IDNABidi]	Alvestrand, H. and C. Karp, “An IDNA problem in right-to-left scripts,” Internet-Draft, October 2006.
[IDNABis]	Klensin, J., “Proposed Issues and Changes for IDNA - An Overview,” Internet-Draft, October 2006.
[IDNARepertoire]	Faltstrom, P., “The Unicode Codepoints and IDN,” Internet-Draft, October 2006.
[ISO10646]	International Organization for Standardization, “Information Technology - Universal Multiple-Octet Coded Character Set (UCS),” ISO Standard 10646-1, with amendments 1 and 2, 2003.
[RFC2434]	Narten, T. and H. Alvestrand, “Guidelines for Writing an IANA Considerations Section in RFCs,” BCP 26, RFC 2434, October 1998.
[RFC3454]	Hoffman, P. and M. Blanchet, “Preparation of Internationalized Strings ("stringprep"),” RFC 3454, December 2002.
[RFC3490]	Faltstrom, P., Hoffman, P., and A. Costello, “Internationalizing Domain Names in Applications (IDNA),” RFC 3490, March 2003.
[RFC3491]	Hoffman, P. and M. Blanchet, “Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN),” RFC 3491, March 2003.
[RFC3492]	Costello, A., “Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA),” RFC 3492, March 2003.
[RFC3987]	Duerst, M. and M. Suignard, “Internationalized Resource Identifiers (IRIs),” RFC 3987, January 2005.
[UTR36]	Davis, M. and M. Suignard, “Unicode Security Considerations,” Unicode Technical Report #36, August 2006.
[UTS39]	Davis, M. and M. Suignard, “Unicode Security Mechanisms,” Unicode Technical Standard #39, August 2006.

Authors' Addresses

	Michel Suignard (editor)
	Microsoft Corporation
	One Microsoft Way
	Redmond, WA 98052
	U.S.A.
Phone:	+1 425 882-8080
Email:	michelsu@microsoft.com
URI:	http://www.suignard.com

	Mark Davis
	Google
	U.S.A.
Email:	mark.davis@macchiato.com or mark.davis@google.com

	Asmus Freytag
	ASMUS Inc.
	U.S.A.
Email:	asmus@unicode.org
URI:	http://home.ix.netcom.com/~asmus-inc/

Full Copyright Statement

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an “AS IS” basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgment

Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA).
