L2/07-073R

 M. Suignard, Ed.
 Microsoft Corporation
 M. Davis
 Google
 A. Freytag
 ASMUS Inc.
 March 12, 2007

Working Draft

Preparation of Internationalized Domain Names (idnaprep)

Abstract

This document describes how to prepare internationalized domain name (IDN) labels in order to increase the likelihood that name input and name comparison work in ways that make sense for typical users throughout the world.

This document is an input to the process of defining a new string preparation in the context of Internationalized Domain Names. It should not be construed as a competing initiative to the work represented by "Proposed Issues and Changes for IDNA - An overview" (aka [IDNABis] (Klensin, J., “Proposed Issues and Changes for IDNA - An Overview,” February 2007.)). It is merely a public document representing the view of experts in Unicode technology and implementers of IDN (see the Introduction for more details). It may or may not be used in part in a possible revision of IDN. It uses a format similar to Internet Drafts merely for editing convenience.

This document is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium. As such, it may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.



Table of Contents

1.  Introduction
    1.1.  Terminology
    1.2.  Using idnaprep in protocols
2.  Preparation Overview
3.  Idnaprep character repertoire
4.  Mapping of Joiner and Non Joiner characters
5.  Normalization
6.  Combining Marks
7.  Bidirectional Characters
8.  idnaprep profiles
9.  Security Considerations
    9.1.  Idnaprep-specific security considerations
    9.2.  Generic Unicode security considerations
10.  IANA Considerations
11.  Acknowledgements
Appendix A.  Unicode database references
Appendix B.  Idnaprep Unicode 5.0 profile
Appendix C.  Case folding
12.  References
    12.1.  Normative References
    12.2.  Informative References
§  Authors' Addresses
§  Intellectual Property and Copyright Statements





1.  Introduction

This document specifies processing rules that will allow users to enter internationalized domain names (IDNs) into applications and have the highest chance of getting the content of the strings correct. These processing rules are only intended for internationalized domain names, not for arbitrary text.

The processing rules include the following steps:

Idnaprep converts a single string of input characters (input-label) to a string of output characters (U-label), or returns an error if the output string would be prohibited (because of a repertoire restriction or the failure of one of the checking steps). In many cases, the input characters are unchanged and the process is a simple validation according to the rules specified by this document. Idnaprep cannot both emit a string and return an error.

This document is an input for the planned update of IDN processing rules. It is not meant as an update to any RFCs specifying IDN. It covers processing rules similar to what is described in nameprep [RFC3491] (Hoffman, P. and M. Blanchet, “Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN),” March 2003.) which is itself a profile of stringprep [RFC3454] (Hoffman, P. and M. Blanchet, “Preparation of Internationalized Strings ("stringprep"),” December 2002.). It addresses issues that were raised in the context of Internationalized Domain Names in Applications [RFC3490] (Faltstrom, P., Hoffman, P., and A. Costello, “Internationalizing Domain Names in Applications (IDNA),” March 2003.). Some of these issues concern bidirectional strings [IDNABidi] (Alvestrand, H. and C. Karp, “An IDNA problem in right-to-left scripts,” October 2006.), others concern the repertoire [IDNARepertoire] (Faltstrom, P., “The Unicode Codepoints and IDN,” October 2006.), and others concern all aspects of stringprep [IDNABis] (Klensin, J., “Proposed Issues and Changes for IDNA - An Overview,” February 2007.).

This document is aligned with the design proposed in [IDNABis] (Klensin, J., “Proposed Issues and Changes for IDNA - An Overview,” February 2007.) and aims at addressing the name processing steps described in sections 3.2.1.3 to 3.2.1.5 (Registration) and 3.2.2.3 to 3.2.2.6 (Domain Name Resolution). It also addresses the right-to-left issue described in section 6.3 and follows the design criteria described in section 8.1. In other words, this document should not be considered as a competing proposal to [IDNABis] (Klensin, J., “Proposed Issues and Changes for IDNA - An Overview,” February 2007.) but rather as a complementary piece.

Although this document makes heavy use of the concept of restricted character repertoires, its current draft does not cover that concept in detail. The proposal concerning that aspect is made in [IDNARepertoire] (Faltstrom, P., “The Unicode Codepoints and IDN,” October 2006.). Instead, this document concentrates on the simplified label preparation still required for IDN string processing.

Using a restricted repertoire allows the validation rules and their implementation to be much simpler than in earlier IDN string preparation schemes.

This document uses Unicode character properties to group characters into classes, instead of relying on enumerated lists of characters. Because an increasing number of key Unicode properties are guaranteed to be stable and thus provide backward compatibility, using character properties in this manner will extend the applicability of this document when new Unicode versions are created.

This document uses IANA registered profiles to accommodate the evolution of the Unicode Standard. The creation of new profiles does not require a new version of idnaprep. Appendix B contains the profile defined for Unicode 5.0. Future profiles can be directly registered with IANA without inclusion in the idnaprep specification itself.

This document borrows heavily from concepts and principles that were introduced in stringprep and nameprep. Please see the acknowledgement section for details.




1.1.  Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC 2119 [RFC2119] (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.).

Note: A glossary of terms used in the Unicode Standard [Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.) and ISO/IEC 10646 [ISO10646] (International Organization for Standardization, “Information Technology - Universal Multiple-Octet Coded Character Set (UCS),” 2003.) can be found in [Glossary] (The Unicode Consortium, “Unicode Glossary,” September 2006.). Information on the 10646/Unicode character encoding model can be found in [CharModel] (Whistler, K., Davis, M., and A. Freytag, “Character Encoding Model.,” September 2004.). The character repertoires of the Unicode Standard and ISO/IEC 10646, and many other features such as the Bidirectional Algorithm and Normalizations are synchronized. Further references to the common set will be done using the Unicode versions. Only features that are unique to either standard will be referenced as such.

Code points not assigned to characters are called "unassigned" code points and are specified in appendix A.

Character names in this document use the notation for code points and names from the Unicode Standard. For example, the letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER A". In lists of characters, the "U+" may be left off to make the lists easier to read. Sequences of characters may be represented using the UCS Sequence Identifiers or USI specified in ISO/IEC 10646 [ISO10646] (International Organization for Standardization, “Information Technology - Universal Multiple-Octet Coded Character Set (UCS),” 2003.). A USI has the form:

   <UID1, UID2,...UIDn>

where each UIDi represents a short identifier for the code point -- most commonly the U+ notation mentioned above. Comments for character ranges are shown in square brackets (such as "[CONTROL CHARACTERS]") and do not come from the standards.
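
As an illustration of the notation, the following Python sketch formats a string as a USI using the U+ short identifiers. The function name to_usi is hypothetical and is not defined by either standard.

```python
def to_usi(text: str) -> str:
    """Format a string as a UCS Sequence Identifier with U+ short identifiers."""
    # Each code point is written as "U+" followed by at least four hex digits.
    uids = [f"U+{ord(ch):04X}" for ch in text]
    return "<" + ", ".join(uids) + ">"


# LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT
print(to_usi("a\u0301"))  # <U+0061, U+0301>
```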




1.2.  Using idnaprep in protocols

The idnaprep protocol does not stand on its own; it has to be used by other protocols at precisely-defined places in those other protocols. For example, a protocol that has strings that come from the entire Unicode [Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.) character repertoire might specify that only strings that have been processed with idnaprep are legal when processing domain names.

Section 3 lists the subset of the Unicode repertoire that can be used in the context of internationalized domain names. Although this document is referencing Unicode version 5.0, it is designed to accommodate further evolution of the Unicode Standard. Because it uses a set of stable Unicode properties and algorithms, this document will not need to be revised when these new Unicode versions are created. This is a very important design goal of idnaprep.

However for each new version of Unicode used by idnaprep, the allowed IDN character repertoire is likely to grow. Although solely defined by references to the Unicode Standard, these sets of references will need to be registered with IANA to allow the growth of stored strings complying with idnaprep. Each of these sets of references constitutes a profile of idnaprep. Section 8 of this document covers these points.




2.  Preparation Overview

The steps for preparing strings are:

  1. Check repertoire -- For each character in the input, check that it is part of the allowed repertoire; if it is not, return an error. This is described in section 3.
  2. Map ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER to either themselves or nothing. This is described in section 4.
  3. Normalize -- Normalize the result of step 2 using Unicode normalization form NFC. This is described in section 5.
  4. Check combining marks -- Check for any starting combining marks. If such an occurrence is found, return an error. This is described in section 6.
  5. Check bidi -- Check for right-to-left characters, and if any are found, make sure that the whole string satisfies the requirements for bidirectional strings. If the string does not satisfy the requirements for bidirectional strings, return an error. This is described in section 7.

The above steps MUST be performed in the order given to comply with this specification.

The mapping described in section 4, and the Unicode normalization described in section 5, can be one-to-none, one-to-one, one-to-many, many-to-one, or many-to-many. That is, some characters might be eliminated or replaced by more than one character, and the output of this step might be shorter or longer than the input. Because of this, the system using idnaprep MUST be prepared to receive a longer or shorter string than the one input to the idnaprep algorithm.
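
The ordering of the five steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names idnaprep, IdnaprepError, and strip_joiners are hypothetical, the repertoire is supplied by the caller as a predicate because this draft does not yet define it, and the default joiner mapping simply drops every ZWNJ and ZWJ without applying the context rules of section 4. Python's unicodedata module also uses the Unicode version bundled with the interpreter, which is not necessarily Unicode 5.0.

```python
import unicodedata
from typing import Callable

ZWNJ, ZWJ = "\u200c", "\u200d"
RCAT = ("R", "AL")


class IdnaprepError(ValueError):
    """Raised when a label fails one of the checking steps."""


def strip_joiners(label: str) -> str:
    # Placeholder for step 2: removes every ZWNJ/ZWJ and ignores the
    # preserving contexts of section 4 (a simplification, see lead-in).
    return label.replace(ZWNJ, "").replace(ZWJ, "")


def idnaprep(label: str,
             in_repertoire: Callable[[str], bool],
             map_joiners: Callable[[str], str] = strip_joiners) -> str:
    # Step 1: every input character must be in the allowed repertoire.
    for ch in label:
        if not in_repertoire(ch):
            raise IdnaprepError(f"U+{ord(ch):04X} is outside the repertoire")
    # Step 2: map ZWNJ and ZWJ to themselves or to nothing.
    label = map_joiners(label)
    # Step 3: normalize to NFC; the length of the string may change here.
    label = unicodedata.normalize("NFC", label)
    # Step 4: the label must not start with a combining mark.
    if label and unicodedata.category(label[0]) in ("Mn", "Mc"):
        raise IdnaprepError("label starts with a combining mark")
    # Step 5: bidi requirements apply only if an RCat character is present.
    bidi = [unicodedata.bidirectional(ch) for ch in label]
    if any(b in RCAT for b in bidi):
        end = len(bidi)
        while end > 0 and bidi[end - 1] == "NSM":
            end -= 1  # trailing NSMCat characters are allowed after the last RCat
        if "L" in bidi or bidi[0] not in RCAT or bidi[end - 1] not in RCAT:
            raise IdnaprepError("label violates the bidi requirements")
    return label
```

Raising an exception instead of returning a value mirrors the rule that idnaprep cannot both emit a string and return an error.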

Many protocols and applications are likely to perform additional processing steps prior to idnaprep. For example, it is possible to use Unicode normalization form NFKC to filter out compatibility characters and to perform case folding to lower case characters. Because case folding is a common pre-processing step in the context of idnaprep, it is documented in appendix C.

[IDNABis] (Klensin, J., “Proposed Issues and Changes for IDNA - An Overview,” February 2007.) introduced various terms concerning IDN labels that are useful in the context of idnaprep.




3.  Idnaprep character repertoire

The character repertoire usable in the context of Internationalized Domain Names is a subset of the Unicode repertoire. Registered idnaprep profiles are linked to Unicode versions and MUST specify the restricted repertoires they use. An error is returned by idnaprep if characters outside those repertoires are used. Appendix B specifies the profile for Unicode 5.0 [Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.).

The Unicode code points corresponding to a specific version can be classified in four distinct categories in the IDN context:

  1. code points corresponding to characters used for Registration, referred to as "IDNA200X-Permitted".
  2. code points corresponding to characters that may be used in the future for Registration but are not allowed at the present time, referred to as "IDNA200X-Possible".
  3. code points corresponding to characters that will never be allowed to be used for Registration, referred to as "IDNA200X-Never".
  4. code points that are either unassigned or will never be assigned to coded characters, referred to as "unassigned".

When IDN processing is done for Registration, the repertoire corresponding to "IDNA200X-Permitted" MUST be used.

When IDN processing is done for Domain Name Resolution, in addition to "IDNA200X-Permitted", the repertoire corresponding to "IDNA200X-Possible" MAY also be used. The repertoires usable in the contexts of Registration or Domain Name Resolution are qualified as 'restricted' repertoires in this document.

Characters categorized as "IDNA200X-Never" and "unassigned" MUST NOT be used.
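
The four categories and the rules above can be sketched in Python as follows. Because this draft does not yet specify the repertoires, the permitted and possible sets are parameters supplied by the caller; the names Category, classify, and is_usable are hypothetical. Only the "unassigned" test relies on real data, using the General_Category value "Cn" as described in appendix A (from the Unicode version bundled with the Python interpreter, not necessarily 5.0).

```python
import unicodedata
from enum import Enum
from typing import AbstractSet


class Category(Enum):
    PERMITTED = "IDNA200X-Permitted"
    POSSIBLE = "IDNA200X-Possible"
    NEVER = "IDNA200X-Never"
    UNASSIGNED = "unassigned"


def classify(ch: str,
             permitted: AbstractSet[str],
             possible: AbstractSet[str]) -> Category:
    """Classify one character against caller-supplied repertoires."""
    if unicodedata.category(ch) == "Cn":
        return Category.UNASSIGNED
    if ch in permitted:
        return Category.PERMITTED
    if ch in possible:
        return Category.POSSIBLE
    return Category.NEVER


def is_usable(category: Category,
              for_registration: bool,
              accept_possible: bool = False) -> bool:
    """Apply the Registration and Domain Name Resolution rules."""
    if category is Category.PERMITTED:
        return True
    if category is Category.POSSIBLE:
        # MAY be used for resolution only; never for registration.
        return accept_possible and not for_registration
    # "IDNA200X-Never" and "unassigned" MUST NOT be used.
    return False
```

The accept_possible flag reflects that use of "IDNA200X-Possible" during resolution is a MAY, so an implementation can decline it.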

Editor's note: the current draft of this document does not yet specify these idnaprep Unicode repertoires in full detail in appendix A. It is expected that further developments made in [IDNARepertoire] (Faltstrom, P., “The Unicode Codepoints and IDN,” October 2006.) or similar initiatives such as in http://www.unicode.org/~whistler/IDNPermitted.txt and http://www.unicode.org/~whistler/IDNNever.txt will result in new Unicode character properties determining the character set suitable for idnaprep. When available, a future version of this document will reference those new properties. However, this document assumes that the repertoires are mostly restricted to letters (including syllables and ideographs) and digits belonging to modern-use scripts, and that they contain few (if any) symbol or punctuation characters besides U+002D HYPHEN-MINUS. This document refers to these properties as 'IDN_Permitted' and 'IDN_Possible'.

Editor's note: The repertoires are built by using the list of all characters that meet the following various criteria:

However, none of these rules needs to be exposed in this document; only the result matters, namely whether or not a character is part of a restricted repertoire.

Note that some characters may be added to the IDN_Permitted property in the future, either by being moved from the "IDNA200X-Possible" repertoire, or when new characters are added to the Unicode Standard. This would be the case, for example, when additional minority scripts are added to the standard. However, the maintenance of the "IDN_Permitted" property is bound by the stability guarantee that once a character is assigned that property, the property can never be removed from the character. In other words, the "IDNA200X-Permitted" repertoire may grow, but once a character is in the repertoire, it can never be removed. Finally, the "IDN_Never" property is bound by the stability guarantee that once a character is IDN_Never, it can never end up in "IDNA200X-Possible" or "IDNA200X-Permitted".




4.  Mapping of Joiner and Non Joiner characters

Among the character repertoire allowed for idnaprep, two characters have a special status. These are U+200C ZERO WIDTH NON-JOINER (ZWNJ) and U+200D ZERO WIDTH JOINER (ZWJ). Although they are allowed in the restricted repertoires, idnaprep removes them from the output in most cases because they carry no meaning in these cases.

However, because in certain languages these two characters may carry meaning, there is a special rule preserving the ZWNJ or ZWJ in the following contexts:

ZWNJ breaking a cursive connection between Arabic Characters--
An "Arabic" "Right-Joining" character, followed by zero or more "Transparent" characters, followed by a ZWNJ, followed by zero or more "Transparent" characters, followed by a "Left-Joining" character.
ZWNJ used in a conjunct context--
A "Letter" of a conjunct forming script, followed by zero or more "Combining Marks", followed by a "Virama", followed by zero or more "Combining Marks", followed by a ZWNJ, followed by zero or more "Combining Marks", followed by a "Letter" of the same script.
ZWJ used in a conjunct context--
A "Letter" of a conjunct forming script, followed by zero or more "Combining Marks", followed by a "Virama", followed by zero or more "Combining Marks", followed by a ZWJ.

Typical examples of conjunct forming script include the Indic scripts. These contexts imply that a single "script" is used within the expression, excluding the "Combining Marks" and "Viramas" which may use "Common" or "Inherited" script values.

Editor's note: The Unicode Consortium is currently investigating the creation of Indic derived properties that will make the formulation of the Joiner restriction much simpler. See http://www.unicode.org/review/pr-96.html for details on the current status.

The character properties: "Arabic", "Right-Joining", "Transparent", "Left-Joining", "Combining Marks", "Letter", "Virama", and "script" are defined in appendix A.
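
The two conjunct contexts can be sketched in Python with the standard unicodedata module. This is a partial illustration only, under the following assumptions: the function names are hypothetical; the "same script" condition is not checked, because Python's standard library does not expose the "script" property; and the Arabic cursive-connection context is not implemented, because the Joining Type data from "ArabicShaping.txt" is not available there either. As a result, this sketch drops a ZWNJ that the Arabic rule would preserve.

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"
LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo"}
MARK = {"Mn", "Mc"}


def _is_mark(ch: str) -> bool:
    return unicodedata.category(ch) in MARK


def _is_letter(ch: str) -> bool:
    return unicodedata.category(ch) in LETTER


def _virama_context_before(text: str, i: int) -> bool:
    """True if text[:i] ends with Letter, Marks*, Virama, Marks*."""
    j = i - 1
    seen_virama = False
    # A virama is itself a combining mark, so scan the whole run of marks.
    while j >= 0 and _is_mark(text[j]):
        seen_virama = seen_virama or unicodedata.combining(text[j]) == 9
        j -= 1
    return seen_virama and j >= 0 and _is_letter(text[j])


def _letter_after(text: str, i: int) -> bool:
    """True if text[i+1:] starts with Marks*, Letter."""
    j = i + 1
    while j < len(text) and _is_mark(text[j]):
        j += 1
    return j < len(text) and _is_letter(text[j])


def map_joiners(text: str) -> str:
    """Keep ZWJ/ZWNJ only in the conjunct contexts; drop them elsewhere."""
    out = []
    for i, ch in enumerate(text):
        if ch == ZWJ:
            keep = _virama_context_before(text, i)
        elif ch == ZWNJ:
            keep = _virama_context_before(text, i) and _letter_after(text, i)
        else:
            keep = True
        if keep:
            out.append(ch)
    return "".join(out)
```

For example, the Devanagari sequence <U+0915, U+094D, U+200D> keeps its ZWJ, while the ZWJ in <U+0061, U+200D, U+0062> is removed.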




5.  Normalization

The output of the mapping step is normalized using the Unicode normalization with form C, as described in [UAX15] (Davis, M. and M. Duerst, “Unicode Normalization Forms,” October 2006.). This step is always successful, in other words it never returns an error.

Note that although the restricted character repertoires are stable through normalization, the normalization step is still necessary. There are three reasons for this requirement:

  1. combining sequences that are made of elements of the restricted repertoires are normalized into composite characters (for example, <U+0061, U+0301> (LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT) becomes U+00E1 LATIN SMALL LETTER A WITH ACUTE),
  2. combining mark re-ordering,
  3. Hangul Jamos/syllables composition.
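
The three effects listed above can be observed with Python's standard unicodedata module, as in the short sketch below. The helper name nfc is hypothetical. Python applies the Unicode version bundled with the interpreter rather than Unicode 5.0; the examples use long-established characters for which this should make no difference.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize a string to Unicode normalization form C."""
    return unicodedata.normalize("NFC", text)


# 1. Composition: <U+0061, U+0301> becomes U+00E1.
composed = nfc("a\u0301")

# 2. Combining mark re-ordering: acute (U+0301) and dot below (U+0323)
#    entered in either order normalize to the same string.
order_1 = nfc("a\u0301\u0323")
order_2 = nfc("a\u0323\u0301")

# 3. Hangul composition: the jamos <U+1112, U+1161, U+11AB> become the
#    single syllable U+D55C.
syllable = nfc("\u1112\u1161\u11ab")

print(composed == "\u00e1", order_1 == order_2, syllable == "\ud55c")
```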

The restricted repertoires are designed in a way that all versions of the Unicode normalization form C starting from Unicode 5.0 will provide the same result for those repertoires.

The restricted repertoires of an idnaprep profile cannot contain any character that changes value when normalized to normalization form C by itself. Additions to the restricted repertoires in idnaprep profiles for future Unicode versions MUST NOT include any character that changes value when normalized using NFC.
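
A profile author could verify this constraint mechanically. The sketch below is one possible check, with hypothetical function names; it tests each candidate character in isolation, using the NFC implementation of the Unicode version bundled with the Python interpreter.

```python
import unicodedata
from typing import Iterable, List


def is_nfc_stable(ch: str) -> bool:
    """True if the single character is unchanged by NFC."""
    return unicodedata.normalize("NFC", ch) == ch


def unstable_members(repertoire: Iterable[str]) -> List[str]:
    """Return the candidate characters that would violate the constraint."""
    return sorted(ch for ch in repertoire if not is_nfc_stable(ch))


# U+212B ANGSTROM SIGN normalizes to U+00C5, so it could not be part of a
# restricted repertoire; U+00E1 is stable.
print(unstable_members(["\u00e1", "\u212b"]))
```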

Idnaprep itself references the version of the normalization form C defined by Unicode 5.0 [UAX15] (Davis, M. and M. Duerst, “Unicode Normalization Forms,” October 2006.). Accordingly, idnaprep profiles MUST NOT reference NFC themselves.

Note that other string preparations use the Unicode normalization form KC (NFKC), which maps many "compatibility characters" to their equivalent character. However, because the restricted repertoires are stable through normalization (i.e. NFKC(cp)=cp), in other words because they exclude compatibility characters that could be mapped by NFKC, it is unnecessary to use NFKC for the normalization of the string in the context of idnaprep. This choice of NFC is also consistent with the recommendation for Internationalized Resource Identifiers (IRIs) [RFC3987] (Duerst, M. and M. Suignard, “Internationalized Resource Identifiers (IRIs),” January 2005.). However, an application may still use NFKC to filter user input before applying idnaprep.




6.  Combining Marks

Combining marks constitute a special character class: a combining mark typically combines with the nearest preceding non-combining character, together with any combining marks in between. See the Unicode Standard [Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.) for further details.

It is undesirable to have a combining mark appear as the first character of a string because it may combine with a character preceding the string, and which is therefore out of context.

A string MUST NOT start with a character having the "Combining Mark" property (see appendix A). An error is returned by idnaprep if this requirement is not satisfied.
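
A minimal sketch of this check in Python follows, using the General_Category values "Mn" and "Mc" given in appendix A for the "Combining Mark" property. The function name is hypothetical, and the property values come from the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def starts_with_combining_mark(label: str) -> bool:
    """True if the first character has General_Category Mn or Mc."""
    return bool(label) and unicodedata.category(label[0]) in ("Mn", "Mc")


# COMBINING ACUTE ACCENT in first position is an error;
# the same mark after a base letter is not.
print(starts_with_combining_mark("\u0301a"), starts_with_combining_mark("a\u0301"))
```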




7.  Bidirectional Characters

Most characters are displayed from left to right, but some are displayed from right to left. This feature of Unicode is called "bidirectional layout", or "bidi" for short. The Unicode Standard has an extensive discussion of how to reorder glyphs for display when dealing with bidirectional text such as Arabic or Hebrew. See the Unicode Bidirectional Algorithm [UAX9] (Davis, M., “The Bidirectional Algorithm,” September 2006.) for more information. In particular, all Unicode text is stored in logical order.

[Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.) defines several bidirectional categories; each character has one bidirectional category assigned to it. For the purposes of the requirements below, three categories are used:

RCat character
Characters belonging to right-to-left scripts such as Hebrew, Arabic, Thaana, etc.
LCat character
Characters belonging to left-to-right scripts such as Latin, Greek, Cyrillic, etc.
NSMCat
Combining marks.

The character properties: "RCat", "LCat", and "NSMCat" are defined in appendix A.

The Unicode Bidirectional Algorithm [UAX9] (Davis, M., “The Bidirectional Algorithm,” September 2006.) can result in various rearrangements of characters according to their direction. To prevent characters from rearranging across field boundaries, the following three requirements MUST be met. An error is returned by idnaprep if these requirements are not satisfied.

a. The string MUST NOT contain any "RCat" character,
b. Or if it does, the string must satisfy all of these requirements:
  1. The string MUST NOT contain any "LCat" character,
  2. The string MUST start with an "RCat" character,
  3. The string MUST either end with an "RCat" character, or end with an "RCat" character followed by a sequence of "NSMCat" characters.

Note that requirement 3 prohibits strings such as <U+0627, U+0031> ("aleph 1") but allows strings such as <U+0627, U+0031, U+0628> ("aleph 1 beh"), and <U+078B, U+07A8, U+0788, U+07AC, U+0780, U+07A8> ("Divehi" in Thaana script, ending with an "NSMCat" character). [UAX9] (Davis, M., “The Bidirectional Algorithm,” September 2006.) goes into great detail about the display order of strings that contain particular categories of characters in particular sequences.
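
The requirements above can be sketched in Python using the Bidi_Class values listed in appendix A ("R" or "AL" for RCat, "L" for LCat, "NSM" for NSMCat). The function name check_bidi is hypothetical, and the property values come from the Unicode version bundled with the Python interpreter rather than Unicode 5.0.

```python
import unicodedata

RCAT = ("R", "AL")


def check_bidi(label: str) -> bool:
    """Return True if the label satisfies the bidi requirements."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    # Requirement a: a label without any RCat character is acceptable.
    if not any(c in RCAT for c in classes):
        return True
    # Requirement b.1: no LCat character may be present.
    if "L" in classes:
        return False
    # Requirement b.2: the label must start with an RCat character.
    if classes[0] not in RCAT:
        return False
    # Requirement b.3: the label must end with an RCat character,
    # optionally followed by a sequence of NSMCat characters.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    return end > 0 and classes[end - 1] in RCAT


print(check_bidi("\u0627\u0031"))        # "aleph 1"
print(check_bidi("\u0627\u0031\u0628"))  # "aleph 1 beh"
```

With the three example strings from the note above, this sketch rejects the first and accepts the other two.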




8.  idnaprep profiles

When a new version of the Unicode repertoire appears, idnaprep SHOULD NOT be updated. Instead a new profile MAY be registered with IANA to describe the new associated restricted repertoires corresponding to the characters with IDN_Permitted and IDN_Possible property values. The restricted repertoire corresponding to IDN_Permitted MUST NOT be reduced by a new profile.

Any other change not related to code point reassignment, such as a change to the combining mark or bidirectional checking steps, may introduce backward compatibility issues that could require a new version of idnaprep, not just a new profile. As such, these changes should be avoided as much as possible.




9.  Security Considerations

Idnaprep is used with Unicode characters. There are security considerations that are specific to idnaprep, and others that are generic to using Unicode.




9.1.  Idnaprep-specific security considerations

The Unicode repertoire has many characters that look similar. In many cases, users of security protocols might do visual matching, such as when comparing the names of trusted third parties. Because it is impossible to map similar-looking characters without a great deal of context such as knowing the fonts used, idnaprep does nothing to map similar-looking characters together nor to prohibit some characters because they look like others. User applications can help disambiguate some similar-looking characters by showing the user when the script changes within a string.

Note, however, that the repertoire restriction significantly reduces the risk created by similar-looking characters.




9.2.  Generic Unicode security considerations

Protocols that use idnaprep usually also use encodings of Unicode, such as UTF-8 or UTF-16. Some applications using those encodings have been known to not check for ill-formed sequences in the encodings, and thereby have not detected sequences of octets that would have been detected if they used just ASCII. For example, in UTF-8 the octet sequence "0xC0 0xAB" is an ill-formed sequence for U+002B (plus sign). All programs MUST reject any string that is an ill-formed octet sequence for the encoding being used.
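
As an illustration, Python's built-in UTF-8 decoder rejects the ill-formed sequence mentioned above when used in strict mode. The function names below are hypothetical; a protocol implementation would apply an equivalent check at the point where octets are first decoded.

```python
def decode_utf8_strict(octets: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on ill-formed input."""
    # errors="strict" is the default; it is spelled out here for emphasis.
    return octets.decode("utf-8", errors="strict")


def is_well_formed_utf8(octets: bytes) -> bool:
    """True if the octets form a well-formed UTF-8 sequence."""
    try:
        decode_utf8_strict(octets)
    except UnicodeDecodeError:
        return False
    return True


# 0xC0 0xAB is an overlong, ill-formed encoding of U+002B; the single
# octet 0x2B is the only well-formed UTF-8 encoding of that character.
print(is_well_formed_utf8(bytes([0xC0, 0xAB])), is_well_formed_utf8(b"\x2b"))
```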

Both Unicode normalization and conversion between Unicode encodings can cause strings to grow or shrink. Programs that use fixed-size buffers, or that make assumptions that buffers will always be greater than or less than particular sizes, are likely to fail in insecure fashions when using Unicode normalization or encoding conversions.
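
The following Python sketch shows both effects on small inputs. The helper names are hypothetical; lengths are counted in code points for normalization and in octets for encodings, and normalization uses the Unicode version bundled with the Python interpreter.

```python
import unicodedata
from typing import Tuple


def nfc_lengths(text: str) -> Tuple[int, int]:
    """Length in code points before and after NFC."""
    return len(text), len(unicodedata.normalize("NFC", text))


def encoded_lengths(text: str) -> Tuple[int, int]:
    """Length in octets as UTF-8 and as UTF-16 (little endian, no BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# NFC can shrink a string: <U+0061, U+0301> composes to one code point.
print(nfc_lengths("a\u0301"))
# NFC can also grow a string: U+0958 is excluded from composition and
# normalizes to <U+0915, U+093C>.
print(nfc_lengths("\u0958"))
# Encoding conversion changes size in either direction.
print(encoded_lengths("a"), encoded_lengths("\u4e2d"))
```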

Covering an extensive list of security threats and considerations on the use of current and future versions of Unicode is outside of the scope of this document. Additional considerations are available in [UTR36] (Davis, M. and M. Suignard, “Unicode Security Considerations,” August 2006.) and [UTS39] (Davis, M. and M. Suignard, “Unicode Security Mechanisms,” August 2006.).




10.  IANA Considerations

Idnaprep versions MUST have IETF consensus as described in [RFC2434] (Narten, T. and H. Alvestrand, “Guidelines for Writing an IANA Considerations Section in RFCs,” October 1998.). Each of its profiles MUST be reviewed by the IESG before it is registered. The IESG MAY change a profile before registration.

IANA has set up a registry of idnaprep profiles. This registry is a single text file that lists the known profiles. Each entry in the registry has three fields:

Each profile will remain listed in the registry forever. That is, if a newer profile of idnaprep is created, both profiles will continue to be listed in the registry, but the current version indicator will be turned off for the earlier profile and turned on for the newer profile.

To improve the stability of IDN, new profiles SHOULD be created sparingly, only when Unicode repertoire additions justify character addition in the idnaprep restricted repertoire.




11.  Acknowledgements

Above all, this document uses large fragments of text from stringprep and nameprep, which were authored by Marc Blanchet and Paul Hoffman. This has vastly simplified the authoring of the terminology and the creation of the overall structure of this document. Many ideas and principles from these documents were also preserved. As such they deserve a large part of the credit for this new document.

The structure and principles in this document have also benefited from detailed discussions and feedback from Harald Tveit Alvestrand, Tina Dam, Patrik Faltstrom, Cary Karp, John C Klensin, and Ken Whistler.




Appendix A.  Unicode database references

This appendix references Unicode character properties from the Unicode Character Database [UCD] (The Unicode Consortium, “Unicode Character Database,” July 2006.) which are required by this document. All subsequent references to ".txt" file names imply that these files belong to the Unicode Character Database [UCD] (The Unicode Consortium, “Unicode Character Database,” July 2006.). The following table describes all Unicode character properties referenced by this document:

   Property name    Description
   script           The "script" property is a string value associated with each character and is determined by "Scripts.txt"
   Arabic           Character with "Arabic" "script" value
   Right-Joining    Character with "R" Joining Type as specified by "ArabicShaping.txt"
   Transparent      Character with "T" Joining Type as specified by "ArabicShaping.txt"
   Left-Joining     Character with "L" Joining Type as specified by "ArabicShaping.txt"
   Letter           Character with General_Category value of "Lu", "Ll", "Lt", "Lm", or "Lo" as specified in "UnicodeData.txt"
   Combining Mark   Character with General_Category value of "Mc" or "Mn" as specified in "UnicodeData.txt"
   Virama           Character with Canonical_Combining_Class value equal to "9" as specified in "UnicodeData.txt"
   RCat             Character with Bidi_Class value of "R" or "AL" as specified in "UnicodeData.txt"
   LCat             Character with Bidi_Class value of "L" as specified in "UnicodeData.txt"
   NSMCat           Character with Bidi_Class value of "NSM" as specified in "UnicodeData.txt"
   unassigned       Code point with General_Category value of "Cn" in "UnicodeData.txt"
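
Several of these properties can be queried directly from Python's standard unicodedata module, as sketched below. The function names are hypothetical. The "script" property and the Joining Type values are not exposed by that module and would have to be loaded from "Scripts.txt" and "ArabicShaping.txt". The module also reflects the Unicode version bundled with the interpreter (see unicodedata.unidata_version), not necessarily Unicode 5.0.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt", "Lm", "Lo")


def is_combining_mark(ch: str) -> bool:
    return unicodedata.category(ch) in ("Mc", "Mn")


def is_virama(ch: str) -> bool:
    # Canonical_Combining_Class value 9
    return unicodedata.combining(ch) == 9


def is_rcat(ch: str) -> bool:
    return unicodedata.bidirectional(ch) in ("R", "AL")


def is_lcat(ch: str) -> bool:
    return unicodedata.bidirectional(ch) == "L"


def is_nsmcat(ch: str) -> bool:
    return unicodedata.bidirectional(ch) == "NSM"


def is_unassigned(ch: str) -> bool:
    return unicodedata.category(ch) == "Cn"
```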

Editor's note: If Unicode character properties are created to reference the restricted repertoires, they could be added to the table above as "IDN_Permitted" and "IDN_Possible" and specified as characters included in "IDNPermitted.txt" and "IDNPossible.txt" respectively.




Appendix B.  Idnaprep Unicode 5.0 profile

The Idnaprep Unicode 5.0 profile is specified by using the Unicode 5.0 version of the Unicode Character Database [UCD] (The Unicode Consortium, “Unicode Character Database,” July 2006.) for all references described in appendix A.




Appendix C.  Case folding

This appendix describes the optional case folding that may be performed by applications and protocols prior to idnaprep. The case folding specified by this appendix is culturally insensitive because there is no mechanism in the IDN context to convey a cultural parameter. Furthermore, performing language-specific case mappings on IDN labels could result in different resolutions for the same input. For example, the string "III", if lowercased by Turkish casing rules, would result in a different U-label than if lowercased by English casing rules.

The mapping process is not recursive. That is, if character A at position X is mapped to character B, character B which is now at position X is not mapped again.

The case folding process maps uppercase to lower case characters according to the Case Mapping process specified by the "CaseFolding.txt" file in the Unicode Standard [Unicode] (The Unicode Consortium, “The Unicode Standard Version 5.0,” October 2006.). That file specifies a case folding string for each code point that has a case folding.

The entries in the file are in the following machine-readable format:

   <code>; <status>; <mapping>; # <name>

The code points represented in the "code" field are mapped into the "mapping" with rules established by the "status" field.

The status field has the following interpretation:

   Status field    Description
   C               common case folding, common mappings shared by both simple and full mappings,
   F               full case folding, mappings that cause strings to grow in length. Multiple characters are separated by spaces,
   S               simple case folding, mappings to single characters where different from F,
   T               special cases.

Foldings with status field values (S) and (T) are not used by this document. Common (C) and Full (F) entries are mutually exclusive. The mapping is done using one of the following steps:

  1. For each code point with a (C) case folding, replace the entry code point with its common case folding value.
  2. For each code point with a (F) case folding, replace the entry code point with its full case folding value.
  3. Other code points map to themselves.
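
The procedure above can be sketched in Python as follows. The function names are hypothetical, and SAMPLE holds only a handful of entries written in the file format described above, for illustration; a real implementation would read the complete "CaseFolding.txt" file of the relevant Unicode version.

```python
from typing import Dict, Iterable

# A few entries in the "CaseFolding.txt" format, for illustration only.
SAMPLE = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
0053; C; 0073; # LATIN CAPITAL LETTER S
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def load_foldings(lines: Iterable[str]) -> Dict[str, str]:
    """Build a mapping from the (C) and (F) entries; skip (S) and (T)."""
    table: Dict[str, str] = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        if status in ("C", "F"):
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


def fold(text: str, table: Dict[str, str]) -> str:
    """Apply the folding once per input code point (not recursive)."""
    return "".join(table.get(ch, ch) for ch in text)


table = load_foldings(SAMPLE.splitlines())
print(fold("III", table))  # iii
```

Because the (T) entries are skipped, "III" folds to "iii" regardless of locale. Python's built-in str.casefold() performs a comparable full case folding using the Unicode version bundled with the interpreter.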



12.  References




12.1. Normative References

[RFC2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997.
[UAX15] Davis, M. and M. Duerst, “Unicode Normalization Forms,” Unicode Standard Annex #15, October 2006.
[UAX9] Davis, M., “The Bidirectional Algorithm,” Unicode Standard Annex #9, September 2006.
[UCD] The Unicode Consortium, “Unicode Character Database,” July 2006.
[Unicode] The Unicode Consortium, “The Unicode Standard Version 5.0,” Addison-Wesley, Reading, MA, October 2006.



12.2. Informative References

[CharModel] Whistler, K., Davis, M., and A. Freytag, “Character Encoding Model,” Unicode Technical Report #17, September 2004.
[Glossary] The Unicode Consortium, “Unicode Glossary,” September 2006.
[IDNABidi] Alvestrand, H. and C. Karp, “An IDNA problem in right-to-left scripts,” Internet-Draft, October 2006.
[IDNABis] Klensin, J., “Proposed Issues and Changes for IDNA - An Overview,” Internet-Draft, February 2007.
[IDNARepertoire] Faltstrom, P., “The Unicode Codepoints and IDN,” Internet-Draft, October 2006.
[ISO10646] International Organization for Standardization, “Information Technology - Universal Multiple-Octet Coded Character Set (UCS),” ISO Standard 10646-1, with amendments 1 and 2, 2003.
[RFC2434] Narten, T. and H. Alvestrand, “Guidelines for Writing an IANA Considerations Section in RFCs,” BCP 26, RFC 2434, October 1998.
[RFC3454] Hoffman, P. and M. Blanchet, “Preparation of Internationalized Strings ("stringprep"),” RFC 3454, December 2002.
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, “Internationalizing Domain Names in Applications (IDNA),” RFC 3490, March 2003.
[RFC3491] Hoffman, P. and M. Blanchet, “Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN),” RFC 3491, March 2003.
[RFC3987] Duerst, M. and M. Suignard, “Internationalized Resource Identifiers (IRIs),” RFC 3987, January 2005.
[UTR36] Davis, M. and M. Suignard, “Unicode Security Considerations,” Unicode Technical Report #36, August 2006.
[UTS39] Davis, M. and M. Suignard, “Unicode Security Mechanisms,” Unicode Technical Standard #39, August 2006.



Authors' Addresses

  Michel Suignard (editor)
  Microsoft Corporation
  One Microsoft Way
  Redmond, WA 98052
  U.S.A.
Phone:  +1 425 882-8080
Email:  michelsu@microsoft.com
URI:  http://www.suignard.com
  
  Mark Davis
  Google
  U.S.A.
Email:  mark.davis@macchiato.com or mark.davis@google.com
  
  Asmus Freytag
  ASMUS Inc.
  U.S.A.
Email:  asmus@unicode.org
URI:  http://home.ix.netcom.com/~asmus-inc/



Full Copyright Statement