L2/00-052

IETF IDN Working Group
James Seng
22nd Feb 2000
Expires 22nd Aug 2000

Requirements of Internationalized Domain Names

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

This document describes the requirements for encoding international characters into DNS names and records. It is intended as guidance for developing protocols for internationalized domain names.

1. Introduction

At present, the encoding of Internet domain names is restricted to a subset of 7-bit ASCII (ISO/IEC 646). HTML, XML, IMAP, FTP, and many other text-based items on the Internet have already been internationalized. It is important for domain names to be similarly internationalized.

This document is being discussed on the "idn" mailing list. To join the list, send a message to with the words "subscribe idn" in the body of the message. Archives of the mailing list can also be found at ftp://ops.ietf.org/pub/lists/idn*.

1.1 Definitions and Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

"IDN" is used in this document as an abbreviation for "internationalized domain name".
This is defined as a domain name that contains one or more characters that are outside the set of characters specified as legal characters for domain names in [RFC1034] Section 3.5.

A master server for a zone holds the main copy of that zone. This copy is sometimes stored in a zone file. A slave server for a zone holds a complete copy of the records for that zone. A caching server holds temporary copies of DNS records; it uses records to answer queries about domain names. Further explanation of these terms can be found in [RFC1034] and [RFC1996].

Characters mentioned in this document are identified by their position in the Unicode character set. The notation U+12AB, for example, indicates the character at position 12AB (hexadecimal) in the Unicode character set. Note that the use of this notation is not an indication of a requirement to use Unicode.

Examples quoted in this document should be considered as a method to further explain the meanings and principles adopted by the document. It is not a requirement for the protocol to satisfy the examples.

A character is a member of a set of elements used for organization, control, or representation of data.

A coded character is a character with its coded representation.

A coded character set ("CCS") is a set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation.

A graphic character or glyph is a character, other than a control function, that has a visual representation normally handwritten, printed, or displayed.

A character encoding scheme or "CES" is a mapping from one or more coded character sets to a set of octets. Some CESs are associated with a single CCS; for example, UTF-8 applies only to ISO 10646. Other CESs, such as ISO 2022, are associated with many CCSs.
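The distinction between a CCS and a CES can be made concrete with a small sketch. The Python fragment below is illustrative only and is not part of any proposed protocol; the function name describe is hypothetical. It shows one character's position in the coded character set (the U+ notation used in this document) next to the octets that two different encoding schemes produce for it.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the CCS view (code point, name) and two CES views (octets) of one character."""
    return {
        # Position in the coded character set, in the U+12AB notation.
        "notation": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        # The same coded character under two character encoding schemes.
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
    }


if __name__ == "__main__":
    # U+00E9 is a single coded character but different octet sequences per CES.
    print(describe("\u00e9"))
    # An ASCII letter keeps its single-octet value 0x41 in UTF-8.
    print(describe("A"))
```

The point of the sketch is that the code point stays fixed while the octets depend on the CES chosen, which is why the protocol has to state both.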
A charset is a method of mapping a sequence of octets to a sequence of abstract characters. A charset is, in effect, a combination of one or more CCSs with a CES. Charset names are registered by the IANA according to procedures documented in RFC 2278.

A language is a way that humans interact. In written form, a language is expressed in characters. The same set of characters can often be used in many languages, and many languages can be expressed using different scripts. A particular charset may have different glyphs (shapes) depending on the language being used.

2. General Requirements

2.1 Compatibility and Interoperability

The DNS is essential to the entire Internet. Therefore, the protocol must not damage present DNS interoperability. It must make the minimum number of changes to existing protocols on all layers of the stack. It must continue to allow any system anywhere to resolve any domain name.

The protocol must preserve the basic concept and facilities of domain names as described in [RFC1034]. It must maintain a single, global, universal, and consistent hierarchical namespace.

The same name resolution request must generate the same response, regardless of the location or localization settings in the resolver, in the master server, and in any slave or caching servers involved in the resolution process.

If the protocol allows more than one charset, it should also allow creation of caching servers that do not understand the charset in which a request or response is encoded. Such caching servers should work as well for IDNs as they do for current domain names.

The caching server performs correctly if it gives essentially the same answer (without the authoritative bit) as the master server would have given if presented with the same request.
A caching server must not return data in response to a query that would not have been returned if the same query had been presented to an authoritative server. This applies fully for the cases when:

- The caching server does not know about IDN
- The caching server implements the whole specification
- The caching server implements a legal subset of the specification

The protocol should be able to be upgraded at any time with new features and retain backwards compatibility with the current specification.

The protocol may modify the DNS protocol [RFC1035] and other related work undertaken by the DNSEXT WG. However, these changes should be as small as possible and any changes must be approved by the DNSEXT WG.

The protocol should be as simple as possible from the user's perspective. Ideally, users should not realize that IDN was added on to the existing DNS.

A fall-back strategy or mechanism based upon ASCII may be needed during a transition period during deployment and adoption of IDN. Therefore, if an encoding is not mapped into ASCII, then there should be an ASCII-only representation compatible with the current DNS, and there should be a way for a program to find the ASCII-only representation for an IDN.

The best solution is one that maintains maximum feasible compatibility with current DNS standards as long as it meets the other requirements in this document.

2.2 Internationalization

Internationalized characters must be allowed to be represented and used in DNS names and records. The protocol must specify what charset is used when resolving domain names and how characters are encoded in DNS records.

This document does not recommend any charset for I18N. If more than one charset is used in the protocol, then the protocol must specify all the charsets being used and for what purpose.
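To illustrate the ASCII fall-back requirement in Section 2.1, the following Python sketch shows one way a program could derive an ASCII-only representation of a name and recover the original from it. Everything in it is an assumption made for illustration: the "zz--" prefix, the hex-of-UTF-8 mapping, and the function names are hypothetical and are not proposed by this document. The sketch also ignores the 63-octet label limit of the current DNS.

```python
# Hypothetical marker for an encoded label; not defined by this document.
PREFIX = "zz--"


def to_ascii_label(label: str) -> str:
    """Return an ASCII-only form of one label (assumed scheme: prefix + hex of UTF-8)."""
    if label.isascii():
        return label  # ASCII labels pass through unchanged
    return PREFIX + label.encode("utf-8").hex()


def from_ascii_label(label: str) -> str:
    """Invert to_ascii_label; labels that are not in the assumed form are returned as-is."""
    if label[: len(PREFIX)].lower() != PREFIX:
        return label
    try:
        return bytes.fromhex(label[len(PREFIX):]).decode("utf-8")
    except ValueError:
        # Not valid hex or not valid UTF-8: treat it as an ordinary ASCII label.
        return label


def to_ascii_name(name: str) -> str:
    """Apply the mapping label by label; "." stays the only separator."""
    return ".".join(to_ascii_label(part) for part in name.split("."))


def from_ascii_name(name: str) -> str:
    return ".".join(from_ascii_label(part) for part in name.split("."))


if __name__ == "__main__":
    encoded = to_ascii_name("b\u00fccher.example")
    print(encoded)
    print(from_ascii_name(encoded) == "b\u00fccher.example")
```

A real encoding would need to be far more compact and would have to define how the reserved prefix interacts with existing ASCII names; the sketch only shows that the mapping can be mechanical and reversible.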
The CCS(s) chosen must at least cover the range of characters as currently defined (and as being added) by ISO 10646/Unicode.

The CES(s) chosen should not encode ASCII characters differently depending on the other characters in the string. In other words, ASCII characters should remain as specified in [US-ASCII].

The protocol must not invent a new CCS for the purpose of IDN only and should use an existing CES. The charset(s) chosen should also be non-ambiguous.

The protocol should not make any assumptions about where in a domain name internationalization might appear. In other words, it should not differentiate between any parts of a domain name, because this may impose a restriction on future internationalization efforts.

The protocol should also not make any localized restrictions in the protocol. For example, an IDN implementation which only allows domain names to use a single local script would immediately restrict multinational organizations.

Because of the wide range of devices that use the DNS and the wide range of characteristics of international scripts, the protocol should allow more than one method of domain name input and display. However, there has to be a single way of encoding an internationalized domain name within the core of the DNS.

2.3 Localization

The protocol must be able to handle the localized requirements of different languages. For example, IDN must be able to handle bidirectional writing for scripts such as Arabic.

Historically, "." has been the separator of labels in domain names. The protocol should not use different separators for different languages.

Most localization can be handled by the user interface. It should not matter how the domain names are input or presented, such as in a reverse order or bidirectional, or with the introduction of a new separator. However, the final wire format must be in canonical order.

2.4 Canonicalization

Defining matching rules is a complicated process for IDN.
Canonicalization of characters must follow precise and predictable rules to ensure consistency. [CHARREQ] is recommended as a guide on canonicalization.

The DNS has to match a domain name in a request with a domain name held in one or more zones. It also needs to sort names into order. It is expected that some sort of canonicalization algorithm will be used as the first step of this process. This section discusses some of the properties which will be required of that algorithm.

The canonicalization algorithm might specify operations for case, ligature, and punctuation folding.

In order to retain backwards compatibility with the current DNS, the protocol must retain the case-insensitive comparison for US-ASCII as specified in [RFC1035]. For example, Latin capital letter A (U+0041) must match Latin small letter A (U+0061). [UTR21] describes some of the issues with case mapping. Case folding must not be locale dependent. For example, Latin capital letter I (U+0049) case folded to lower case in the Turkish context will become Latin small letter dotless I (U+0131). But in the English context, it will become Latin small letter I (U+0069).

If other canonicalization is done, then it must be done before the domain name is resolved. Further, the canonicalization must be easily upgradeable as new languages and writing systems are added.

Any conversion (case, ligature folding, punctuation folding, ...) from what the user enters into a client to what the client asks for resolution must be done identically on all requests from any client.

If the protocol specifies a canonicalization algorithm, a caching server should perform correctly regardless of how much (or how little) of that algorithm it has implemented. [1 request to remove]

If the protocol requires a canonicalization algorithm, all requests sent to a caching server must already be in the canonical form.
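The case-folding requirements above can be sketched in Python. This is a minimal illustration under stated assumptions, not an algorithm selected by this document: the choice of Normalization Form C from [UTR15] combined with Unicode default case folding is an assumption, and the function names ascii_fold and canonicalize are hypothetical.

```python
import unicodedata


def ascii_fold(label: str) -> str:
    """Fold only A-Z to a-z, mirroring the case-insensitive comparison of the current DNS."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in label)


def canonicalize(label: str) -> str:
    """Assumed canonical form: locale-independent case folding, then NFC.

    str.casefold() uses the Unicode default mappings and never consults the
    process locale, so every client computes the same result.
    """
    return unicodedata.normalize("NFC", label.casefold())


def labels_match(a: str, b: str) -> bool:
    """Compare two labels after both have been put in the assumed canonical form."""
    return canonicalize(a) == canonicalize(b)


if __name__ == "__main__":
    # U+0041 must match U+0061.
    print(labels_match("A", "a"))
    # U+0049 folds to U+0069 regardless of locale; U+0131 stays distinct.
    print(canonicalize("I") == "\u0069", labels_match("I", "\u0131"))
    # Precomposed and decomposed forms of the same text compare equal.
    print(labels_match("\u00e9", "E\u0301"))
```

Because the folding is independent of locale, a client in a Turkish environment and one in an English environment would send the same canonical request, which is the property the text requires.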
The protocol should avoid inventing a new normalization form provided a technically sufficient one is available (such as in an ISO standard).

2.5 Operational Issues

Zone files should remain easily editable.

An IDN-capable resolver or server should not generate more traffic than a non-IDN-capable resolver or server would when resolving an ASCII-only domain name. The amount of traffic generated when resolving an IDN should be similar to that generated when resolving an ASCII-only name.

The protocol should add no new centralized administration for the DNS. A domain administrator should be able to create internationalized names as easily as adding current domain names.

Within a single zone, the zone manager must be able to define equivalence rules that suit the purpose of the zone, such as, but not limited to, and not necessarily, non-ASCII case folding, Unicode normalizations, Cyrillic/Latin folding, or traditional/simplified Chinese equivalence. Such defined equivalences must not remove equivalences that are assumed by (old or local-rule-ignorant) caches.

The character set of a signed zone file should be capable of being the same as the character set of the unsigned zone file. The protocol must allow offline DNSSEC signing. It should be possible to look at the signed file and see that it is the same as the unsigned one.

2.6 Others

The protocol may provide the same DNS resources using internationalized text as it currently provides using ASCII text.

To get full semantics for IDN, an upgrade of the DNS and related software may be needed.

3. Technical Analysis

There are many standard protocols and RFCs which depend on domain names and make various assumptions about the characters in them always conforming to [RFC1034]. We expect the protocols listed below to be affected:

<...list the sets of RFCs which we would like to have a summary...>

RFC821, RFC822, ...
All IDN protocol documents must fully detail the expected effects of leaking of the specified encoding to protocols other than the DNS resolution protocol. They must also contain a summary of the technical opinions of the IDN Working Group.

4. Security Considerations

Any solution that meets the requirements in this document must not be less secure than the current DNS. Specifically, the mapping of internationalized host names to and from IP addresses must have the same characteristics as the mapping of today's host names.

Specifying requirements for internationalized domain names does not itself raise any new security issues. However, any change to the DNS may affect the security of any protocol that relies on the DNS or on DNS names. A thorough evaluation of those protocols for security concerns will be needed when they are developed. In particular, IDNs must be compatible with DNSSEC.

5. References

[CHARREQ] "Requirements for string identity matching and String Indexing", http://www.w3.org/TR/WD-charreq, July 1998, World Wide Web Consortium.

[DNSEXT] "IETF DNS Extensions Working Group", namedroppers@internic.net, Olafur Gudmundsson, Randy Bush.

[RFC1034] "Domain Names - Concepts and Facilities", rfc1034.txt, November 1987, P. Mockapetris.

[RFC1035] "Domain Names - Implementation and Specification", rfc1035.txt, November 1987, P. Mockapetris.

[RFC1123] "Requirements for Internet Hosts -- Application and Support", rfc1123.txt, October 1989, R. Braden.

[RFC1996] "A Mechanism for Prompt Notification of Zone Changes (DNS NOTIFY)", rfc1996.txt, August 1996, P. Vixie.

[RFC2119] "Key words for use in RFCs to Indicate Requirement Levels", rfc2119.txt, March 1997, S. Bradner.

[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 3.0", ISBN 0-201-61633-5.
Described at http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

[US-ASCII] Coded Character Set -- 7-bit American Standard Code for Information Interchange, ANSI X3.4-1986.

[UTR15] "Unicode Normalization Forms", Unicode Technical Report #15, http://www.unicode.org/unicode/reports/tr15/, Nov 1999, M. Davis & M. Duerst, Unicode Consortium.

[UTR21] "Case Mappings", Unicode Technical Report #21, http://www.unicode.org/unicode/reports/tr21/, Dec 1999, M. Davis, Unicode Consortium.

Appendix A. Acknowledgements

The editor gratefully acknowledges the contributions of:

Harald Tveit Alvestrand
Martin Duerst
Patrik Faltstrom
Andrew Draper
Bill Manning
Paul Hoffman
James Seng
Randy Bush
Alan Barret
Olafur Gudmundsson
Karlsson Kent
Dan Oscarsson
J. William Semich
RJ Atkinson
Simon Josefsson
Ned Freed