Internet Draft                                            Dan Oscarsson
draft-oscarsson-i18ndns-00.txt                            Telia ProSoft
Updates: RFC 2181, 1035, 1034, 2535                    25 February 2000
Expires: 25 August 2000

           Internationalisation of the Domain Name Service

Status of this memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

There is a very strong world-wide desire to use characters other than
ASCII in the DNS, especially in domain names. This document updates
the Domain Name System (DNS) [RFC1035] in a way that is compatible
with the current DNS and specifies how international characters are
handled.

1. Introduction

There is an immediate need to use international (non-ASCII)
characters in the DNS. This means that extending the DNS protocol
itself is not an option, as that would take too long. Instead, the
current ASCII-only handling needs to be extended to non-ASCII in a
way that can be used without updating current software.

The basic handling of character data in the DNS has several
properties that need to be preserved:

- The DNS itself places only one restriction on the particular labels
  that can be used to identify resource records. That one restriction
  relates to the length of the label and the full name.
  The length of any one label is limited to between 1 and 63 octets.
  A full domain name is limited to 255 octets (including the
  separators). [RFC2181]

- Any binary string whatever can be used as the label of any resource
  record. Similarly, any binary string can serve as the value of any
  record that includes a domain name as some or all of its value
  (SOA, NS, MX, PTR, CNAME, and any others that may be added).
  Implementations of the DNS protocols must not place any
  restrictions on the labels that can be used. In particular, DNS
  servers must not refuse to serve a zone because it contains labels
  that might not be acceptable to some DNS client programs. [RFC2181]

- Names must be compared with case-insensitivity. [RFC1035]

- The original case should be preserved when possible as data is
  entered into the system. This also implies that responses should
  preserve case when possible. [RFC1035] Some of the reasons for this
  are:

  + Domain names are used for many purposes.

  + One is domain names where company names or trademarks could be
    used. Companies and trademarks very commonly use a combination of
    upper and lower case to enhance the image of the name. Many of
    them would prefer that when you, for example, look up the domain
    name for an IP address, the correct case is returned.

  + Another is the e-mail address defined in the SOA record. While
    many systems now do a case-insensitive comparison on the user
    name part of the e-mail address, there may still be those that
    do not. Here too, e-mail addresses can be made more readable by
    mixing upper and lower case.

  + If you look up a host name from an IP address you may want to use
    the host name to compare with other data. Many services under
    Unix do this, and many of them are not case-insensitive. So they
    need the correct case returned.

  + There may be other uses of domain names that require them to be
    unchanged.

- The characters in the ASCII character set must still be encoded as
  ASCII.
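As an illustration of the length and case properties listed above, the
following Python sketch encodes a name while enforcing only the two
length limits, and compares labels by folding ASCII letters alone. The
helper names (encode_name, ascii_lower, labels_equal) are hypothetical
and not taken from any specification, and counting the 255 octet limit
on the encoded form (length octets plus the terminating root label) is
an assumption of this sketch.

```python
MAX_LABEL_OCTETS = 63
MAX_NAME_OCTETS = 255


def encode_name(labels):
    """Encode a list of labels (bytes) as length-prefixed octets.

    Only the length limits are enforced; any octet values are accepted,
    and the case of each label is preserved exactly as entered.
    """
    wire = bytearray()
    for label in labels:
        if not 1 <= len(label) <= MAX_LABEL_OCTETS:
            raise ValueError("label must be 1..63 octets")
        wire.append(len(label))
        wire += label
    wire.append(0)  # the empty root label terminates the name
    if len(wire) > MAX_NAME_OCTETS:
        raise ValueError("name exceeds 255 octets")
    return bytes(wire)


def ascii_lower(label):
    """Fold only ASCII A-Z to a-z; every other octet is left untouched."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in label)


def labels_equal(a, b):
    """Compare two labels with ASCII case-insensitivity only."""
    return ascii_lower(a) == ascii_lower(b)
```

With this sketch, labels_equal(b"Example", b"eXAMPLE") is true, while
the UTF-8 octets for an upper case and a lower case non-ASCII letter
compare as different, which is the behaviour expected of software that
is not i18n aware.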
This document specifies the update needed to the DNS protocol, user
interface issues and the effect on other protocols. It is intended to
fulfil the requirements for internationalised domain names, which are
currently being worked on by the IDN working group.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].

2. The DNS Protocol

The DNS protocol is used when communicating between DNS servers and
other DNS servers or DNS clients. User interface issues like the
format of zone files or how to enter or display domain names are not
part of the protocol.

The update of the protocol defined here can be used immediately as it
is fully compatible with the DNS of today.

2.1 Internationalisation aware software

Internationalisation aware DNS software (i18n aware) is software that
follows the rules for handling international text as defined here.
Only i18n aware software will get all requirements fulfilled.

Referring to section 4.1.1 in [RFC1035] and section 6.1 in [RFC2535],
the DNS query/response format header is updated by allocating the
last unallocated bit in the header. This bit is defined to be zero in
old servers and resolvers. For a description of all fields, see the
sections in the above RFCs.

                                    1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      ID                       |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |QR|   Opcode  |AA|TC|RD|RA|IN|AD|CD|   RCODE   |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    QDCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ANCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    NSCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ARCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

I18n aware software identifies itself in a query or a response by
setting the IN bit in the DNS query/response format header. As this
bit is defined to be zero in old servers and resolvers, they identify
themselves as non-i18n aware.

I18n aware software MUST set the IN bit in both queries and
responses.

Note: EDNS [RFC2671] is not used for the following reasons:

- It should work with the current pre-i18n DNS software.

- There should be no additional requests needed to be sent for i18n
  aware software.

2.2 Character data

Character data needs to be able to represent as many of the world's
characters as possible while remaining compatible with ASCII. It must
also be well defined so that it can easily be handled, and it should
be compact, as only 63 octets are available for a label without an
extension of the protocol.

Therefore character data used in the DNS protocol MUST:

- Use ISO 10646 (UCS) [ISO10646] as coded character set.

- Be normalised using form C as defined in Unicode technical report
  #15 [UTR15].

- Be encoded using the UTF-8 [RFC2279] character encoding scheme.

The only exception to the above rules is in the interoperability with
non-i18n aware DNS software, as defined later.

2.2.1 Down coding

As a local character set may not support all of the characters of UCS
used internally in DNS, a way to encode unsupported characters into
the local character set is needed.
That way a domain name can be used even if the local character set
cannot represent all characters in a name. By setting the local
character set to ASCII we get domain names that are allowed in
non-i18n aware software.

This is done by down coding UTF-8 into the local character set, as
follows:

- If a character can be represented in the local character set, map
  it from UCS to the local character set.

- If a character cannot be represented in the local character set,
  map the UTF-8 octet sequence for the character to a hyphen ("-")
  followed by the hex code of each octet as two characters per octet.

- If down coding was needed because not all characters could be
  represented in the local character set, all original hyphens must
  be replaced by two hyphens ("--") and the entire string MUST end
  with a single hyphen.

Examples:

If we have the name Ab-årɷz, it is represented in DNS as UTF-8:

    (HEX) 41 62 2d c3 a5 72 c9 b7 7a

If the local character set is ISO 8859-1, the down coded name is:

    Ab--år-c9b7z-

If the local character set is ASCII, the down coded name is:

    Ab---c3a5r-c9b7z-

Note: In other formats like HTML, unsupported characters are handled
like &#number; (prefix, code point value and terminator). The above
format is chosen because it only needs a prefix (the length is
defined in the UTF-8 encoding, so a terminator is not needed) and can
easily be checked for a valid sequence.

2.2.2 Up coding

When character data is entered into i18n aware DNS software, it must
be up coded from the down coding format into UTF-8. A down coded name
is identified by a trailing hyphen.

When up coding, invalid UTF-8 sequences should be left as they are;
the name may be an old name with a trailing hyphen.

2.3 Domain name matching

One of the most difficult areas of internationalisation is deciding
which names are equivalent to one another. For ASCII this was easily
solved by case-insensitivity.
It is also easily solved for many other Latin based alphabets. But
when you look at the whole world you get a mixture of rules, some
conflicting, including case-insensitivity, half width/full width,
final/non-final forms and much more. This type of matching will be
called "equivalence matching" hereafter.

2.3.1 Equivalence matching rules

To compare two domain names, both names must first be mapped to a
format where all equivalent characters are mapped to one character,
so that the names can then be binary compared. This mapping is done
from the original UCS normalised form C format, by case folding to
lower case followed by additional normalisation and simplification.

Folding to lower case MUST be done by following the one to one
mapping as defined in the Unicode 3.0 Character Database [UDATA].
Additional folding will probably also be done, but this has not been
agreed on yet. For normalisation, Unicode 3.0 defines a normalisation
form KC [UTR15] that is a good start, but more is needed.

More about case folding to lower case is available in Unicode
Technical Report 21 [UTR21].

Additional folding, normalisation and simplification will be defined
here or in a separate document at a later stage.

Note: Turkish rules lower-case I to dotless i rather than to the
dotted i used in ASCII and in the case mapping above. Turkish names
containing dotless i will therefore always have to be entered in
lower case.

2.3.2 Matching of domain names in DNS servers

To be able to handle correct domain name matching in lookups, the
following MUST be followed by DNS servers:

- Do matching on authoritative data using the full name equivalence
  matching needed for the characters used in the data.

- On non-authoritative data, either do binary matching or
  case-insensitive matching on ASCII letters and binary matching on
  all others.

- Implement the equivalence matching rules as defined above.
  Local variations are not allowed.

The effect of the above is:

- Only servers handling authoritative data must implement equivalence
  matching of names, and they need only implement the subset needed
  for the subset of characters of UCS they support in their
  authoritative zones.

- It normally gives fast lookup because data is usually sent like:
  resolver <-> server <-> authoritative server. While full
  equivalence matching can be complex and CPU consuming, the server
  in the middle will do caching with only simple and fast binary
  matching. So complex matching rules should not slow down DNS very
  much.

2.4 Interoperability between i18n aware DNS software and non-i18n
    aware

While the current non-i18n aware DNS software MUST allow UTF-8
encoded domain names (if it follows RFC1035, 2181), a lot of software
using DNS may not (for example SMTP).

To avoid breaking all the old software that only expects or allows
ASCII in domain names, the following rules MUST be followed by an
i18n aware DNS server:

- A query with the IN bit set is assumed to be from i18n aware
  software.

- A query with domain names having valid non-ASCII UTF-8 characters
  is assumed to be from i18n aware software even if the IN bit is not
  set. (This is because the query may have been sent from an i18n
  aware resolver through a non-i18n aware server.)

- Always down code (see above) the UTF-8 names into ASCII before
  sending them when responding to non-i18n aware software.

- Never have down coded names in the response when responding to i18n
  aware software.

- Always check for down coded names in requests and up code them.

- Do not do zone transfers to non-i18n aware software if the zone
  contains non-ASCII.

- Return the server failed error if a label cannot be down coded and
  fit in the 63 octets allowed.

An i18n aware DNS resolver MUST:

- Up code any down coded names before sending them using the DNS
  protocol.
- Up code any down coded names received in a response.

The result of this is:

- Old software gets an ASCII only domain name using only the old set
  of allowed characters.

- Both i18n aware DNS servers and resolver software must handle up
  coding of domain names.

- Domain names used from old software will work in other protocols
  only allowing ASCII names.

- We may get old software that is never fixed as it still works.

- We do not get rid of the user-unfriendly, encode-everything-in-ASCII
  handling that many non-ASCII users complain about.

Note: As a non-i18n aware DNS server only understands matching using
ASCII case-insensitivity, it may cache i18n responses as different
even though they are i18n equivalent. This will result in more data
being cached but will not give invalid responses.

2.5 DNSSEC

DNSSEC [RFC2535] is complex and not yet fully studied in this
context, especially the canonical DNS name order and the signing of
RRsets.

The canonical DNS name order sorts names with letters as lower case.
In i18n this means folding to lower case, normalising and simplifying
as is done in lookups. This would mean that only a DNS server knowing
the full equivalence rules could do the sorting. It would be better
if this was not needed.

Signing of RRsets is done on the canonical RR form. RFC 2535 is
somewhat unclear about whether domain names inside the RDATA should
be lower cased. If they are not, so that the original format of the
RDATA is preserved, signing should be no problem in i18n aware DNS
software.

The full handling of DNSSEC and i18n data may have to be described in
a separate document.

3. Characters allowed in domain names

The DNS protocol does not place any restriction on characters used in
a domain name. However, applications that make use of DNS data may
have restrictions imposed on what particular values are acceptable in
their environment.
If the client has such restrictions, it is solely responsible for
validating the data from the DNS to ensure that it conforms before it
makes any use of that data. [RFC2181]

For example, domains, hosts and e-mail addresses are represented in
DNS and may have different rules.

As the whole idea of internationalisation of DNS is to get domain
names with non-ASCII, the original recommendation in DNS [RFC1035]
for host/domain names needs to be updated.

It is recommended that domains, hosts and e-mail addresses are all
extended to allow all letters, digits and some separators of UCS.
This has to be defined in another document.

4. User interface issues

Locally on a system or in a user interface, a different character set
than the one defined to be used in the DNS protocol may be used.
Therefore software must map between the local character set and the
character set of the protocol, so that human beings can understand
it.

This means that a zone file that is edited in a text editor by a
person before being loaded into a DNS server must be allowed to be in
the local character set. Software must not assume that the user can
edit text encoded in UTF-8.

A zone file transmitted between DNS software that is not handled by a
human can be transmitted using any format.

When character data is presented to a human or entered by a human,
software must, as far as possible, present it using the local
character set and allow it to be entered using the local character
set. It is the responsibility of the software, not the human, to
convert between the local character set and the one used in the
protocol.

The down coding defined above allows all names to be entered and
displayed for all users, as long as at least the ASCII characters are
supported.
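The down coding and up coding rules of sections 2.2.1 and 2.2.2 can be
sketched in Python as follows. This is a minimal sketch under stated
assumptions, not a reference implementation: the function names
(representable, down_code, up_code) are hypothetical, Python codec
names stand in for the local character set, and returning a name
unchanged when it contains no valid escape is only one possible
reading of the rule that invalid sequences are left as they are. The
63 octet label limit is not checked here.

```python
import string
import unicodedata


def representable(ch, charset):
    """True if the character exists in the local character set."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def down_code(name, charset="ascii"):
    """Down code a name for a local character set (Python codec name)."""
    name = unicodedata.normalize("NFC", name)
    if all(representable(ch, charset) for ch in name):
        return name  # nothing to escape, so hyphens stay single
    parts = []
    for ch in name:
        if ch == "-":
            parts.append("--")  # original hyphens are doubled
        elif representable(ch, charset):
            parts.append(ch)
        else:
            # one hyphen, then two hex digits per UTF-8 octet
            parts.append("-" + ch.encode("utf-8").hex())
    return "".join(parts) + "-"  # trailing hyphen marks down coding


def up_code(name):
    """Reverse down_code; names that do not parse are returned as is."""
    if not name.endswith("-"):
        return name
    body, out, i, escaped = name[:-1], [], 0, False
    while i < len(body):
        if body[i] != "-":
            out.append(body[i])
            i += 1
        elif body.startswith("--", i):
            out.append("-")
            i += 2
        else:
            first = body[i + 1:i + 3]
            if len(first) != 2 or any(c not in string.hexdigits for c in first):
                return name
            lead = int(first, 16)
            # the first octet gives the length of the UTF-8 sequence
            n = (2 if 0xC0 <= lead <= 0xDF else
                 3 if 0xE0 <= lead <= 0xEF else
                 4 if 0xF0 <= lead <= 0xF7 else 0)
            chunk = body[i + 1:i + 1 + 2 * n]
            if (n == 0 or len(chunk) != 2 * n
                    or any(c not in string.hexdigits for c in chunk)):
                return name
            try:
                out.append(bytes.fromhex(chunk).decode("utf-8"))
            except UnicodeDecodeError:
                return name
            escaped = True
            i += 1 + 2 * n
    # assumption: an old name that merely ends in a hyphen has no escapes
    return "".join(out) if escaped else name
```

Applied to the name from the examples in section 2.2.1, written here
with escapes as "Ab-\u00e5r\u0277z", down_code with "ascii" yields
Ab---c3a5r-c9b7z- and up_code of that string returns the original
name. A name such as foo-, which only happens to end in a hyphen, is
returned unchanged by up_code.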

4.1 Applications using DNS software

If an application makes a call to DNS, it must present the data to
the users in the local character set used by the user, down coding if
necessary.

Software used to access DNS should give the application programmer
the possibility of doing queries and getting responses both using the
local character set and using UTF-8.

5. Effect on other protocols

As a domain name may now include non-ASCII, many other protocols that
include domain names need to be updated, for example SMTP, HTTP and
URIs. The down coding to ASCII as defined above can be used when
interfacing with ASCII only software or protocols. Protocols like
SMTP could be extended using ESMTP and a UTF8 option that defines
that all headers are in UTF-8.

It is recommended that protocols updated to handle i18n do this by
encoding character data in the same standard format as defined for
DNS in this document. Encoding character data in ASCII or using
tagged character sets should be avoided.

DNS data does not contain only domain names; e-mail addresses, for
example, are also included. So an e-mail address would be expected to
be changed to include non-ASCII both before and after the @-sign.

Software needs to be updated to follow the user interface
recommendations given above, so that a human will see the characters
in their local character set, if possible.

6. Security Considerations

As always, if software does not check for data that can be a problem,
security may be affected. As more characters than ASCII are allowed,
software that expects only ASCII and performs no checks may now get
security problems.

7. References

[RFC1034]  P. Mockapetris, "Domain Names - Concepts and Facilities",
           STD 13, RFC 1034, November 1987.

[RFC1035]  P. Mockapetris, "Domain Names - Implementation and
           Specification", STD 13, RFC 1035, November 1987.

[RFC2119]  S. Bradner, "Key words for use in RFCs to Indicate
           Requirement Levels", RFC 2119, March 1997.

[RFC2181]  R. Elz and R. Bush, "Clarifications to the DNS
           Specification", RFC 2181, July 1997.

[RFC2279]  F. Yergeau, "UTF-8, a transformation format of ISO 10646",
           RFC 2279, January 1998.

[RFC2535]  D. Eastlake, "Domain Name System Security Extensions",
           RFC 2535, March 1999.

[RFC2671]  P. Vixie, "Extension Mechanisms for DNS (EDNS0)",
           RFC 2671, August 1999.

[ISO10646] ISO/IEC 10646-1:2000. International Standard --
           Information technology -- Universal Multiple-Octet Coded
           Character Set (UCS).

[Unicode]  The Unicode Consortium, "The Unicode Standard -- Version
           3.0", ISBN 0-201-61633-5. Described at
           http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

[UTR15]    M. Davis and M. Duerst, "Unicode Normalization Forms",
           Unicode Technical Report #15, Nov 1999,
           http://www.unicode.org/unicode/reports/tr15/.

[UTR21]    M. Davis, "Case Mappings", Unicode Technical Report #21,
           Dec 1999, http://www.unicode.org/unicode/reports/tr21/.

[UDATA]    The Unicode Character Database,
           ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt.
           The database is described in
           ftp://ftp.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html.

8. Acknowledgements

This document draws on ideas from drafts by Paul Hoffman, Stuart
Kwan, James Gilroy and Kent Karlsson.

Thanks to Magnus Gustavsson, Mark Davis, Kent Karlsson and Andrew
Draper for comments on this draft, and to the members of the IDN
working group for discussions and comments.

Author's Address

Dan Oscarsson
Telia ProSoft AB
Box 85
201 20 Malmo
Sweden

E-mail: Dan.Oscarsson@trab.se