L2/03-031

Internet Draft                                              Paul Hoffman
draft-hoffman-imaa-00.txt                                     IMC & VPNC
February 5, 2003                                        Adam M. Costello
Expires in six months                                        UC Berkeley

       Internationalizing Mail Addresses in Applications (IMAA)

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

The Internationalizing Domain Names in Applications (IDNA) specification describes how to process domain names that have characters outside the ASCII repertoire. A user who has an internationalized domain name may want to have their full Internet mail address internationalized, including the local part (that is, the part to the left of the "@"). This document describes how to use non-ASCII characters in local parts, by defining internationalized local parts (ILPs), internationalized mail addresses (IMAs), and a mechanism called IMAA for handling them in a standard fashion.

1. Introduction

A mail address consists of a local part, an at-sign (@), and a domain name. The IDNA specification [IDNA] describes how to handle domain names that have non-ASCII characters. This document describes how to handle non-ASCII characters in the rest of the mail address.

This document explicitly does not talk about display names and comments in mail addresses that appear in message headers [RFC2822].
MIME part three [RFC2047] describes how to use an extended set of characters in message headers, and this document does not alter that specification.

This document is being discussed on the ietf-imaa mailing list. See for information about subscribing and the list's archive.

1.1 Relationship to IDNA

This document relies heavily on IDNA for both its concepts and its justification. This document omits a great deal of the justification and design information that might otherwise be found here because it is identical to that in IDNA. Anyone reading this document needs to have first read [IDNA], [PUNYCODE], [NAMEPREP], and [STRINGPREP].

The main differences between how IMAA treats local parts of mail addresses and how IDNA treats domain labels are:

- The ACE prefix for internationalized local parts is different from the ACE prefix for internationalized domain labels.

- The profile of Stringprep used for local parts (Mailprep) does not do case folding, whereas the profile used for domain labels (Nameprep) does do case folding. [[ OPEN ISSUE: Maybe just reuse Nameprep. ]]

- Comparisons between local parts are not required to be case-insensitive, whereas comparisons between domain labels are required to be case-insensitive.

- There is no UseSTD3ASCIIRules flag for local parts.

1.2 Open issues

This section describes the issues that are known to be unresolved. There may also be other issues we haven't thought of yet. This section might be easier to follow after the rest of the draft has been read. This section will be removed before the document is passed to the IESG or RFC Editor for publication. Throughout the draft, comments related to these open issues appear inside brackets like this: [[ OPEN ISSUE: comments ]].

Rather than transform the entire local part as a single unit, another approach is to pick out smaller pieces of the local part, and transform each piece independently, analogous to the way labels are picked out of a domain name and transformed independently.
The tradeoff is complexity versus compatibility with various unofficial conventions for structured local parts, like owner-listname, user+tag, sublocal.local, path!user, etc. If this approach is used, what are the delimiters? Perhaps all non-alphanumeric ASCII characters. In that case, it might be more convenient (for technical reasons) to use an ACE infix rather than an ACE prefix.

Should we do case mapping in the Stringprep profile? That would allow us to reuse Nameprep, rather than introduce a new profile. Another advantage is that all non-ASCII local parts would effectively be case-insensitive, even for legacy MTAs. For example, a user could create an account on some third-party email provider, and could pick a username that happens to be an ACE, even though the mail server is ILP-unaware. The user's correspondents could use the non-ASCII form with their ILP-aware user agents, and would not need to be careful to type the correct case of the letters. Users today generally expect local parts to be case-insensitive, and don't take care to remember the exact case (even though [RFC2822] says they need to), because mail servers have traditionally compared ASCII local parts case-insensitively. Notice that [RFC2821] says that local parts "MAY be case-sensitive" and "a host that expects to receive mail SHOULD avoid defining mailboxes where the Local-part is case-sensitive". The disadvantage of reusing Nameprep would be that an address such as Xbc@example.com, where X stands for an uppercase non-ASCII letter, would appear to the recipient as xbc@example.com, where x stands for the lowercase form of that letter, unlike Abc@example.com, which keeps its case (when handled properly).

A related question is whether to flesh out the mixed-case annotation idea into a precise algorithm, and require that. That would allow internationalized local parts to be both case-preserving and case-insensitive (even when compared by legacy MTAs), just like ASCII local parts usually are.

If we don't use mixed-case annotation, should we try to allow non-lowercase ACE local parts?
For example, if iesg--blahblah gets decoded to non-ASCII, should IESG--BLAHBLAH also get decoded to non-ASCII? Local parts often (unfortunately) get converted to all-uppercase or all-lowercase. It would not be safe to decode IESG--BLAHBLAH unless it were guaranteed to refer to the same mailbox as iesg--blahblah. This guarantee could be accomplished by an administrative requirement that non-lowercase ACE local parts must not be created unless they refer to the same mailbox as the lowercase version. For most existing MTAs, this requirement would be obeyed automatically, because local parts are case-insensitive in most existing MTAs.

If we want to reject non-lowercase ACE forms in ToUnicode, should we add a step to do it early, or let it happen at the end, by using an exact comparison rather than a case-insensitive comparison? Early rejection saves compute cycles, but only for bogo-ACEs that shouldn't be used anyway. Late rejection using an exact comparison makes ToUnicode simpler to implement.

The SMTP spec limits local parts to 64 characters, but is that a limit on the quoted local part or on the dequoted local part? It's not clear. The generate grammar for "local-part" in [RFC2822] is identical to the grammar for "local-part" in [RFC2821] (which was not true between 821 and 822). [RFC2822] is still more permissive than [RFC2821] in that it allows CFWS around the local-part, and the obsolete (i.e., interpret) grammar allows the old 822 forms. Do we need to check the length of the unquoted local part (inside ToASCII) or the quoted local part (outside ToASCII)?

The SMTP limit is 64, but should we impose a stricter limit of 63, to ease reuse of Punycode implementations? Or should we instead mention that 26-bit integers are sufficient not only for IDNA, with its 63-code-point limit, but also for IMAA, with its 64-code-point limit?

Should we keep the requirement about recognizing fullwidth at-signs?
It seems needed for consistency with IDNA's requirement about recognizing fullwidth dots. If we were to drop the at-sign requirement, it would become possible to narrow our focus from "mail address slots" to "local part slots". But would we want to do that? If we keep the at-sign requirement, it's a moot point, because then we're talking about the whole address.

Do we need to say more about stored strings versus query strings?

Should we consider using punctuation other than hyphens in the ACE prefix? Then we could use the same letters as IDNA. For example, if the IDNA ACE prefix were bq--, the IMAA ACE prefix could be bq== or bq##.

Should the prefix be recognized case-sensitively or case-insensitively? Currently it hardly matters; we just need to pick one so that everyone's ToASCII implementation agrees on whether to fail or not. But if we change anything related to case, it might become important to do it one way or the other.

2. Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and "MAY" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Code point, Unicode, and ASCII are defined in [IDNA].

A "mail address" consists of a local-part, an at-sign, and a domain name, in that order. The exact details of the syntax depend on the context; for example, a "mailbox" in [RFC2821] (SMTP) and an "addr-spec" in [RFC2822] (message format) are both mail addresses, but they define slightly different syntaxes for local parts and domain names.

A "dequoted local part" is the simple literal text string that is the intended "meaning" of a local part after it has undergone lexical interpretation. A dequoted local part excludes optional white space, comments, and lexical metacharacters (like backslashes and quotation marks used to quote other characters). Dequoted local parts are generally not allowed in protocols (like SMTP commands and message headers), but they are needed by IMAA as an intermediate form.
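As an illustration of the "dequoted local part" definition above, the following Python sketch removes the lexical metacharacters of a local part written in the style of [RFC2822]. It is not part of the specification: the function name dequote_local_part is hypothetical, the sketch assumes the input is already a syntactically valid local part, and it does not try to cover every obsolete form.

```python
def dequote_local_part(local_part):
    """Return the literal text of a local part (illustrative sketch only).

    Drops unquoted white space, parenthesized comments, the quotation
    marks around quoted strings, and the backslash of each quoted-pair.
    Assumes the input is syntactically valid; nothing is validated here.
    """
    out = []
    in_quotes = False
    comment_depth = 0  # comments may nest
    chars = iter(local_part)
    for ch in chars:
        if ch == "\\" and (in_quotes or comment_depth):
            # quoted-pair: the backslash is a metacharacter and the
            # following character is literal (kept only inside quotes)
            escaped = next(chars, "")
            if in_quotes:
                out.append(escaped)
        elif comment_depth:
            if ch == "(":
                comment_depth += 1
            elif ch == ")":
                comment_depth -= 1
        elif in_quotes:
            if ch == '"':
                in_quotes = False
            else:
                out.append(ch)
        elif ch == '"':
            in_quotes = True
        elif ch == "(":
            comment_depth = 1
        elif not ch.isspace():
            out.append(ch)
    return "".join(out)
```

For example, under these assumptions the quoted local part "john smith" (with its quotation marks) dequotes to the ten characters john smith, and a local part with no metacharacters is returned unchanged.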
An "internationalized local part" (ILP) is anything that satisfies both of the following conditions:

(1) It conforms to the same syntax as a non-internationalized local-part except that non-ASCII Unicode characters are allowed wherever ASCII letters are allowed.

(2) After it has been dequoted, the ToASCII operation can be applied to it without failing (see section 4).

The term "internationalized local part" is a generalization, embracing both old ASCII local parts and new non-ASCII local parts. Although most Unicode characters can appear in internationalized local parts, ToASCII will fail for some inputs. Anything that fails to satisfy condition 2 is not a valid internationalized local part.

[[ OPEN ISSUE: If local parts are transformed in pieces, the second condition would need to say "dequoted...and split into pieces..." "...applied to each piece...". ]]

An "internationalized mail address" (IMA) consists of an internationalized local part, an at-sign, and an internationalized domain name [IDNA], in that order.

Equivalence of local parts is defined in terms of the ToASCII operation, which constructs an ASCII form for a given dequoted local part, whether or not the local part was already an ASCII local part. Local parts are defined to be equivalent if their ASCII forms produced by ToASCII match exactly.

To allow internationalized local parts to be handled by existing applications, an "ACE local part" is used (ACE stands for ASCII Compatible Encoding). An ACE local part is an internationalized local part that can be rendered in ASCII and is equivalent to an internationalized local part that cannot be rendered in ASCII. Given any dequoted internationalized local part that cannot be rendered in ASCII, the ToASCII operation will convert it to an equivalent dequoted ACE local part (whereas an ASCII local part will be left unaltered by ToASCII). ACE local parts are unsuitable for display to users.
The ToUnicode operation will convert any dequoted local part to an equivalent dequoted non-ACE local part. In fact, an ACE local part is formally defined to be any local part whose dequoted form would be altered by ToUnicode (whereas non-ACE local parts are left unaltered by ToUnicode). The ToASCII and ToUnicode operations are specified in section 4.

Every dequoted ACE local part begins with a certain string of ASCII characters, called the "ACE prefix for local parts" (or simply the "ACE prefix" when the context is clear). It is specified in section 5.

[[ OPEN ISSUE: If local parts are split into pieces, it might be more convenient to use an infix rather than a prefix. ]]

A "mail address slot" is defined in this document to be a protocol element or a function argument or a return value (and so on) explicitly designated for carrying a mail address. Mail address slots exist, for example, in the MAIL and RCPT commands of the SMTP protocol, in the To: and Received: fields of message headers, and in a mailto: URI in the href attribute of an HTML tag. General text that just happens to contain a mail address is not a mail address slot; for example, a mail address appearing in the plain text body of a message is not occupying a mail address slot.

An "ILP-aware mail address slot" is defined in this document to be a mail address slot explicitly designated for carrying an internationalized mail address as defined in this document. The designation may be static (for example, in the specification of the protocol or interface) or dynamic (for example, as a result of negotiation in an interactive session).

An "ILP-unaware mail address slot" is defined in this document to be any mail address slot that is not an ILP-aware mail address slot. Obviously, this includes any mail address slot whose specification predates this document.

3. Requirements and applicability

3.1 Requirements

IMAA conformance means adherence to the following four requirements:

1) In an internationalized mail address, the following characters MUST be recognized as at-signs for separating the local part from the domain name: U+0040 (commercial at), U+FF20 (fullwidth commercial at). [[ OPEN ISSUE: Keep that requirement? ]]

2) Whenever a mail address is put into an ILP-unaware mail address slot (see section 2), it MUST contain only ASCII characters. Given an internationalized mail address, an equivalent mail address satisfying this requirement can be obtained by applying ToASCII to the local part as specified in section 4, changing the at-sign to U+0040, and processing the domain name as specified in [IDNA].

3) ACE local parts obtained from mail address slots SHOULD be hidden from users when it is known that the environment can handle the non-ACE form, except when the ACE form is explicitly requested. When it is not known whether or not the environment can handle the non-ACE form, the application MAY use the non-ACE form (which might fail, such as by not being displayed properly), or it MAY use the ACE form (which will look unintelligible to the user). Given an internationalized local part, an equivalent non-ACE local part can be obtained by applying the ToUnicode operation as specified in section 4. When requirements 2 and 3 both apply, requirement 2 takes precedence.

4) Two mail addresses MUST refer to the same mailbox if their domain parts are equivalent (according to [IDNA]) and their local parts are equivalent (that is, their ASCII forms obtained by dequoting and applying ToASCII match exactly), regardless of whether the addresses use the same form of at-sign.

3.2 Applicability

IMAA is applicable to all mail addresses in all mail address slots except where it is explicitly excluded.

This implies that IMAA is applicable to protocols that predate IMAA.
Note that mail addresses occupying mail address slots in those protocols MUST be in ASCII form (see section 3.1, requirement 2).

4. Conversion operations

An application converts a local part before putting it into an ILP-unaware mail address slot or displaying it to a user. This section specifies the steps to perform in the conversion, and the ToASCII and ToUnicode operations.

The input to ToASCII or ToUnicode is a dequoted local part that is a sequence of Unicode code points (remember that all ASCII code points are also Unicode code points). If a local part is represented using a character set other than Unicode or US-ASCII, it will first need to be transcoded to Unicode.

Starting from a local part, the steps that an application takes to do the conversions are:

1) Decide whether the local part is a "stored string" or a "query string" as described in [STRINGPREP]. If this conversion follows the "queries" rule from [STRINGPREP], set the flag called "AllowUnassigned". [[ OPEN ISSUE: We may need more here, possibly pointing to a different section where we specify exactly what kinds of things are stored and queries. ]]

2) Dequote the local part, that is, perform lexical interpretation and remove all nonliteral characters. For example, for local parts that use the lexical syntax of [RFC2821] (SMTP) or [RFC2822] (message format), remove all unquoted unescaped white space and comments, and remove backslashes and quotation marks used to quote other characters. The result is a simple literal text string. [[ OPEN ISSUE: If we want to pick out smaller pieces of local parts, we need to insert that step here. ]]

3) Process the string with either the ToASCII or the ToUnicode operation as appropriate. Typically, you use the ToASCII operation if you are about to put the local part into an ILP-unaware slot, and you use the ToUnicode operation if you are displaying the local part to a user.

4) Quote the local part if necessary. If the local part is to be placed into a slot, the lexical syntax of the slot might not allow the local part as a bare literal string; the string might need to be quoted. For "mailbox" slots [RFC2821] and "addr-spec" slots [RFC2822] the following action suffices: If the string contains any control characters, spaces, or specials [RFC2821], or if it begins or ends with a dot, or contains two consecutive dots, then convert it to a quoted-string by inserting a backslash before every quotation mark, backslash, carriage return, and linefeed, and then surrounding it with quotation marks.

The following two subsections define the ToASCII and ToUnicode operations that are used in step 3.

This description of the protocol uses specific procedure names, names of flags, and so on, in order to facilitate the specification of the protocol. These names, as well as the actual steps of the procedures, are not required of an implementation. In fact, any implementation which has the same external behavior as specified in this document conforms to this specification.

4.1 ToASCII

The ToASCII operation takes a sequence of Unicode code points that make up a dequoted local part and transforms it into a sequence of code points in the ASCII range (0..7F). If ToASCII succeeds, the original sequence and the resulting sequence are equivalent dequoted local parts.

It is important to note that the ToASCII operation can fail. ToASCII fails if any step of it fails. If any step of the ToASCII operation fails, that string MUST NOT be used as an internationalized local part. The method for dealing with this failure is application-specific.

The inputs to ToASCII are a sequence of code points, and the AllowUnassigned flag. The output of ToASCII is either a sequence of ASCII code points or a failure condition.

ToASCII never alters a sequence of code points that are all in the ASCII range to begin with (although it could fail).
Applying the ToASCII operation multiple times has exactly the same effect as applying it just once.

ToASCII consists of the following steps:

1. If all code points in the sequence are in the ASCII range (0..7F) then skip to step 3.

2. Perform the steps of Mailprep (section 6) and fail if there is an error. The AllowUnassigned flag is used in Mailprep.

3. If all code points in the sequence are in the ASCII range (0..7F), then skip to step 8.

4. Verify that the sequence does NOT begin with the ACE prefix.

5. Encode the sequence using the encoding algorithm in [PUNYCODE] and fail if there is an error.

6. Prepend the ACE prefix.

7. Convert all uppercase ASCII letters to lowercase. [[ OPEN ISSUE: All we really need is deterministic case, not necessarily lowercase. Another possibility would be to define an exact algorithm for mixed-case annotations, and require that. ]]

8. Verify that the number of code points is in the range 1 to 64 inclusive. [[ OPEN ISSUE: Is this the right limit and the right place to check it? ]]

4.2 ToUnicode

The ToUnicode operation takes a sequence of Unicode code points that make up a dequoted local part and returns a sequence of Unicode code points. If the input sequence is a dequoted local part in ACE form, then the result is an equivalent dequoted internationalized local part that is not in ACE form, otherwise the original sequence is returned unaltered.

ToUnicode never fails. If any step fails, then the original input sequence is returned immediately in that step.

The ToUnicode output never contains more code points than its input. Note that the number of octets needed to represent a sequence of code points depends on the particular character encoding used.

The inputs to ToUnicode are a sequence of code points and the AllowUnassigned flag. The output of ToUnicode is always a sequence of Unicode code points.

1. If all code points in the sequence are in the ASCII range (0..7F) then skip to step 3.

2. Perform the steps of Mailprep (section 6) and fail if there is an error. The AllowUnassigned flag is used in Mailprep.

3. Verify that the sequence begins with the ACE prefix, and save a copy of the sequence.

4. Remove the ACE prefix.

5. Decode the sequence using the decoding algorithm in [PUNYCODE] and fail if there is an error. Save a copy of the result of this step.

6. Apply ToASCII.

7. Verify that the result of step 6 matches the saved copy from step 3, using an exact (case-sensitive) comparison.

[[ OPEN ISSUE: If we want non-lowercase ACE forms to be recognized and decoded, we need to use a case-insensitive comparison (like IDNA) rather than a case-sensitive comparison, but this would not be safe unless an administrative requirement is imposed or mixed-case annotations are required (see section 1.2). ]]

[[ OPEN ISSUE: Non-lowercase ACE forms could be detected and rejected earlier, in step 3, by verifying the absence of uppercase ASCII letters. Then it wouldn't matter whether the comparison in step 7 is case-sensitive or case-insensitive, because the result of step 6 is certain to contain only non-uppercase ASCII characters (because ToASCII lowercases the result of Punycode), and the saved copy from step 3 was verified to contain no uppercase ASCII letters. But if either of those preconditions changes in a future revision, it will affect how early detection could be done. ]]

[[ OPEN ISSUE: If we decide to do case folding (that is, we use Nameprep instead of Mailprep), and we still want ToUnicode to reject non-lowercase ACE forms, we'll need to reconsider exactly what we want and how it should be done. ]]

8. Return the saved copy from step 5.

5. ACE prefix

[[ Note to the IESG and Internet Draft readers: The two uses of the string "IESG--" below are to be changed at time of publication to a prefix which fulfills the requirements in the first paragraph. IANA will assign this value. ]]

The ACE prefix, used in the conversion operations (section 4), is two ASCII letters followed by two hyphen-minuses. It cannot be the same as the prefix assigned to IDNA. The ToASCII and ToUnicode operations MUST recognize the ACE prefix in a case-sensitive manner.

[[ OPEN ISSUE: Case-insensitive recognition would also work; the only difference is whether IESG--nonascii would cause ToASCII to fail the same way iesg--nonascii causes ToASCII to fail. But if we change anything related to case, like whether we do case folding, or whether non-lowercase ACEs are converted, we will need to reconsider whether the prefix needs to be recognized case-sensitively or case-insensitively. ]]

[[ OPEN ISSUE: We might want to consider other possibilities. ]]

The ACE prefix for IMAA is "IESG--".

This means that an ACE local part might be "IESG--de-jg4avhby1noc0d", where "de-jg4avhby1noc0d" is the part of the ACE local part that is generated by the encoding steps in [PUNYCODE].

While all ACE local parts begin with the ACE prefix, not all local parts beginning with the ACE prefix are necessarily ACE local parts. Non-ACE local parts that begin with the ACE prefix will confuse users and SHOULD NOT be allowed as mailbox names.

6. Mailprep: Stringprep profile for local parts

[[ OPEN ISSUE: If we decide to do case-mapping, we can reuse Nameprep and delete this section. ]]

This section describes the Mailprep profile of [STRINGPREP] that is used for local parts in ToASCII and ToUnicode.
6.1 Profile introduction

This profile defines the following, as required by [STRINGPREP]:

- The intended applicability of the profile: internationalized local parts

- The character repertoire that is the input and output to stringprep: Unicode 3.2, specified in section 6.2

- The mappings used: specified in section 6.3

- The Unicode normalization used: specified in section 6.4

- The characters that are prohibited as output: specified in section 6.5

- Bidirectional character handling: specified in section 6.6

6.2. Character Repertoire

This profile uses Unicode 3.2, as defined in [STRINGPREP] Appendix A.1.

6.3. Mapping

This profile specifies mapping using Table B.1 from [STRINGPREP]. Note that this profile does not do case-mapping because local parts can be case-sensitive.

6.4. Normalization

This profile specifies using Unicode normalization form KC, as described in [STRINGPREP].

6.5. Prohibited Output

This profile specifies prohibiting using the following tables from [STRINGPREP]:

   Table C.1.2
   Table C.2.2
   Table C.3
   Table C.4
   Table C.5
   Table C.6
   Table C.7
   Table C.8
   Table C.9

6.6. Bidirectional characters

This profile specifies checking bidirectional strings as described in [STRINGPREP] section 6.

6.7. Unassigned code points in internationalized local parts

If the processing in section 4 specifies that a list of unassigned code points be used, the system uses table A.1 from [STRINGPREP] as its list of unassigned code points.

7. References

7.1 Normative references

[IDNA] Patrik Faltstrom, et al., "Internationalizing Domain Names in Applications (IDNA)", draft-ietf-idn-idna.
[NAMEPREP] Paul Hoffman and Marc Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names", draft-ietf-idn-nameprep.
[PUNYCODE] Adam Costello, "Punycode: An encoding of Unicode for use with IDNA", draft-ietf-idn-punycode.
[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate Requirement Levels", March 1997, RFC 2119.
[RFC2821] John Klensin, "Simple Mail Transfer Protocol", April 2001, RFC 2821.
[RFC2822] Pete Resnick, "Internet Message Format", April 2001, RFC 2822.
[STRINGPREP] Paul Hoffman and Marc Blanchet, "Preparation of Internationalized Strings ("stringprep")", RFC 3454.

7.2 Informative references

[RFC2047] Keith Moore, "MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text", November 1996, RFC 2047.

8. Security considerations

Because this document normatively refers to [IDNA], [NAMEPREP], [PUNYCODE], and [STRINGPREP], it includes the security considerations from those documents as well.

Internationalized local parts will cause mail addresses to become longer, and possibly make it harder to keep lines in a header under 78 characters. Lines that are longer than 78 characters (which is a SHOULD specification, not a MUST specification, in RFC 2822) could possibly cause mail user agents to fail in ways that affect security.

9. IANA considerations

IANA will assign the ACE prefix in consultation with the IESG, possibly following the same process used for [IDNA].

Section 6 defines a Stringprep profile that must be registered in the IANA registry for Stringprep.

10. Authors' addresses

Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
phoffman@imc.org

Adam M. Costello
University of California, Berkeley
imaa-spec.amc @ nicemice.net
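Illustrative sketch (not part of the specification). The ToASCII and ToUnicode steps of section 4 and the Mailprep profile of section 6 might be sketched in Python roughly as follows. Several things here are assumptions rather than anything this document defines: the function names are hypothetical, the lowercase placeholder prefix "iesg--" stands in for the value IANA is to assign, and the sketch relies on Python's standard stringprep tables, its Unicode 3.2 normalization data, and its built-in "punycode" codec as stand-ins for [STRINGPREP] and [PUNYCODE]. Failure is signalled by raising ValueError.

```python
import stringprep
import unicodedata

# Placeholder only: the real prefix is to be assigned by IANA (section 5).
ACE_PREFIX = "iesg--"

PROHIBITED = (  # section 6.5: tables C.1.2, C.2.2, C.3 through C.9
    stringprep.in_table_c12, stringprep.in_table_c22, stringprep.in_table_c3,
    stringprep.in_table_c4, stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8, stringprep.in_table_c9,
)


def mailprep(s, allow_unassigned=False):
    """Sketch of the Mailprep profile (section 6); raises ValueError."""
    # 6.3: Table B.1 (map to nothing); deliberately no case mapping
    s = "".join(c for c in s if not stringprep.in_table_b1(c))
    # 6.4: normalization form KC with Unicode 3.2 data
    s = unicodedata.ucd_3_2_0.normalize("NFKC", s)
    for c in s:
        if any(table(c) for table in PROHIBITED):
            raise ValueError("prohibited code point")
        # 6.7: unassigned code points are rejected for stored strings
        if not allow_unassigned and stringprep.in_table_a1(c):
            raise ValueError("unassigned code point")
    # 6.6: bidirectional check from [STRINGPREP] section 6
    if any(stringprep.in_table_d1(c) for c in s):
        if any(stringprep.in_table_d2(c) for c in s):
            raise ValueError("mixed RandALCat and LCat characters")
        if not (stringprep.in_table_d1(s[0]) and stringprep.in_table_d1(s[-1])):
            raise ValueError("RandALCat string must start and end RandALCat")
    return s


def to_ascii(seq, allow_unassigned=False):
    """Sketch of ToASCII (section 4.1); raises ValueError on failure."""
    if not seq.isascii():                                 # step 1
        seq = mailprep(seq, allow_unassigned)             # step 2
    if not seq.isascii():                                 # step 3
        if seq.startswith(ACE_PREFIX):                    # step 4
            raise ValueError("input begins with the ACE prefix")
        encoded = seq.encode("punycode").decode("ascii")  # step 5
        seq = (ACE_PREFIX + encoded).lower()              # steps 6 and 7
    if len(seq) not in range(1, 65):                      # step 8
        raise ValueError("length outside 1..64")
    return seq


def to_unicode(seq, allow_unassigned=False):
    """Sketch of ToUnicode (section 4.2); never fails."""
    original = seq
    try:
        if not seq.isascii():                             # step 1
            seq = mailprep(seq, allow_unassigned)         # step 2
        if not seq.startswith(ACE_PREFIX):                # step 3
            return original
        saved = seq
        body = seq[len(ACE_PREFIX):]                      # step 4
        decoded = body.encode("ascii").decode("punycode")  # step 5
        if to_ascii(decoded, allow_unassigned) != saved:  # steps 6 and 7
            return original
        return decoded                                    # step 8
    except ValueError:  # UnicodeError is a subclass of ValueError
        return original
```

With the placeholder prefix, to_ascii("b\u00fccher") yields "iesg--bcher-kva" and to_unicode("iesg--bcher-kva") returns the original non-ASCII string, while to_unicode("IESG--BCHER-KVA") returns its input unaltered because the prefix is recognized case-sensitively and the final comparison is exact.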