L2/03-031
Internet Draft Paul Hoffman
draft-hoffman-imaa-00.txt IMC & VPNC
February 5, 2003 Adam M. Costello
Expires in six months UC Berkeley
Internationalizing Mail Addresses in Applications (IMAA)
Status of this Memo
This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
The Internationalizing Domain Names in Applications (IDNA) specification
describes how to process domain names that have characters outside
the ASCII repertoire. A user who has an internationalized domain name
may want to have their full Internet mail address internationalized,
including the local part (that is, the part to the left of the "@").
This document describes how to use non-ASCII characters in local parts,
by defining internationalized local parts (ILPs), internationalized mail
addresses (IMAs), and a mechanism called IMAA for handling them in a
standard fashion.
1. Introduction
A mail address consists of a local part, an at-sign (@), and a domain
name. The IDNA specification [IDNA] describes how to handle domain names
that have non-ASCII characters. This document describes how to handle
non-ASCII characters in the rest of the mail address.
This document explicitly does not talk about display names and comments
in mail addresses that appear in message headers [RFC2822]. MIME part
three [RFC2047] describes how to use an extended set of characters in
message headers, and this document does not alter that specification.
This document is being discussed on the ietf-imaa mailing list. See
for information about subscribing
and the list's archive.
1.1 Relationship to IDNA
This document relies heavily on IDNA for both its concepts and its
justification. This document omits a great deal of the justification
and design information that might otherwise be found here because it is
identical to that in IDNA. Anyone reading this document needs to have
first read [IDNA], [PUNYCODE], [NAMEPREP], and [STRINGPREP].
The main differences between how IMAA treats local parts of mail
addresses and how IDNA treats domain labels are:
- The ACE prefix for internationalized local parts is different from the
ACE prefix for internationalized domain labels.
- The profile of Stringprep used for local parts (Mailprep) does not do
case folding, whereas the profile used for domain labels (Nameprep)
does do case folding.
[[ OPEN ISSUE: Maybe just reuse Nameprep. ]]
- Comparisons between local parts are not required to be
case-insensitive, whereas comparisons between domain labels are
required to be case-insensitive.
- There is no UseSTD3ASCIIRules flag for local parts.
1.2 Open issues
This section describes the issues that are known to be unresolved.
There may also be other issues we haven't thought of yet. This section
might be easier to follow after the rest of the draft has been read.
This section will be removed before the document is passed to the
IESG or RFC Editor for publication.
Throughout the draft, comments related to these open issues appear
inside brackets like this: [[ OPEN ISSUE: comments ]].
Rather than transform the entire local part as a single unit, another
approach is to pick out smaller pieces of the local part, and transform
each piece independently, analogous to the way labels are picked out of
a domain name and transformed independently. The tradeoff is complexity
versus compatibility with various unofficial conventions for structured
local parts, like owner-listname, user+tag, sublocal.local, path!user,
etc. If this approach is used, what are the delimiters? Perhaps all
non-alphanumeric ASCII characters. In that case, it might be more
convenient (for technical reasons) to use an ACE infix rather than an
ACE prefix.
Should we do case mapping in the Stringprep profile? That would allow
us to reuse Nameprep, rather than introduce a new profile. Another
advantage is that all non-ASCII local parts would effectively be
case-insensitive, even for legacy MTAs. For example, a user could
create an account on some third-party email provider, and could pick
a username that happens to be an ACE, even though the mail server is
ILP-unaware. The user's correspondents could use the non-ASCII form
with their ILP-aware user agents, and would not need to be careful to
type the correct case of the letters. Users today generally expect
local parts to be case-insensitive, and don't take care to remember the
exact case (even though [RFC2822] says they need to), because mail servers
have traditionally compared ASCII local parts case-insensitively.
Notice that [RFC2821] says that local parts "MAY be case-sensitive" and
"a host that expects to receive mail SHOULD avoid defining mailboxes
where the Local-part is case-sensitive". The disadvantage of reusing
Nameprep would be that an address whose local part contains an uppercase
non-ASCII letter would appear to the recipient with that letter in
lowercase, unlike Abc@example.com, which keeps its case (when handled
properly).
A related question is whether to flesh out the mixed-case annotation
idea into a precise algorithm, and require that. That would allow
internationalized local parts to be both case-preserving and
case-insensitive (even when compared by legacy MTAs), just like ASCII
local parts usually are.
If we don't use mixed-case annotation, should we try to allow
non-lowercase ACE local parts? For example, if iesg--blahblah
gets decoded to non-ASCII, should IESG--BLAHBLAH also get decoded
to non-ASCII? Local parts often (unfortunately) get converted to
all-uppercase or all-lowercase. It would not be safe to decode
IESG--BLAHBLAH unless it were guaranteed to refer to the same mailbox
as iesg--blahblah. This guarantee could be accomplished by an
administrative requirement that non-lowercase ACE local parts must
not be created unless they refer to the same mailbox as the lowercase
version. For most existing MTAs, this requirement would be obeyed
automatically, because local parts are case-insensitive in most existing
MTAs.
If we want to reject non-lowercase ACE forms in ToUnicode, should we
add a step to do it early, or let it happen at the end, by using an
exact comparison rather than a case-insensitive comparison? Early
rejection saves compute cycles, but only for bogo-ACEs that shouldn't be
used anyway. Late rejection using an exact comparison makes ToUnicode
simpler to implement.
The SMTP spec limits local parts to 64 characters, but is that a limit
on the quoted local part or on the dequoted local part? It's not clear.
The generate grammar for "local-part" in [RFC2822] is identical to the
grammar for "local-part" in [RFC2821] (which was not true between 821
and 822). [RFC2822] is still more permissive than [RFC2821] in that it
allows CFWS around the local-part, and the obsolete (i.e., interpret)
grammar allows the old 822 forms. Do we need to check the length of the
unquoted local part (inside ToASCII) or the quoted local part (outside
ToASCII)? The SMTP limit is 64, but should we impose a stricter limit
of 63, to ease reuse of Punycode implementations? Or should we instead
mention that 26-bit integers are sufficient not only for IDNA, with its
63-code-point limit, but also for IMAA, with its 64-code-point limit?
Should we keep the requirement about recognizing fullwidth at-signs? It
seems needed for consistency with IDNA's requirement about recognizing
fullwidth dots.
If we were to drop the at-sign requirement, it would become possible to
narrow our focus from "mail address slots" to "local part slots". But
would we want to do that? If we keep the at-sign requirement, it's a
moot point, because then we're talking about the whole address.
Do we need to say more about stored strings versus query strings?
Should we consider using punctuation other than hyphens in the ACE
prefix? Then we could use the same letters as IDNA. For example, if
the IDNA ACE prefix were bq--, the IMAA ACE prefix could be bq== or
bq##.
Should the prefix be recognized case-sensitively or case-insensitively?
Currently it hardly matters; we just need to pick one so that everyone's
ToASCII implementation agrees on whether to fail or not. But if we
change anything related to case, it might become important to do it one
way or the other.
2. Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].
Code point, Unicode, and ASCII are defined in [IDNA].
A "mail address" consists of a local-part, an at-sign, and a domain
name, in that order. The exact details of the syntax depend on the
context; for example, a "mailbox" in [RFC2821] (SMTP) and an "addr-spec"
in [RFC2822] (message format) are both mail addresses, but they define
slightly different syntaxes for local parts and domain names.
A "dequoted local part" is the simple literal text string that is the
intended "meaning" of a local part after it has undergone lexical
interpretation. A dequoted local part excludes optional white space,
comments, and lexical metacharacters (like backslashes and quotation
marks used to quote other characters). Dequoted local parts are
generally not allowed in protocols (like SMTP commands and message
headers), but they are needed by IMAA as an intermediate form.
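As an illustration of dequoting, the following Python sketch performs a
simplified lexical interpretation of a local part written in the
[RFC2822] style. It is not a full parser: the function name "dequote"
is hypothetical, and the sketch assumes that comments are parenthesized,
that a backslash quotes exactly one following character, and that
unquoted white space is limited to space, tab, carriage return, and
linefeed.

```python
def dequote(local_part):
    """Return the literal text of a local part (simplified sketch)."""
    out = []
    in_quotes = False
    comment_depth = 0
    chars = iter(local_part)
    for ch in chars:
        if ch == "\\":
            # A backslash quotes the next character; the backslash is dropped.
            escaped = next(chars, "")
            if comment_depth == 0:
                out.append(escaped)
        elif comment_depth > 0:
            # Inside a (possibly nested) comment: discard everything.
            if ch == "(":
                comment_depth += 1
            elif ch == ")":
                comment_depth -= 1
        elif in_quotes:
            if ch == '"':
                in_quotes = False
            else:
                out.append(ch)
        elif ch == '"':
            in_quotes = True
        elif ch == "(":
            comment_depth = 1
        elif ch in " \t\r\n":
            # Unquoted white space is not part of the literal text.
            pass
        else:
            out.append(ch)
    return "".join(out)
```

For example, the quoted form "john smith" (with its quotation marks)
and the unquoted form john (a comment) .smith would be dequoted to
john smith and john.smith respectively.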
An "internationalized local part" (ILP) is anything that satisfies
both of the following conditions: (1) It conforms to the same syntax
as a non-internationalized local-part except that non-ASCII Unicode
characters are allowed wherever ASCII letters are allowed. (2) After
it has been dequoted, the ToASCII operation can be applied to it
without failing (see section 4). The term "internationalized local
part" is a generalization, embracing both old ASCII local parts and
new non-ASCII local parts. Although most Unicode characters can
appear in internationalized local parts, ToASCII will fail for some
inputs. Anything that fails to satisfy condition 2 is not a valid
internationalized local part.
[[ OPEN ISSUE: If local parts are transformed in pieces, the second
condition would need to say "dequoted...and split into pieces..."
"...applied to each piece...". ]]
An "internationalized mail address" (IMA) consists of an
internationalized local part, an at-sign, and an internationalized
domain name [IDNA], in that order.
Equivalence of local parts is defined in terms of the ToASCII operation,
which constructs an ASCII form for a given dequoted local part, whether
or not the local part was already an ASCII local part. Local parts are
defined to be equivalent if their ASCII forms produced by ToASCII match
exactly.
To allow internationalized local parts to be handled by existing
applications, an "ACE local part" is used (ACE stands for ASCII
Compatible Encoding). An ACE local part is an internationalized
local part that can be rendered in ASCII and is equivalent to an
internationalized local part that cannot be rendered in ASCII. Given
any dequoted internationalized local part that cannot be rendered in
ASCII, the ToASCII operation will convert it to an equivalent dequoted
ACE local part (whereas an ASCII local part will be left unaltered
by ToASCII). ACE local parts are unsuitable for display to users.
The ToUnicode operation will convert any dequoted local part to an
equivalent dequoted non-ACE local part. In fact, an ACE local part is
formally defined to be any local part whose dequoted form would be
altered by ToUnicode (whereas non-ACE local parts are left unaltered
by ToUnicode). The ToASCII and ToUnicode operations are specified in
section 4.
Every dequoted ACE local part begins with a certain string of ASCII
characters, called the "ACE prefix for local parts" (or simply the "ACE
prefix" when the context is clear). It is specified in section 5.
[[ OPEN ISSUE: If local parts are split into pieces, it might be more
convenient to use an infix rather than a prefix. ]]
A "mail address slot" is defined in this document to be a protocol
element or a function argument or a return value (and so on) explicitly
designated for carrying a mail address. Mail address slots exist, for
example, in the MAIL and RCPT commands of the SMTP protocol, in the To:
and Received: fields of message headers, and in a mailto: URI in the
href attribute of an HTML tag. General text that just happens to
contain a mail address is not a mail address slot; for example, a mail
address appearing in the plain text body of a message is not occupying a
mail address slot.
An "ILP-aware mail address slot" is defined in this document to
be a mail address slot explicitly designated for carrying an
internationalized mail address as defined in this document. The
designation may be static (for example, in the specification of
the protocol or interface) or dynamic (for example, as a result of
negotiation in an interactive session).
An "ILP-unaware mail address slot" is defined in this document to be any
mail address slot that is not an ILP-aware mail address slot. Obviously,
this includes any mail address slot whose specification predates this
document.
3. Requirements and applicability
3.1 Requirements
IMAA conformance means adherence to the following four requirements:
1) In an internationalized mail address, the following characters MUST
be recognized as at-signs for separating the local part from the domain
name: U+0040 (commercial at), U+FF20 (fullwidth commercial at).
[[ OPEN ISSUE: Keep that requirement? ]]
2) Whenever a mail address is put into an ILP-unaware mail address
slot (see section 2), it MUST contain only ASCII characters. Given an
internationalized mail address, an equivalent mail address satisfying
this requirement can be obtained by applying ToASCII to the local
part as specified in section 4, changing the at-sign to U+0040, and
processing the domain name as specified in [IDNA].
3) ACE local parts obtained from mail address slots SHOULD be hidden
from users when it is known that the environment can handle the non-ACE
form, except when the ACE form is explicitly requested. When it is not
known whether or not the environment can handle the non-ACE form, the
application MAY use the non-ACE form (which might fail, such as by not
being displayed properly), or it MAY use the ACE form (which will look
unintelligible to the user). Given an internationalized local part, an
equivalent non-ACE local part can be obtained by applying the ToUnicode
operation as specified in section 4. When requirements 2 and 3 both
apply, requirement 2 takes precedence.
4) Two mail addresses MUST refer to the same mailbox if their domain
parts are equivalent (according to [IDNA]) and their local parts are
equivalent (that is, their ASCII forms obtained by dequoting and
applying ToASCII match exactly), regardless of whether the addresses use
the same form of at-sign.
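Requirement 1 can be sketched in Python as follows. The function name
"split_address" is hypothetical, and splitting at the last at-sign is
an assumption of this sketch (it relies on domain names never containing
an at-sign, whereas a quoted local part might). The local part and
domain name returned here would still need the processing described in
section 4 and in [IDNA] respectively.

```python
# The two code points that requirement 1 says must be recognized.
AT_SIGNS = ("\u0040", "\uff20")  # commercial at, fullwidth commercial at


def split_address(address):
    """Split a mail address into (local part, domain name)."""
    # Assumption: the separator is the last at-sign of either form.
    index = max(address.rfind(at) for at in AT_SIGNS)
    if index < 0:
        raise ValueError("no at-sign found in mail address")
    return address[:index], address[index + 1:]


local, domain = split_address("user\uff20example.com")
# Requirement 2 asks for U+0040 when the address goes into an
# ILP-unaware slot, so the parts are rejoined with the ASCII at-sign.
rejoined = local + "@" + domain
```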
3.2 Applicability
IMAA is applicable to all mail addresses in all mail address slots
except where it is explicitly excluded.
This implies that IMAA is applicable to protocols that predate IMAA.
Note that mail addresses occupying mail address slots in those protocols
MUST be in ASCII form (see section 3.1, requirement 2).
4. Conversion operations
An application converts a local part before putting it into an
ILP-unaware mail address slot or displaying it to a user. This section
specifies the steps to perform in the conversion, and the ToASCII and
ToUnicode operations.
The input to ToASCII or ToUnicode is a dequoted local part that is a
sequence of Unicode code points (remember that all ASCII code points
are also Unicode code points). If a local part is represented using a
character set other than Unicode or US-ASCII, it will first need to be
transcoded to Unicode.
Starting from a local part, the steps that an application takes to do
the conversions are:
1) Decide whether the local part is a "stored string" or a "query
string" as described in [STRINGPREP]. If this conversion follows the
"queries" rule from [STRINGPREP], set the flag called "AllowUnassigned".
[[ OPEN ISSUE: We may need more here, possibly pointing to a different
section where we specify exactly what kinds of things are stored and
queries. ]]
2) Dequote the local part, that is, perform lexical interpretation and
remove all nonliteral characters. For example, for local parts that use
the lexical syntax of [RFC2821] (SMTP) or [RFC2822] (message format),
remove all unquoted unescaped white space and comments, and remove
backslashes and quotation marks used to quote other characters. The
result is a simple literal text string.
[[ OPEN ISSUE: If we want to pick out smaller pieces of local parts, we
need to insert that step here. ]]
3) Process the string with either the ToASCII or the ToUnicode operation
as appropriate. Typically, you use the ToASCII operation if you are
about to put the local part into an ILP-unaware slot, and you use the
ToUnicode operation if you are displaying the local part to a user.
4) Quote the local part if necessary. If the local part is to be placed
into a slot, the lexical syntax of the slot might not allow the local
part as a bare literal string; the string might need to be quoted. For
"mailbox" slots [RFC2821] and "addr-spec" slots [RFC2822] the following
action suffices: If the string contains any control characters, spaces,
or specials [RFC2821], or if it begins or ends with a dot, or contains
two consecutive dots, then convert it to a quoted-string by inserting a
backslash before every quotation mark, backslash, carriage return, and
linefeed, and then surrounding it with quotation marks.
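The quoting action in step 4 can be sketched in Python as shown below.
The function name "quote_if_needed" is hypothetical. The set of
specials is an assumption of this sketch: it is taken from the
"specials" rule of [RFC2822] with the dot left out, because step 4
treats dots separately.

```python
# Assumed set of specials: the RFC 2822 "specials" without the dot.
SPECIALS = set('()<>[]:;@\\,"')


def quote_if_needed(text):
    """Return text as a bare string or as a quoted-string, per step 4."""
    needs_quoting = (
        any(
            ch in SPECIALS or ch == " " or ord(ch) < 32 or ord(ch) == 127
            for ch in text
        )
        or text.startswith(".")
        or text.endswith(".")
        or ".." in text
    )
    if not needs_quoting:
        return text
    # Backslash-escape quotation marks, backslashes, CR, and LF.
    escaped = "".join("\\" + ch if ch in '"\\\r\n' else ch for ch in text)
    return '"' + escaped + '"'
```

With this sketch, john.smith is left bare, while john smith and a..b
are each surrounded by quotation marks.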
The following two subsections define the ToASCII and ToUnicode
operations that are used in step 3.
This description of the protocol uses specific procedure names, names of
flags, and so on, in order to facilitate the specification of the
protocol. These names, as well as the actual steps of the procedures,
are not required of an implementation. In fact, any implementation which
has the same external behavior as specified in this document conforms to
this specification.
4.1 ToASCII
The ToASCII operation takes a sequence of Unicode code points that make
up a dequoted local part and transforms it into a sequence of code
points in the ASCII range (0..7F). If ToASCII succeeds, the original
sequence and the resulting sequence are equivalent dequoted local parts.
It is important to note that the ToASCII operation can fail. ToASCII
fails if any step of it fails. If any step of the ToASCII operation
fails, that string MUST NOT be used as an internationalized local
part. The method for dealing with this failure is application-specific.
The inputs to ToASCII are a sequence of code points, and the
AllowUnassigned flag. The output of ToASCII is either a sequence of
ASCII code points or a failure condition.
ToASCII never alters a sequence of code points that are all in the ASCII
range to begin with (although it could fail). Applying the ToASCII
operation multiple times has exactly the same effect as applying it just
once.
ToASCII consists of the following steps:
1. If all code points in the sequence are in the ASCII range (0..7F)
then skip to step 3.
2. Perform the steps of Mailprep (section 6) and fail if there is
an error. The AllowUnassigned flag is used in Mailprep.
3. If all code points in the sequence are in the ASCII range
(0..7F), then skip to step 8.
4. Verify that the sequence does NOT begin with the ACE prefix.
5. Encode the sequence using the encoding algorithm in [PUNYCODE]
and fail if there is an error.
6. Prepend the ACE prefix.
7. Convert all uppercase ASCII letters to lowercase.
[[ OPEN ISSUE: All we really need is deterministic case, not
necessarily lowercase. Another possibility would be to define an
exact algorithm for mixed-case annotations, and require that. ]]
8. Verify that the number of code points is in the range 1 to 64
inclusive.
[[ OPEN ISSUE: Is this the right limit and the right place to
check it? ]]
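The ToASCII steps above can be sketched in Python as follows. This is
an illustration under stated assumptions, not a reference
implementation. The prefix "iesg--" is only a placeholder (written in
lowercase here because step 7 lowercases the output), the names
"to_ascii" and "mailprep" are hypothetical, and Mailprep is
approximated with the tables of the Python standard "stringprep" module
and the Unicode 3.2 database in "unicodedata". Failure is signaled by
raising ValueError.

```python
import stringprep
import unicodedata

ACE_PREFIX = "iesg--"  # placeholder (assumption); no prefix is assigned yet
MAX_CODE_POINTS = 64   # limit checked in step 8

PROHIBITED_TABLES = (
    stringprep.in_table_c12, stringprep.in_table_c22, stringprep.in_table_c3,
    stringprep.in_table_c4, stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8, stringprep.in_table_c9,
)


def mailprep(text, allow_unassigned=False):
    """Condensed sketch of the Mailprep profile (section 6)."""
    # Map Table B.1 characters to nothing; no case mapping is done.
    mapped = "".join(ch for ch in text if not stringprep.in_table_b1(ch))
    normalized = unicodedata.ucd_3_2_0.normalize("NFKC", mapped)
    for ch in normalized:
        if any(in_table(ch) for in_table in PROHIBITED_TABLES):
            raise ValueError("prohibited code point")
        if not allow_unassigned and stringprep.in_table_a1(ch):
            raise ValueError("unassigned code point")
    # Bidirectional check from [STRINGPREP] section 6.
    if any(stringprep.in_table_d1(ch) for ch in normalized):
        if any(stringprep.in_table_d2(ch) for ch in normalized):
            raise ValueError("RandALCat and LCat characters mixed")
        if not (stringprep.in_table_d1(normalized[0])
                and stringprep.in_table_d1(normalized[-1])):
            raise ValueError("RandALCat must be first and last")
    return normalized


def to_ascii(local_part, allow_unassigned=False):
    """Apply ToASCII to a dequoted local part; raise ValueError on failure."""
    seq = local_part
    if not seq.isascii():                       # step 1
        seq = mailprep(seq, allow_unassigned)   # step 2
    if not seq.isascii():                       # step 3
        if seq.startswith(ACE_PREFIX):          # step 4
            raise ValueError("input already begins with the ACE prefix")
        # Step 5: UnicodeError is a subclass of ValueError.
        encoded = seq.encode("punycode").decode("ascii")
        seq = (ACE_PREFIX + encoded).lower()    # steps 6 and 7
    if not 1 <= len(seq) <= MAX_CODE_POINTS:    # step 8
        raise ValueError("length out of range")
    return seq
```

Under these assumptions an all-ASCII input such as Abc is returned
unchanged, and applying to_ascii to its own output returns that output
again.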
4.2 ToUnicode
The ToUnicode operation takes a sequence of Unicode code points that
make up a dequoted local part and returns a sequence of Unicode code
points. If the input sequence is a dequoted local part in ACE form,
then the result is an equivalent dequoted internationalized local part
that is not in ACE form, otherwise the original sequence is returned
unaltered.
ToUnicode never fails. If any step fails, then the original input
sequence is returned immediately in that step.
The ToUnicode output never contains more code points than its input.
Note that the number of octets needed to represent a sequence of code
points depends on the particular character encoding used.
The inputs to ToUnicode are a sequence of code points and the
AllowUnassigned flag. The output of ToUnicode is always a sequence of
Unicode code points.
1. If all code points in the sequence are in the ASCII range (0..7F)
then skip to step 3.
2. Perform the steps of Mailprep (section 6) and fail if there is an
error. The AllowUnassigned flag is used in Mailprep.
3. Verify that the sequence begins with the ACE prefix, and save a
copy of the sequence.
4. Remove the ACE prefix.
5. Decode the sequence using the decoding algorithm in [PUNYCODE]
and fail if there is an error. Save a copy of the result of
this step.
6. Apply ToASCII.
7. Verify that the result of step 6 matches the saved copy from
step 3, using an exact (case-sensitive) comparison.
[[ OPEN ISSUE: If we want non-lowercase ACE forms to be
recognized and decoded, we need to use a case-insensitive
comparison (like IDNA) rather than a case-sensitive comparison,
but this would not be safe unless an administrative requirement
is imposed or mixed-case annotations are required (see section
1.2). ]]
[[ OPEN ISSUE: Non-lowercase ACE forms could be detected and
rejected earlier, in step 3, by verifying the absence of
uppercase ASCII letters. Then it wouldn't matter whether the
comparison in step 7 is case-sensitive or case-insensitive,
because the result of step 6 is certain to contain only
non-uppercase ASCII characters (because ToASCII lowercases the
result of Punycode), and the saved copy from step 3 was verified
to contain no uppercase ASCII letters. But if either of those
preconditions changes in a future revision, it will affect how
early detection could be done. ]]
[[ OPEN ISSUE: If we decide to do case folding (that is, we use
Nameprep instead of Mailprep), and we still want ToUnicode to
reject non-lowercase ACE forms, we'll need to reconsider exactly
what we want and how it should be done. ]]
8. Return the saved copy from step 5.
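The ToUnicode steps above can be sketched in Python as follows. The
same assumptions apply as for any sketch of ToASCII: the prefix
"iesg--" is a lowercase placeholder, the function names are
hypothetical, and Mailprep is approximated with the Python standard
"stringprep" and "unicodedata" modules. The helper functions are
condensed so that the block is self-contained. Any failure results in
the original input being returned.

```python
import stringprep
import unicodedata

ACE_PREFIX = "iesg--"  # placeholder (assumption); no prefix is assigned yet
PROHIBITED = [getattr(stringprep, "in_table_" + name) for name in
              ("c12", "c22", "c3", "c4", "c5", "c6", "c7", "c8", "c9")]


def mailprep(text, allow_unassigned=False):
    """Condensed sketch of the Mailprep profile (section 6)."""
    mapped = "".join(ch for ch in text if not stringprep.in_table_b1(ch))
    result = unicodedata.ucd_3_2_0.normalize("NFKC", mapped)
    for ch in result:
        if any(in_table(ch) for in_table in PROHIBITED):
            raise ValueError("prohibited code point")
        if not allow_unassigned and stringprep.in_table_a1(ch):
            raise ValueError("unassigned code point")
    if any(stringprep.in_table_d1(ch) for ch in result):
        if (any(stringprep.in_table_d2(ch) for ch in result)
                or not stringprep.in_table_d1(result[0])
                or not stringprep.in_table_d1(result[-1])):
            raise ValueError("bidirectional check failed")
    return result


def to_ascii(local_part, allow_unassigned=False):
    """Condensed sketch of ToASCII (section 4.1)."""
    seq = local_part
    if not seq.isascii():
        seq = mailprep(seq, allow_unassigned)
    if not seq.isascii():
        if seq.startswith(ACE_PREFIX):
            raise ValueError("input already begins with the ACE prefix")
        seq = (ACE_PREFIX + seq.encode("punycode").decode("ascii")).lower()
    if not 1 <= len(seq) <= 64:
        raise ValueError("length out of range")
    return seq


def to_unicode(local_part, allow_unassigned=False):
    """Apply ToUnicode; never fails, returns the input if any step fails."""
    try:
        seq = local_part
        if not seq.isascii():                       # step 1
            seq = mailprep(seq, allow_unassigned)   # step 2
        if not seq.startswith(ACE_PREFIX):          # step 3 (case-sensitive)
            return local_part
        saved_ace = seq
        stripped = seq[len(ACE_PREFIX):]            # step 4
        decoded = stripped.encode("ascii").decode("punycode")  # step 5
        if to_ascii(decoded, allow_unassigned) != saved_ace:   # steps 6, 7
            return local_part
        return decoded                              # step 8
    except ValueError:  # also covers UnicodeError from the codecs
        return local_part
```

Because the comparison in step 7 is exact and ToASCII produces only
lowercase ACE forms, an all-uppercase variant of an ACE local part is
returned unchanged by this sketch.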
5. ACE prefix
[[ Note to the IESG and Internet Draft readers: The two uses of the
string "IESG--" below are to be changed at time of publication to a
prefix which fulfills the requirements in the first paragraph. IANA will
assign this value. ]]
The ACE prefix, used in the conversion operations (section 4), is two
ASCII letters followed by two hyphen-minuses. It cannot be the same as
the prefix assigned to IDNA. The ToASCII and ToUnicode operations MUST
recognize the ACE prefix in a case-sensitive manner.
[[ OPEN ISSUE: Case-insensitive recognition would also work; the only
difference is whether IESG--nonascii would cause ToASCII to fail the
same way iesg--nonascii causes ToASCII to fail. But if we change
anything related to case, like whether we do case folding, or whether
non-lowercase ACEs are converted, we will need to reconsider whether the
prefix needs to be recognized case-sensitively or case-insensitively. ]]
[[ OPEN ISSUE: We might want to consider other possibilities. ]]
The ACE prefix for IMAA is "IESG--".
This means that an ACE local part might be "IESG--de-jg4avhby1noc0d",
where "de-jg4avhby1noc0d" is the part of the ACE local part that is
generated by the encoding steps in [PUNYCODE].
While all ACE local parts begin with the ACE prefix, not all local parts
beginning with the ACE prefix are necessarily ACE local parts. Non-ACE
local parts that begin with the ACE prefix will confuse users and SHOULD
NOT be allowed as mailbox names.
6. Mailprep: Stringprep profile for local parts
[[ OPEN ISSUE: If we decide to do case-mapping, we can reuse Nameprep
and delete this section. ]]
This section describes the Mailprep profile of [STRINGPREP] that is used
for local parts in ToASCII and ToUnicode.
6.1 Profile introduction
This profile defines the following, as required by [STRINGPREP]:
- The intended applicability of the profile: internationalized
local parts
- The character repertoire that is the input and output to stringprep:
Unicode 3.2, specified in section 6.2
- The mappings used: specified in section 6.3
- The Unicode normalization used: specified in section 6.4
- The characters that are prohibited as output: specified in section 6.5
- Bidirectional character handling: specified in section 6.6
6.2. Character Repertoire
This profile uses Unicode 3.2, as defined in [STRINGPREP] Appendix
A.1.
6.3. Mapping
This profile specifies mapping using Table B.1 from [STRINGPREP]. Note
that this profile does not do case-mapping because local parts can be
case-sensitive.
6.4. Normalization
This profile specifies using Unicode normalization form KC, as described
in [STRINGPREP].
6.5. Prohibited Output
This profile prohibits, as output, the characters in the following tables from
[STRINGPREP]:
Table C.1.2
Table C.2.2
Table C.3
Table C.4
Table C.5
Table C.6
Table C.7
Table C.8
Table C.9
6.6. Bidirectional characters
This profile specifies checking bidirectional strings as described
in [STRINGPREP] section 6.
6.7. Unassigned code points in internationalized local parts
If the processing in section 4 specifies that a list of unassigned code
points be used, the system uses table A.1 from [STRINGPREP] as its list
of unassigned code points.
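The profile described in sections 6.2 through 6.7 can be sketched in
Python using the standard "stringprep" module, which provides lookup
functions for the [STRINGPREP] tables, and the Unicode 3.2 database
exposed as "unicodedata.ucd_3_2_0". The function name "mailprep" and
the use of ValueError to signal an error are assumptions of this
sketch.

```python
import stringprep
import unicodedata

# Section 6.5: the tables whose characters are prohibited as output.
PROHIBITED_TABLES = (
    stringprep.in_table_c12, stringprep.in_table_c22, stringprep.in_table_c3,
    stringprep.in_table_c4, stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8, stringprep.in_table_c9,
)


def mailprep(text, allow_unassigned=False):
    """Prepare a dequoted local part; raise ValueError on error."""
    # Section 6.3: Table B.1 characters are mapped to nothing.
    # No case mapping is performed.
    mapped = "".join(ch for ch in text if not stringprep.in_table_b1(ch))

    # Section 6.4: normalization form KC with Unicode 3.2 data.
    normalized = unicodedata.ucd_3_2_0.normalize("NFKC", mapped)

    for ch in normalized:
        # Section 6.5: prohibited output.
        if any(in_table(ch) for in_table in PROHIBITED_TABLES):
            raise ValueError(f"prohibited code point U+{ord(ch):04X}")
        # Section 6.7: unassigned code points (Table A.1) are rejected
        # unless the AllowUnassigned flag is set.
        if not allow_unassigned and stringprep.in_table_a1(ch):
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")

    # Section 6.6: bidirectional check from [STRINGPREP] section 6.
    if any(stringprep.in_table_d1(ch) for ch in normalized):
        if any(stringprep.in_table_d2(ch) for ch in normalized):
            raise ValueError("RandALCat and LCat characters mixed")
        if not (stringprep.in_table_d1(normalized[0])
                and stringprep.in_table_d1(normalized[-1])):
            raise ValueError("RandALCat must be first and last")
    return normalized
```

In this sketch, uppercase letters pass through with their case intact,
a soft hyphen (U+00AD) is removed by the mapping step, and a fullwidth
letter is replaced by its ASCII counterpart during normalization.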
7. References
7.1 Normative references
[IDNA] Patrik Faltstrom, et al., "Internationalizing Domain Names in
Applications (IDNA)", draft-ietf-idn-idna.
[NAMEPREP] Paul Hoffman and Marc Blanchet, "Nameprep: A Stringprep
Profile for Internationalized Domain Names", draft-ietf-idn-nameprep.
[PUNYCODE] Adam Costello, "Punycode: An encoding of Unicode for use with
IDNA", draft-ietf-idn-punycode.
[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.
[RFC2821] John Klensin, "Simple Mail Transfer Protocol", April 2001, RFC
2821.
[RFC2822] Pete Resnick, "Internet Message Format", April 2001, RFC 2822.
[STRINGPREP] Paul Hoffman and Marc Blanchet, "Preparation of
Internationalized Strings ("stringprep")", RFC 3454.
7.2 Informative references
[RFC2047] Keith Moore, "MIME (Multipurpose Internet Mail Extensions)
Part Three: Message Header Extensions for Non-ASCII Text", November
1996, RFC 2047.
8. Security considerations
Because this document normatively refers to [IDNA], [NAMEPREP],
[PUNYCODE], and [STRINGPREP], it includes the security considerations
from those documents as well.
Internationalized local parts will cause mail addresses to become
longer, and possibly make it harder to keep lines in a header under 78
characters. Lines that are longer than 78 characters (which is a SHOULD
specification, not a MUST specification, in RFC 2822) could possibly
cause mail user agents to fail in ways that affect security.
9. IANA considerations
IANA will assign the ACE prefix in consultation with the IESG, possibly
following the same process used for [IDNA].
Section 6 defines a Stringprep profile that must be registered in the
IANA registry for Stringprep.
10. Authors' addresses
Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
phoffman@imc.org
Adam M. Costello
University of California, Berkeley
imaa-spec.amc @ nicemice.net