Re: Best practice of using regex on identify none-ASCII email address from Philippe Verdy on 2013-11-02 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 2 Nov 2013 16:43:39 +0100

2013/11/2 Steffen Daode <sdaoden_at_gmail.com>

> There is RFC 5322 which specifies the format of internet messages,
> and then there were the 3+ RFCs (RFC 6530-32) which simply
> redefine that format to be UTF-8 aware and its limits to deal with
> characters not octets (multiply line lengths etc. with 4).
> These UTF-8 extensions can only be used when directly interacting
> with a SMTP / (POP3, IMAP; RFCs 6856 and 6855 i think belong)
> server.
>

And all this is about interacting with SMTP servers in the SMTP protocol
(or related protocols). Nothing correctly describes embedding an address in
a text document (plain text, or even HTML or XML) that can itself be fully
re-encoded (and that may not even accept all raw UTF-8 byte values)! That's
what you've forgotten.

Using the %-escapes used only in the SMTP protocol makes the embedding in
documents really unreadable: this completely defeats the work done to
support native names in IDNA for the domain name part of the address, if
the local part can only be represented using %nn-escaped bytes of UTF-8
sequences.

So instead of reading and typing
    <café@glacé.example.net>
users will have to decipher (or type):
    <caf%C3%A9@glacé.example.net>
Really poor! Things would be better with:
    <?Q?UTF-8?café?@glacé.example.net>

which should work as long as all non-ASCII characters exist in both the
specified target encoding (UTF-8 here) and the envelope encoding (it could
be windows-1252 here, e.g. for this email from me, or the encoding chosen
during the transport to reach you in your mailbox).
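
The condition above can be checked mechanically. Here is a minimal Python
sketch, assuming the codec names "utf-8" and "cp1252" stand in for the
target and envelope encodings. The helper representable_in_all is a
hypothetical name used for illustration only, not part of any mail standard.

```python
def representable_in_all(text: str, encodings: list[str]) -> bool:
    """Return True if every character of text can be encoded in each codec."""
    for name in encodings:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            # At least one character has no mapping in this encoding.
            return False
    return True


# "café" exists in both UTF-8 and windows-1252.
print(representable_in_all("café", ["utf-8", "cp1252"]))
# Cyrillic letters exist in UTF-8 but not in windows-1252.
print(representable_in_all("кафе", ["utf-8", "cp1252"]))
```

A local part that fails this check could not survive re-encoding of the
enclosing document without some form of escaping.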

Note also that an SMTP server configured for ISO-8859-1 will not accept
   To: <caf%C3%A9@glacé.example.net>
but may accept
   To: <caf%E9@glacé.example.net>
or only the raw form
   To: <café@glacé.example.net>
or could accept the last two as *distinct* addresses (each one needing its
own way to be escaped in envelope formats).
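
To make the dependence on the byte encoding concrete, here is a small
Python sketch using the standard urllib.parse module. It only illustrates
how the same local part yields different %nn-escapes under UTF-8 and
ISO-8859-1. The helper name escape_local_part is an assumption made for
this example, and nothing here claims to reproduce what a given SMTP server
actually does.

```python
from urllib.parse import quote, unquote_to_bytes


def escape_local_part(local_part: str, encoding: str) -> str:
    """Percent-escape the bytes of local_part in the given encoding."""
    return quote(local_part, safe="", encoding=encoding)


utf8_form = escape_local_part("café", "utf-8")          # caf%C3%A9
latin1_form = escape_local_part("café", "iso-8859-1")   # caf%E9
print(utf8_form, latin1_form)

# A reader assuming ISO-8859-1 misreads the UTF-8 escapes as two characters.
print(unquote_to_bytes(utf8_form).decode("iso-8859-1"))  # cafÃ©

# A reader assuming UTF-8 cannot decode the ISO-8859-1 escape at all.
try:
    unquote_to_bytes(latin1_form).decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

Without a label saying which encoding the escapes refer to, the two escaped
forms cannot be reliably mapped back to the same address.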

However, SMTP servers are supposed to understand MIME conventions in MIME
headers, so these conventions should be used here to solve the issue in
other text documents, outside SMTP itself. MIME has long offered
quoted-printable (and also base-64), with clear identification of the
encoding in each protocol field.
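
As an illustration of that labelling, here is a Python sketch that decodes
MIME encoded-words (RFC 2047) with the standard email.header module. The
sample strings and the helper decode_word are assumptions made for this
example. RFC 2047 itself does not allow encoded-words inside an address, so
this shows only the self-describing charset-plus-encoding idea, not a valid
address header.

```python
from email.header import decode_header


def decode_word(encoded_word: str) -> str:
    """Decode a string that may contain RFC 2047 encoded-words."""
    parts = []
    for raw, charset in decode_header(encoded_word):
        if isinstance(raw, bytes):
            # Encoded-words come back as bytes plus their declared charset.
            parts.append(raw.decode(charset or "ascii"))
        else:
            # Plain ASCII input is returned as str unchanged.
            parts.append(raw)
    return "".join(parts)


q_utf8 = "=?UTF-8?Q?caf=C3=A9?="         # quoted-printable, UTF-8
b_utf8 = "=?UTF-8?B?Y2Fmw6k=?="          # base-64, UTF-8
q_latin1 = "=?ISO-8859-1?Q?caf=E9?="     # quoted-printable, ISO-8859-1

for word in (q_utf8, b_utf8, q_latin1):
    print(word, "->", decode_word(word))
```

All three forms decode to the same text because each one names its own
charset and transfer encoding, independently of the encoding of the
enclosing document.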

Note also that the user interface of email agents does not have this
limitation: they can display the first form directly in their forms because
they know they are speaking SMTP, so they decipher these themselves,
independently of the encoding of envelope formats. Here we're speaking
about situations where addresses are exchanged outside of SMTP, for example
in word processor documents, readme files...

In addition these newer RFCs are not followed on many SMTP servers that
absolutely don't understand these escapes or that have never accepted the
UTF-8 encoding, but still accept their own local 8-bit encoding **only** in
raw form.
Received on Sat Nov 02 2013 - 10:47:53 CDT
