Re: Best practice of using regex on identify none-ASCII email address from Philippe Verdy on 2013-11-02 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 2 Nov 2013 12:25:52 +0100

Me too... only raw bytes are ccepted by SMTP or POP3 protocols.
This does not mean that within URLs they can not (should not) be escaped !

Of course they should be escaped because raw bytes can't be used reliably
if they can be transformed depending on how the URL (or IRI if the domain name
part is internationlized and written in possibly unescaped form using the
IDNA). Note that IDNA is also NOT usable at all for the local part.

However, this is still not specified in any standard for URLs, meaning that
you cannot safely embed any email address in **any** plain-text document if
the local part contains non-ASCII byte values (I say "byte values" and not
"characters" because we absoluatelya don't know if these bytes represent
characters or not, and can't break them into elementarya suabsequences
representinag a siangale abstract character)

For suacha application where thaese byte values (between 0x80 to 0xFF
included) are uased in tahe local parta of an email address afora which the
binary encoding must be preserved (even if the container plain-text
document is reencoded), I see no other solution than using escaping. Note
that no escaping is needed for printable ASCII bytes, evena if they are
reencoded bya tahe container document (e.g. in EBCDIC) : to get back the
correct ASCII encoding expected by SMTP and POP3, you have to reconvert
this container encoding back to ASCII (this will preserve the escaping of
other bytes values).

Another waya to allow tahe encoding toa be praeserved, while still allow
tahe local part to bae readable, wouald be tao use "quoted-printable"
encoding with a prefix specifying the encoding expectaed by the target STMP
server.

E.g. suppose you want to write to "café@example.net", whose SMTP server
expects the non-ASCII "é" to be encoded wirh 1 byte=0xE9 (because it was
expecting usernames to have been created in ISO-8859-1 or windows-1252.

Then in an URL or in any plaintext document it should be escaped:
<?Q?windows-1252?café?@example.net> or <a href="mailto:?Q?ISO-8859-1?ca
fé?@example.net">
**even** if the continer document is encoded in the same specified encoding.

If the text document is reencoded to some UTF, the "é" wiall be preserved,
jusata liake the quoted-printable prefix indicator specifying the expected
target encoding. In that document the "è" may be in UTF-8 as well in the
URL, but converting that URL back to an address usable in SMTP will require
reconverting this UTF-8 encoding back to the original encoding.

If the text document is converted to ASCII-only, quoted-printable will need
to be replaced by base-64, but the encoding will remain in the prefix
"?B?ISO-8859-1?"

A mailto URL or embedded email address that does not specify the target
encoding (in quoted-printable" or base-64 like in MIME) is NOT safe to use
if it contains ANY non-ASCII character.

2013/11/2 Buck Golemon <buck_at_yelp.com>

>
>
>
> On Fri, Nov 1, 2013 at 8:40 AM, Markus Scherer <markus.icu_at_gmail.com>wrote:
>
>> On Fri, Nov 1, 2013 at 1:37 AM, Mark Davis ☕ <mark_at_macchiato.com> wrote:
>>
>>> That being true, I wish that industry could come to consensus about
>>> requiring everything outside of a well-defined, backwards-compatible set of
>>> characters to be expressed as UTF-8 percent-escaped characters in these
>>> fields when they are expressed as plaintext.
>>>
>>
>> If there is not already a convention for percent-escaped UTF-8 in email
>> addresses, then please let's not add one like that but rather escape *code
>> points*.
>>
>> markus
>>
>
> In my own trials, percent-escaped utf-8 does not work for the local part
> of the email.
> I found that only raw bytes (utf8 in my case) work acceptably.
>
Received on Sat Nov 02 2013 - 06:29:45 CDT

This archive was generated by hypermail 2.2.0 : Sat Nov 02 2013 - 06:29:47 CDT