Re: Best practice of using regex on identify none-ASCII email address from Philippe Verdy on 2013-10-30 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 31 Oct 2013 00:08:09 +0100

You should not attempt to detect scripts, or even assume that they are
encoded based on Unicode, in the user name part; all you can do is to
break at the first "@" to split the address between the user name part and
the domain name, then use the IDN specs to validate the domain name part.
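
As a minimal sketch of that split (not a complete validator), the Python
below breaks at the first "@" and converts the domain with the standard
library's "idna" codec, which implements the older IDNA 2003 rules. The
function names split_address and domain_to_ascii are made up for this
illustration. The address is handled as a str only for brevity; the user
name part is returned untouched and no encoding is assumed for it.

```python
from typing import Tuple


def split_address(address: str) -> Tuple[str, str]:
    """Split at the first "@" into (user name part, domain name part)."""
    user, sep, domain = address.partition("@")
    if not sep or not user or not domain:
        raise ValueError("expected user@domain")
    return user, domain


def domain_to_ascii(domain: str) -> str:
    """Convert an internationalized domain to its ASCII (Punycode) form.

    Assumption: Python's built-in "idna" codec (IDNA 2003) is acceptable;
    stricter IDNA 2008 checks would need a third-party library.
    """
    try:
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError as exc:  # empty label, label too long, bad characters
        raise ValueError(f"invalid domain name: {domain!r}") from exc
    # Domain names are case-insensitive, so lowercasing is safe here.
    return ascii_domain.lower()


user, domain = split_address("josé@bücher.example")
ascii_domain = domain_to_ascii(domain)
```

A quoted user name part that itself contains "@" would need a split at the
last "@" instead; the sketch ignores that case.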

* 1. Domain name part:

You may want to accept only Internet domains (which must contain a dot
before the TLD), and validate the TLD label against a list that excludes
labels reserved for local usage (such as .local or .localnet) or used only
for your own domain. However, I suggest that you validate all these domains
simply by performing an MX request on your DNS server. This could take time
to reply, unless you just check the TLD part, which should be cached most
often, or use the DNS request only for domains that are not in a well-known
list of gTLDs plus all 2-letter ccTLDs that are not in the private-use range
of ISO 3166-1.
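
The TLD pre-check could be sketched roughly as follows. The gTLD set here
is a short illustrative sample, not a complete list, and the names
classify_domain and is_private_use_alpha2 are invented for the example.
The private-use (user-assigned) alpha-2 codes of ISO 3166-1 are AA, QM to
QZ, XA to XZ and ZZ. Treating every other 2-letter label as a ccTLD is the
heuristic described above, not a guarantee that the TLD is delegated.

```python
LOCAL_ONLY_TLDS = frozenset({"local", "localnet"})
# Illustrative sample only; a real list would be much longer.
KNOWN_GTLDS = frozenset({"com", "net", "org", "info", "biz",
                         "edu", "gov", "mil", "int"})


def is_private_use_alpha2(code: str) -> bool:
    """True for ISO 3166-1 user-assigned codes: AA, QM..QZ, XA..XZ, ZZ."""
    code = code.upper()
    if code in ("AA", "ZZ"):
        return True
    if code[0] == "Q" and "M" <= code[1] <= "Z":
        return True
    return code[0] == "X"


def classify_domain(ascii_domain: str) -> str:
    """Decide whether a DNS request is needed for an ASCII domain name."""
    labels = ascii_domain.lower().rstrip(".").split(".")
    if len(labels) < 2 or "" in labels:
        return "not-internet"      # no dot before the TLD, or empty label
    tld = labels[-1]
    if tld in LOCAL_ONLY_TLDS:
        return "local-only"
    if tld in KNOWN_GTLDS:
        return "known-tld"
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        if not is_private_use_alpha2(tld):
            return "known-tld"     # assumed to be a ccTLD
    return "needs-dns"             # fall back to an MX request
```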

Note that to send a mail, you need an MX resolution on DNS to get
the address of a mail server, but this does not mean that the server will be
immediately and constantly reachable: the IP you get may be temporarily
unreachable (due to your ISP or local routing problems, or because the
remote mail server is temporarily offline or overloaded). Performing an MX
request is, however, much faster than trying to send a mail to it, because
MX resolution will use your local DNS server cache and the caches of the
upstream DNS servers of your ISP. You normally don't need to perform
authoritative MX requests, which require a recursive search from the root,
bypassing all caches and defeating the scalability of the DNS system, so it
is not a good policy to do this by default.
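
One way to keep "no MX record" apart from "resolver did not answer in time"
is sketched below. The lookup itself is passed in as a callable so that the
sketch does not depend on any particular DNS library; lookup_mx, its return
shape and its exceptions are assumptions made for this illustration, and a
real implementation would wrap an ordinary (cached, non-authoritative)
resolver query.

```python
from typing import Callable, List, Tuple

# (preference, mail host) pairs, as an MX lookup would return them.
MxRecords = List[Tuple[int, str]]


def mx_status(domain: str, lookup_mx: Callable[[str], MxRecords]) -> str:
    """Classify a domain using a caller-supplied MX lookup.

    Assumed contract of lookup_mx (hypothetical): it returns the MX
    records, raises LookupError if the domain has none, and raises
    TimeoutError if the resolver did not reply in time.
    """
    try:
        records = lookup_mx(domain)
    except TimeoutError:
        # A slow resolver says nothing about the address itself.
        return "retry-later"
    except LookupError:
        return "no-mx"
    # Having an MX record does not prove the server is reachable right now.
    return "has-mx" if records else "no-mx"


def fake_lookup(domain: str) -> MxRecords:
    """Stand-in resolver used only to exercise the sketch."""
    return [(10, "mail.example.com")]


status = mx_status("example.com", fake_lookup)
```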

If you need security, authoritative DNS queries should be replaced by
secure emails based on direct authentication with the mail server at the
start of the SMTP session. Authoritative DNS queries should be performed
only if this authentication fails (in order to bypass incorrect data in DNS
caches), but not automatically (the failure could be caused by problems on
your own site), so hold these unchecked email addresses back in your
database (the problem may be solved without doing anything when your server
retries several minutes or hours later and succeeds in sending the
validation email to your subscribers).

Do not insert in your database any email address coming from a source
that you don't trust to have obtained the approval of the mail address
owner, or that does not follow the same explicit approval policy seen by
that user, or that is not in a domain under your own control; otherwise you
risk being flagged as a spammer and having your site blocked on various mail
servers. You need to send the validation email without any other kind of
advertising in it, apart from your own identity.

Note that instead of a domain, you *may* accept a host name with an IPv4
address (in decimal dotted format), or an IPv6 address (within [brackets],
and in hexadecimal with colons), or some other host name formats for
specific mail/messaging transport protocols you accept, for example
"username@[irc:ircservername:port:channelname]", or "username@{uuid}"
using other punctuation not valid in domain names.
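
A rough sketch of telling these host forms apart, using only the standard
ipaddress module. The function name classify_host and its return labels are
invented for the example. It accepts both a bare dotted IPv4 address, as
described above, and a bracketed one, and it strips the "IPv6:" tag that
RFC 5321 address literals carry; anything else in brackets or braces is
reported as "other-literal" and left to protocol-specific code.

```python
import ipaddress


def classify_host(host: str) -> str:
    """Return "ipv4-literal", "ipv6-literal", "other-literal" or "domain"."""
    if host.startswith("[") and host.endswith("]"):
        inner = host[1:-1]
        if inner[:5].lower() == "ipv6:":   # RFC 5321 style tag
            inner = inner[5:]
        try:
            addr = ipaddress.ip_address(inner)
        except ValueError:
            return "other-literal"         # e.g. [irc:server:port:channel]
        return "ipv6-literal" if addr.version == 6 else "ipv4-literal"
    if host.startswith("{") and host.endswith("}"):
        return "other-literal"             # e.g. {uuid}
    try:
        ipaddress.IPv4Address(host)        # bare dotted decimal
        return "ipv4-literal"
    except ValueError:
        return "domain"                    # validate with the IDN rules
```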

* 2. User name part:

There's no standard encoding there.

- Do not assume any encoding (unless you know the encoding used on each
specific domain!). This part never follows IDNA.
- Every unrestricted byte in the printable 7-bit ASCII range, and all bytes
in 0x80..0xFF, are valid in any sequence.
- Only a few punctuation characters of the ASCII range need to be checked
according to the RFCs.
- Never "canonicalise" user names by forcing the capitalisation (not even
for the basic Latin letters: user names could be encoded with Base-64,
for example, where letter case is significant), even if you can do it for
the domain name part.
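
The points above could be sketched like this, treating the user name part
as raw bytes. This is a simplified illustration, not a full RFC 5322
parser: it handles only unquoted user names, and the names local_part_ok,
normalize and SPECIALS are made up for the example. The rejected ASCII
characters are controls, space, and the "specials" punctuation of RFC 5322;
the dot is allowed except at the ends or doubled.

```python
from typing import Tuple

# ASCII punctuation that may not appear unquoted in a user name part.
SPECIALS = frozenset(b'()<>[]:;@\\,"')


def local_part_ok(local: bytes) -> bool:
    """Check an unquoted user name part without assuming any encoding."""
    if not local:
        return False
    if local.startswith(b".") or local.endswith(b".") or b".." in local:
        return False
    for byte in local:
        if byte >= 0x80:
            continue                  # 0x80..0xFF: accepted in any sequence
        if byte <= 0x20 or byte == 0x7F or byte in SPECIALS:
            return False              # control, space, DEL or special
    return True


def normalize(local: bytes, domain: str) -> Tuple[bytes, str]:
    """Lowercase the domain only; the user name part is kept byte for byte."""
    return local, domain.lower()
```

For example, normalize(b"QWxpY2U", "Example.COM") keeps the Base-64-like
user name unchanged and lowercases only the domain.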

2013/10/30 James Lin <James_Lin_at_symantec.com>

> Hi
> I am not expecting a single regular expression to solve all possible
> combinations of scripts. What I am probably looking for (which may not be
> possible due to combinations of scripts and mixed scripts) is something
> along the lines of having individual scripts validated by a regular
> expression. I am still wondering whether it is possible to have a regular
> expression for individual scripts only, without mixing and matching (for
> the time being), such as (I am being very high level here):
>
> - Phags-pa scripts
>   - Chinese: Traditional/Simplified
>   - Mongolian
>   - Sanskrit
>   - ...
> - Kana scripts
>   - Japanese: Hiragana/Katakana
>   - ...
> - Hebrew scripts
>   - Yiddish
>   - Hebrew
>   - Bukhori
>   - ...
> - Latin scripts
>   - English
>   - Italian
>   - ...
> - Hangul scripts
>   - Korean
> - Cyrillic scripts
>   - Russian
>   - Bulgarian
>   - Ukrainian
>   - ...
>
> By focusing on each script to derive a regular expression, I was
> wondering if such validation can be accomplished here.
>
> Of course, RFC 3696 standardizes all email formatting rules, and we can use
> such rules to validate the format before checking the scripts for validity.
>
> Warm Regards,
> -James Lin
>
>
>
> From: Paweł Dyda <pawel.dyda_at_gmail.com>
> Date: Wednesday, October 30, 2013 at 2:19 PM
> To: James Lin <james_lin_at_symantec.com>
> Cc: "cldr-users_at_unicode.org" <cldr-users_at_unicode.org>, Unicode List <
> unicode_at_unicode.org>
>
> Subject: Re: Best practice of using regex on identify none-ASCII email
> address
>
> Hi James,
>
> I am not sure if you have seen my email, but... I believe Regular
> Expressions are not a valid tool for that job (that is validating Int'l
> email address format).
>
> In the internal email I especially gave one specific example, where to my
> knowledge it is (nearly) impossible to use Regular Expression to validate
> email address.
>
> The reason I gave was mixed-script scenario.
>
> How can we ensure that we allow mixture of Hiragana, Katakana and Latin,
> while basically disallowing any other combinations with Latin (especially
> Latin + Cyrillic or Latin + Greek)?
> I am really curious to know...
>
> And of course there are several single-script (homographs and the like)
> attacks that we might want to prevent. I don't think it is even remotely
> possible with Regular Expressions. Please correct me if I am wrong.
>
> Cheers,
> Paweł.
>
>
> 2013/10/30 James Lin <James_Lin_at_symantec.com>
>
>> Let me include the unicode alias as well for a wider audience, since this
>> topic came up a few times in the past.
>>
>> From: James Lin <james_lin_at_symantec.com>
>> Date: Wednesday, October 30, 2013 at 1:11 PM
>> To: "cldr-users_at_unicode.org" <cldr-users_at_unicode.org>
>> Subject: Best practice of using regex on identify none-ASCII email
>> address
>>
>> Hi
>> Does anyone have a best practice or guideline on how to validate
>> non-ASCII email addresses by using regular expressions?
>>
>> I looked through RFC 6531 and the CLDR repository, and nothing has a solid
>> example of how to validate a non-ASCII email address.
>>
>> thanks everyone.
>> -James
>>
>
>
Received on Wed Oct 30 2013 - 18:10:56 CDT
