RE: Best practice of using regex on identify none-ASCII email address from Shawn Steele on 2013-10-30 (Unicode Mail List Archive)

From: Shawn Steele <Shawn.Steele_at_microsoft.com>
Date: Wed, 30 Oct 2013 23:23:38 +0000

For EAI (the question being asked), the entire address, local part and domain, are encoded in UTF-8.
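
As a small illustration of that point, here is a minimal sketch (the function name is made up for this example) that checks whether the raw bytes of an address are well-formed UTF-8 as a whole. It says nothing about whether the address is otherwise valid.

```python
def is_well_formed_utf8(raw: bytes) -> bool:
    """Return True if the whole address (local part and domain) decodes as UTF-8."""
    try:
        raw.decode("utf-8")  # strict decoding: any malformed sequence raises
    except UnicodeDecodeError:
        return False
    return True


# A fully internationalized address: both sides of "@" are UTF-8.
sample = "用户@例子.测试".encode("utf-8")
print(is_well_formed_utf8(sample))
print(is_well_formed_utf8(b"\xff\xfe@example.org"))
```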

-Shawn

From: unicode-bounce_at_unicode.org [mailto:unicode-bounce_at_unicode.org] On Behalf Of Philippe Verdy
Sent: Wednesday, October 30, 2013 4:08 PM
To: James Lin
Cc: Paweł Dyda; cldr-users_at_unicode.org; unicode_at_unicode.org
Subject: Re: Best practice of using regex on identify none-ASCII email address

You should not attempt to detect scripts, or even assume that they are encoded based on Unicode, in the user name part; all you can do is break at the first "@" to split the address into the user name part and the domain name, then use the IDN specs to validate the domain name part.
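
A minimal sketch of that split, assuming Python and its built-in "idna" codec (which implements the older IDNA 2003 rules, so treat it as an approximation of "the IDN specs"). The function name is made up for this example. It breaks at the first "@" as described above; note that some implementations split at the last "@" instead, because a quoted user name may itself contain "@".

```python
def split_and_encode(address: str):
    """Split at the first "@" and convert the domain to its ASCII (A-label) form."""
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError("expected something of the form user@domain")
    try:
        # The stdlib codec applies ToASCII label by label (IDNA 2003).
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise ValueError("domain part is not a valid IDN: %r" % domain) from exc
    # The user name part is returned untouched: no script detection, no folding.
    return local, ascii_domain


print(split_and_encode("user@bücher.example"))
```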

* 1. Domain name part:

You may want to restrict to Internet domains only (which must contain a dot before the TLD), and validate the TLD label against a list that excludes TLDs reserved for local usage (such as .local or .localnet) or for your own domain. However, I suggest that you validate all these domains simply by performing an MX request on your DNS server. This could take time to reply, unless you check only the TLD part, which is most often cached, or use the DNS request only for domains not in a well-known list of gTLDs plus all 2-letter ccTLDs that are not in the private-use range of ISO 3166-1.
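
The pre-filter described above could be sketched roughly as follows. This is only an illustration: the function names and the three-entry gTLD set are placeholders, not a real registry list, and the private-use test assumes the ISO 3166-1 user-assigned alpha-2 codes AA, QM to QZ, XA to XZ and ZZ. The actual MX request is left out; the sketch only decides whether one is needed.

```python
LOCAL_ONLY_TLDS = {"local", "localnet"}   # the examples named in the message
KNOWN_GTLDS = {"com", "org", "net"}       # illustrative subset only (assumption)


def is_private_use_alpha2(code: str) -> bool:
    """True for the user-assigned ISO 3166-1 alpha-2 codes."""
    code = code.upper()
    return code in {"AA", "ZZ"} or "QM" <= code <= "QZ" or "XA" <= code <= "XZ"


def domain_precheck(domain: str) -> str:
    """Return "reject", "known-tld" (no DNS query needed) or "needs-dns"."""
    labels = domain.lower().rstrip(".").split(".")
    if len(labels) < 2 or any(not label for label in labels):
        return "reject"                   # no dot before the TLD, or an empty label
    tld = labels[-1]
    if tld in LOCAL_ONLY_TLDS:
        return "reject"
    if tld in KNOWN_GTLDS:
        return "known-tld"
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        if not is_private_use_alpha2(tld):
            return "known-tld"            # treated as a ccTLD, per the message
    return "needs-dns"                    # fall back to an MX request


for name in ("example.com", "printer.local", "example.de", "example.zz"):
    print(name, domain_precheck(name))
```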

Note that to send a mail, you need an MX resolution on DNS to get the address of a mail server, but this does not mean the server will be immediately and constantly reachable: the IP you get may be temporarily unreachable (due to your ISP or local routing problems, or because the remote mail server is temporarily offline or overloaded). Performing an MX request is however much faster than trying to send a mail, because MX resolution will use your local DNS server cache and the caches of the upstream DNS servers of your ISP. You normally don't need to perform authoritative MX requests, which require a recursive search from the root, bypassing all caches and the scalability of the DNS system, so it's not a good policy to do it by default.

If you need security, authoritative DNS queries should be replaced by secure emails based on direct authentication with the mail server at the start of the SMTP session. Authoritative DNS queries should be performed only if this authentication fails (in order to bypass incorrect data in DNS caches), but not automatically (the failure could be caused by problems on your own site), so keep these unchecked email addresses pending in your database: the problem may resolve itself without any action when your server retries several minutes or hours later and succeeds in sending the validation email to your subscribers.

Do not insert into your database any email address coming from a source you don't trust to have obtained the approval of the mail address owner, or that does not follow the same explicit approval policy seen by that user, or that is not in a domain under your own control; otherwise you risk being flagged as a spammer and having your site blocked on various mail servers. You need to send the validation email without any other kind of advertising, except your own identity.

Note that instead of a domain, you *may* accept a host name with an IPv4 address (in decimal dotted format), or an IPv6 address (within [brackets], and in hexadecimal with colons), or some other host name formats for specific mail/messaging transport protocols you accept, for example "username@[irc:ircservernname:port:channelname]", or "username@{uuid}" using other punctuation not valid in domain names.
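
To illustrate, here is a rough sketch that tells these host forms apart using Python's standard "ipaddress" module. The function name and the returned labels are made up for this example. Accepting an "IPv6:" tag inside the brackets is an addition taken from the SMTP address-literal syntax (RFC 5321), not from the text above, and the other bracketed or braced forms are only recognised, not validated.

```python
import ipaddress


def classify_host(host: str) -> str:
    """Classify the part after "@" as an IP literal, another literal, or a domain."""
    if host.startswith("[") and host.endswith("]"):
        inner = host[1:-1]
        if inner.lower().startswith("ipv6:"):   # RFC 5321 style tag (assumption)
            inner = inner[5:]
        try:
            addr = ipaddress.ip_address(inner)
        except ValueError:
            return "other-literal"              # e.g. [irc:...], protocol specific
        return "ipv6-literal" if addr.version == 6 else "ipv4-literal"
    if host.startswith("{") and host.endswith("}"):
        return "other-literal"                  # e.g. {uuid}
    try:
        ipaddress.IPv4Address(host)             # bare dotted decimal
        return "ipv4-literal"
    except ValueError:
        return "domain"                         # validate with the IDN rules instead


for h in ("192.0.2.1", "[2001:db8::1]", "[irc:ircservernname:port:channelname]", "example.org"):
    print(h, classify_host(h))
```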

* 2. User name part:

There's no standard encoding there.

- Do not assume any encoding (unless you know the encoding used on each specific domain !). This part never obeys the IDNA.
- Every unrestricted byte in the printable 7-bit ASCII range, and all bytes in 0x80..0xFF are valid in any sequence.
- Only a few punctuation characters of the ASCII range need to be checked according to the RFCs.
- Never "canonicalise" user names by forcing the capitalisation (not even for the basic Latin letters : user names could be encoded with Base-64 for example where letter case is significant), even if you can do it for the domain name part.
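
The rules in this list could be sketched as follows, working on raw bytes so that no encoding is assumed for the user name part. Both function names are made up for this example. Excluding the space character (0x20) is an assumption of the sketch, and the RFC punctuation checks mentioned above are deliberately left out rather than guessed.

```python
def local_part_bytes_ok(local: bytes) -> bool:
    """Accept printable ASCII without space (0x21..0x7E) and any byte 0x80..0xFF."""
    return bool(local) and all(0x21 <= b <= 0x7E or b >= 0x80 for b in local)


def normalize_address(raw: bytes) -> bytes:
    """Lowercase only the domain; leave the user name bytes exactly as received."""
    local, sep, domain = raw.partition(b"@")
    if not sep or not domain or not local_part_bytes_ok(local):
        raise ValueError("expected user@domain with an acceptable user name part")
    # bytes.lower() only touches ASCII letters, so non-ASCII domain bytes survive.
    return local + b"@" + domain.lower()


# A Base-64 looking user name keeps its letter case; the domain is folded.
print(normalize_address(b"QmFzZTY0@Example.ORG"))
```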

2013/10/30 James Lin <James_Lin_at_symantec.com>
Hi
I am not expecting a single regular expression to solve all possible combinations of scripts. What I am looking for (which may not be possible due to combinations of scripts and mixed scripts) is something along the lines of having each individual script validated by its own regular expression. I am still wondering whether it is possible to have regular expressions for individual scripts only, and not mix-and-match (for the time being), such as (I am being very high level here):

  * Phags-pa scripts

     * Chinese: Traditional/Simplified
     * Mongolian
     * Sanskrit
     * ...

  * Kana scripts

     * Japanese: Hiragana/Katakana
     * ...

  * Hebrew scripts

     * Yiddish
     * Hebrew
     * Bukhori
     * …

  * Latin scripts

     * English
     * Italian
     * ...

  * Hangul scripts

     * Korean

  * Cyrillic Scripts

     * Russian
     * Bulgarian
     * Ukrainian
     * ...
By focusing on each script to derive a regular expression, I was wondering whether such validation can be accomplished.

Of course, RFC3696 standardizes all email formatting rules, and we can use those rules to validate the format before checking the scripts for validity.

Warm Regards,
-James Lin

From: Paweł Dyda <pawel.dyda_at_gmail.com>
Date: Wednesday, October 30, 2013 at 2:19 PM
To: James Lin <james_lin_at_symantec.com>
Cc: "cldr-users_at_unicode.org" <cldr-users_at_unicode.org>, Unicode List <unicode_at_unicode.org>

Subject: Re: Best practice of using regex on identify none-ASCII email address

Hi James,
I am not sure if you have seen my email, but... I believe Regular Expressions are not a valid tool for that job (that is validating Int'l email address format).

In the internal email I especially gave one specific example, where to my knowledge it is (nearly) impossible to use Regular Expression to validate email address.

The reason I gave was the mixed-script scenario.

How can we ensure that we allow a mixture of Hiragana, Katakana and Latin, while basically disallowing any other combinations with Latin (especially Latin + Cyrillic or Latin + Greek)?
I am really curious to know...
And of course there are several single-script (homographs and the like) attacks that we might want to prevent. I don't think it is even remotely possible with Regular Expressions. Please correct me if I am wrong.
Cheers,
Paweł.

2013/10/30 James Lin <James_Lin_at_symantec.com>
Let me include the unicode alias as well for a wider audience, since this topic has come up a few times in the past.

From: James Lin <james_lin_at_symantec.com>
Date: Wednesday, October 30, 2013 at 1:11 PM
To: "cldr-users_at_unicode.org" <cldr-users_at_unicode.org>
Subject: Best practice of using regex on identify none-ASCII email address

Hi
Does anyone have a best practice or guideline on how to validate non-ASCII email addresses using regular expressions?

I looked through RFC6531 and the CLDR repository, and nothing has a solid example of how to validate non-ASCII email addresses.

thanks everyone.
-James

Received on Wed Oct 30 2013 - 18:25:33 CDT
