RE: Best practice of using regex on identify none-ASCII email address from Shawn Steele on 2013-10-30 (Unicode Mail List Archive)

From: Shawn Steele <Shawn.Steele_at_microsoft.com>
Date: Wed, 30 Oct 2013 22:42:47 +0000

Mixed-script considerations are all supposed to be handled by the mailbox administrator. It's perfectly valid for a domain to assign Latin addresses and also Cyrillic ones. Indeed, for Cyrillic EAI, one would almost certainly require ASCII (i.e., Latin) aliases during whatever the transition period is.

A German mailbox admin may allow only German letters and no other Latin characters in mailbox names. Other admins may want to allow Latin characters alongside other scripts (CJK locales come to mind). And a Russian admin may provide all-Cyrillic mailboxes with all-Latin aliases for those names. (Hopefully that admin is being careful about homographs, but the standards still let the admin make the decisions.)
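To make the "admin decides" point concrete, here is a minimal sketch of a per-domain policy table. The domains, the patterns, and the name local_part_allowed are all invented for illustration; nothing here comes from the EAI RFCs, which leave these choices to each administrator.

```python
import re

# Hypothetical per-domain policies. Each admin picks the characters
# allowed in mailbox names; these patterns are examples only.
POLICIES = {
    # Lowercase German letters only: a-z plus ä ö ü ß, dot-separated.
    "example.de": re.compile(
        r"[a-z\u00e4\u00f6\u00fc\u00df]+(?:\.[a-z\u00e4\u00f6\u00fc\u00df]+)*"
    ),
    # Either all-Latin or all-Cyrillic (а-я plus ё), never a mixture.
    "example.ru": re.compile(r"[a-z]+|[\u0430-\u044f\u0451]+"),
}


def local_part_allowed(domain: str, local_part: str) -> bool:
    """Return True if the domain's own (hypothetical) policy accepts the name."""
    policy = POLICIES.get(domain)
    return bool(policy and policy.fullmatch(local_part))
```

A sender or third-party validator cannot know these tables, which is why only the receiving domain can really say whether a mailbox name is valid.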

The PUA isn't even forbidden (I'm hoping for a pIqaD alias some day).

-Shawn

From: unicode-bounce_at_unicode.org [mailto:unicode-bounce_at_unicode.org] On Behalf Of James Lin
Sent: Wednesday, October 30, 2013 2:58 PM
To: Paweł Dyda
Cc: cldr-users_at_unicode.org; unicode_at_unicode.org
Subject: Re: Best practice of using regex on identify none-ASCII email address

Hi
I am not expecting a single regular expression to handle every possible combination of scripts. What I am probably looking for (which may not be possible, given combined and mixed scripts) is something along the lines of validating each individual script with its own regular expression. I am still wondering whether it is possible to have regular expressions for individual scripts only, with no mixing and matching for the time being, such as (I am being very high-level here):

* Phags-pa scripts

     * Chinese: Traditional/Simplified
     * Mongolian
     * Sanskrit
     * ...

* Kana scripts

     * Japanese: Hiragana/Katakana
     * ...

* Hebrew scripts

     * Yiddish
     * Hebrew
     * Bukhori
     * ...

* Latin scripts

     * English
     * Italian
     * ....

* Hangul scripts

     * Korean

* Cyrillic Scripts

     * Russian
     * Bulgarian
     * Ukrainian
     * ...

By focusing on one script at a time to derive a regular expression, I was wondering whether such validation can be accomplished.
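
A rough sketch of the one-pattern-per-script idea, using Python's standard re module. The character classes are approximations built from Unicode code point ranges, which are not the same thing as the Unicode Script property; the ranges chosen and the name matching_scripts are assumptions made for illustration, and only a handful of scripts are covered.

```python
import re

# Approximate single-script patterns built from code point ranges.
# These are illustrative only: blocks and scripts do not line up exactly,
# and punctuation, digits, and combining marks are deliberately left out.
SCRIPT_PATTERNS = {
    # Basic Latin letters plus Latin-1 and Latin Extended-A/B letters.
    "Latin": re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+"),
    # The Cyrillic block (includes a few non-letter code points).
    "Cyrillic": re.compile(r"[\u0400-\u04FF]+"),
    # Hebrew letters alef through tav.
    "Hebrew": re.compile(r"[\u05D0-\u05EA]+"),
    # Hiragana letters, Katakana letters, and the prolonged sound mark.
    "Kana": re.compile(r"[\u3041-\u3096\u30A1-\u30FA\u30FC]+"),
    # Precomposed Hangul syllables.
    "Hangul": re.compile(r"[\uAC00-\uD7A3]+"),
}


def matching_scripts(local_part: str) -> list:
    """Return the names of the patterns that match the whole local part."""
    return [
        name
        for name, pattern in SCRIPT_PATTERNS.items()
        if pattern.fullmatch(local_part)
    ]
```

A name that mixes scripts matches none of the patterns, so this only covers the no-mix case described above.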

Of course, RFC 3696 covers the email formatting rules, and we can use those rules to validate the format before checking the scripts for validity.
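
As a sketch of the "format first, scripts second" ordering, the function below does only a coarse structural check before any script test would run. It is far from a full implementation of the RFC rules: quoted local parts are not handled, the domain is not examined beyond being non-empty, and the name split_address is made up for this example. The 64-octet limit on the local part is the one given in RFC 3696; counting it over the UTF-8 encoding is an assumption for the non-ASCII case.

```python
def split_address(address: str):
    """Split at the last '@' and apply coarse structural checks.

    Returns (local_part, domain), or None if the basic shape is wrong.
    Quoted local parts are NOT supported in this sketch.
    """
    local_part, sep, domain = address.rpartition("@")
    if not sep or not local_part or not domain:
        return None
    # The local-part limit is counted in octets, so measure the UTF-8 form.
    if len(local_part.encode("utf-8")) > 64:
        return None
    # An unquoted local part may not start or end with a dot,
    # nor contain two dots in a row.
    if local_part.startswith(".") or local_part.endswith(".") or ".." in local_part:
        return None
    return local_part, domain
```

Only the local part returned here would then be handed to whatever script check is in use.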

Warm Regards,
-James Lin

From: Paweł Dyda <pawel.dyda_at_gmail.com>
Date: Wednesday, October 30, 2013 at 2:19 PM
To: James Lin <james_lin_at_symantec.com>
Cc: "cldr-users_at_unicode.org" <cldr-users_at_unicode.org>, Unicode List <unicode_at_unicode.org>
Subject: Re: Best practice of using regex on identify none-ASCII email address

Hi James,
I am not sure if you have seen my email, but I believe regular expressions are not a valid tool for that job (that is, validating the international email address format).

In the internal email I gave one specific example where, to my knowledge, it is (nearly) impossible to use a regular expression to validate an email address.

The reason I gave was the mixed-script scenario.

How can we ensure that we allow a mixture of Hiragana, Katakana and Latin, while disallowing any other combination with Latin (especially Latin + Cyrillic or Latin + Greek)?
I am really curious to know...
And of course there are several single-script attacks (homographs and the like) that we might want to prevent. I don't think that is even remotely possible with regular expressions. Please correct me if I am wrong.
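
For what it is worth, the mixed-script rule is easy to state as a set comparison rather than as a regular expression. The sketch below guesses each letter's script from the first word of its Unicode character name, which is a heuristic and not the real Script property (the prolonged sound mark U+30FC and CJK ideographs, for example, would need extra handling). The allowed combinations and the function names are hypothetical, and single-script homographs are not addressed at all.

```python
import unicodedata

# Hypothetical policy: each set lists scripts that may appear together.
ALLOWED_COMBINATIONS = [
    {"LATIN", "HIRAGANA", "KATAKANA"},
    {"CYRILLIC"},
    {"GREEK"},
]


def scripts_of(text: str) -> set:
    """Approximate the scripts in text from Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():  # ignore digits, dots, and other non-letters
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def combination_allowed(local_part: str) -> bool:
    """True if every script found fits inside one allowed combination."""
    found = scripts_of(local_part)
    return any(found <= allowed for allowed in ALLOWED_COMBINATIONS)
```

With this policy, Latin mixed with kana passes, while Latin mixed with a Cyrillic or Greek look-alike letter is rejected.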
Cheers,
Paweł.

2013/10/30 James Lin <James_Lin_at_symantec.com>
Let me include the unicode alias as well for a wider audience, since this topic has come up a few times in the past.

From: James Lin <james_lin_at_symantec.com>
Date: Wednesday, October 30, 2013 at 1:11 PM
To: "cldr-users_at_unicode.org" <cldr-users_at_unicode.org>
Subject: Best practice of using regex on identify none-ASCII email address

Hi
Does anyone have a best practice or guideline on how to validate non-ASCII email addresses using regular expressions?

I looked through RFC 6531 and the CLDR repository, and neither has a solid example of how to validate a non-ASCII email address.

thanks everyone.
-James
Received on Wed Oct 30 2013 - 17:44:49 CDT
