This gist is an exploration of how to validate email addresses with regex as close as possible to the specification in RFC 6531.
Here is the pattern:
/^(?<localPart>(?<dotString>[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+(\.[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+)*)|(?<quotedString>"([\x20-\x21\x23-\x5B\x5D-\x7E\u{80}-\u{10FFFF}]|\\[\x20-\x7E])*"))(?<!.{64,})@(?<domainOrAddressLiteral>(?<addressLiteral>\[((?<IPv4>\d{1,3}(\.\d{1,3}){3})|(?<IPv6Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){7})|(?<IPv6Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?)|(?<IPv6v4Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){5}:\d{1,3}(\.\d{1,3}){3})|(?<IPv6v4Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3}:)?\d{1,3}(\.\d{1,3}){3})|(?<generalAddressLiteral>[a-z0-9\-]*[a-z0-9]:[\x21-\x5A\x5E-\x7E]+))\])|(?<Domain>(?!.{256,})(([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z\-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?))(\.([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z\-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?))*))$/iu
The pattern is also posted on regex101.com.
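As a minimal sketch of how the pattern might be used, assuming a JavaScript engine that supports named capture groups, lookbehind assertions, and the u flag (for example, a recent Node.js): the helper below returns the named groups that took part in the match, or null when the string does not match. The names MAILBOX and parseMailbox are hypothetical and only serve this illustration.

```javascript
// The pattern from this gist as a JavaScript regex literal.
// Assumption: in the general address literal, the last character of the tag
// is written as the class [a-z0-9] (Let-dig).
const MAILBOX = /^(?<localPart>(?<dotString>[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+(\.[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+)*)|(?<quotedString>"([\x20-\x21\x23-\x5B\x5D-\x7E\u{80}-\u{10FFFF}]|\\[\x20-\x7E])*"))(?<!.{64,})@(?<domainOrAddressLiteral>(?<addressLiteral>\[((?<IPv4>\d{1,3}(\.\d{1,3}){3})|(?<IPv6Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){7})|(?<IPv6Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?)|(?<IPv6v4Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){5}:\d{1,3}(\.\d{1,3}){3})|(?<IPv6v4Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3}:)?\d{1,3}(\.\d{1,3}){3})|(?<generalAddressLiteral>[a-z0-9\-]*[a-z0-9]:[\x21-\x5A\x5E-\x7E]+))\])|(?<Domain>(?!.{256,})(([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z\-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?))(\.([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z\-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?))*))$/iu;

// Hypothetical helper: returns the named groups that participated in the
// match, or null when the string is not a mailbox according to the pattern.
function parseMailbox(address) {
  const match = MAILBOX.exec(address);
  if (match === null) return null;
  // Named groups on the alternatives that were not taken are undefined.
  return Object.fromEntries(
    Object.entries(match.groups).filter(([, value]) => value !== undefined)
  );
}

console.log(parseMailbox('user%example.com@example.org'));
console.log(parseMailbox('postmaster@[123.123.123.123]'));
console.log(parseMailbox('john..doe@example.org')); // null
```

Returning only the populated groups makes it easy to see which alternative matched, for example dotString versus quotedString, or Domain versus one of the address literal forms.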
RFC 6531 is the most recent specification for email addresses. It extends the prior standard RFC 5321 to add support for "internationalized" email addresses.
Let's first look at the rules according to RFC 5321.
Domain = sub-domain *("." sub-domain)
sub-domain = Let-dig [Ldh-str]
Let-dig = ALPHA / DIGIT
Ldh-str = *( ALPHA / DIGIT / "-" ) Let-dig
address-literal = "[" ( IPv4-address-literal /
IPv6-address-literal /
General-address-literal ) "]"
; See Section 4.1.3
Mailbox = Local-part "@" ( Domain / address-literal )
Local-part = Dot-string / Quoted-string
; MAY be case-sensitive
Dot-string = Atom *("." Atom)
Atom = 1*atext
Quoted-string = DQUOTE *QcontentSMTP DQUOTE
QcontentSMTP = qtextSMTP / quoted-pairSMTP
quoted-pairSMTP = %d92 %d32-126
; i.e., backslash followed by any ASCII
; graphic (including itself) or SPace
qtextSMTP = %d32-33 / %d35-91 / %d93-126
; i.e., within a quoted string, any
; ASCII graphic or space is permitted
; without blackslash-quoting except
; double-quote and the backslash itself.
The options for the middle component of address-literals are defined as follows in RFC 5321, Section 4.1.3:
IPv4-address-literal = Snum 3("." Snum)
IPv6-address-literal = "IPv6:" IPv6-addr
General-address-literal = Standardized-tag ":" 1*dcontent
Standardized-tag = Ldh-str
; Standardized-tag MUST be specified in a
; Standards-Track RFC and registered with IANA
dcontent = %d33-90 / ; Printable US-ASCII
%d94-126 ; excl. "[", "\", "]"
Snum = 1*3DIGIT
; representing a decimal integer
; value in the range 0 through 255
IPv6-addr = IPv6-full / IPv6-comp / IPv6v4-full / IPv6v4-comp
IPv6-hex = 1*4HEXDIG
IPv6-full = IPv6-hex 7(":" IPv6-hex)
IPv6-comp = [IPv6-hex *5(":" IPv6-hex)] "::"
[IPv6-hex *5(":" IPv6-hex)]
; The "::" represents at least 2 16-bit groups of
; zeros. No more than 6 groups in addition to the
; "::" may be present.
IPv6v4-full = IPv6-hex 5(":" IPv6-hex) ":" IPv4-address-literal
IPv6v4-comp = [IPv6-hex *3(":" IPv6-hex)] "::"
[IPv6-hex *3(":" IPv6-hex) ":"]
IPv4-address-literal
; The "::" represents at least 2 16-bit groups of
; zeros. No more than 4 groups in addition to the
; "::" and IPv4-address-literal may be present.
atext above is defined in RFC 5322 as follows.
atext = ALPHA / DIGIT / ; Printable US-ASCII
"!" / "#" / ; characters not including
"$" / "%" / ; specials. Used for atoms.
"&" / "'" /
"*" / "+" /
"-" / "/" /
"=" / "?" /
"^" / "_" /
"`" / "{" /
"|" / "}" /
"~"
However, RFC 6531 makes the following updates to these rules:
sub-domain =/ U-label
; extend the definition of sub-domain in RFC 5321, Section 4.1.2
atext =/ UTF8-non-ascii
; extend the implicit definition of atext in
; RFC 5321, Section 4.1.2, which ultimately points to
; the actual definition in RFC 5322, Section 3.2.3
qtextSMTP =/ UTF8-non-ascii
; extend the definition of qtextSMTP in RFC 5321, Section 4.1.2
U-labels, as defined in RFC 5890, Section 2.3.2.1, have requirements that would be difficult to enforce via a simple regex pattern (such as the sequence of characters being in Normalization Form C). For purposes of this regex, I will assume that they are any sequence of the ASCII characters allowable in sub-domains and non-ASCII Unicode characters. Note that the maximum length of the entire domain is 255 characters per RFC 2181 Section 11.
UTF8-non-ascii is defined as follows in RFC:
UTF-8 characters can be defined in terms of octets using the
following ABNF [RFC5234], taken from [RFC3629]:
UTF8-non-ascii = UTF8-2 / UTF8-3 / UTF8-4
UTF8-2 = <Defined in Section 4 of RFC3629>
UTF8-3 = <Defined in Section 4 of RFC3629>
UTF8-4 = <Defined in Section 4 of RFC3629>
UTF8-2, UTF8-3, and UTF8-4 are defined in RFC 3629 Section 4 as follows:
UTF8-octets = *( UTF8-char )
UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1 = %x00-7F
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
%xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
%xF4 %x80-8F 2( UTF8-tail )
UTF8-tail = %x80-BF
For the regex flavor that we are using, we can treat our characters as Unicode code points and treat any character with a code point of U+0080 or higher as UTF8-non-ascii.
This pattern is quite a monstrosity, so I will break it down and explain how it works. As I break down each level, I will use the marking ▒▒▒ inside capture groups when I omit the capture group contents for brevity.
First, note the flags i and u at the end of the pattern. The i flag means that the pattern is case-insensitive, which makes some of the classes simpler as it eliminates the need to specify both cases. The u flag indicates that characters are parsed as Unicode code points, and it is also necessary in order to use the \u{Hex codepoint} syntax.
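The effect of the two flags can be seen in a few small, standalone checks. This is only an illustrative sketch; the variable names are made up for the example and none of these small patterns appear in the full regex.

```javascript
// Without the u flag, \u{61} is not a code point escape: it is read as the
// letter "u" repeated 61 times.
const withoutU = /^\u{61}$/;
// With the u flag, \u{61} is the code point U+0061, the letter "a".
const withU = /^\u{61}$/u;

console.log(withoutU.test('a'));            // false
console.log(withoutU.test('u'.repeat(61))); // true
console.log(withU.test('a'));               // true

// The i flag lets a single a-z range cover both cases.
const lettersOnly = /^[a-z]+$/i;
console.log(lettersOnly.test('PostMaster')); // true

// With the u flag, the range \u{80}-\u{10FFFF} is matched per code point, so
// a character outside the Basic Multilingual Plane counts as one character.
const nonAscii = /^[\u{80}-\u{10FFFF}]$/u;
console.log(nonAscii.test('\u{1F600}')); // true
console.log(nonAscii.test('a'));         // false
```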
Now, remember the rule Mailbox = Local-part "@" ( Domain / address-literal ) from the RFC specifications. We can reflect this rule in regex as follows.
/^(?<localPart>▒▒▒)(?<!.{64,})@(?<domainOrAddressLiteral>(?<addressLiteral>▒▒▒)|(?<Domain>(?!.{256,})▒▒▒))$/iu
We start with an anchor ^ at the front of the string and end with an anchor $ at the back of the string. This means that the pattern will only match strings that form an email address with no other text. Remove these anchors if you want to find email addresses inside of text rather than validate a string that is supposed to be an email address.
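Next, the pattern has a named capture group (?<localPart>▒▒▒) followed by a negative lookbehind (?<!.{64,}) followed by the character literal @. This means that the pattern will match a local part that appears directly before an at sign, except when the sequence before the at sign is 64 characters or longer. Note that RFC 5321 allows a local part of up to 64 octets, so the lookbehind as written also rejects a local part of exactly 64 characters; (?<!.{65,}) would allow it.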
Similarly, there is another capture group immediately following the at sign for the mail host portion: (?<domainOrAddressLiteral>▒▒▒). The target mail host can be specified either as an address literal or by domain name, so the domainOrAddressLiteral capture group itself contains two nested capture groups, joined by the OR operator |: (?<addressLiteral>▒▒▒) and (?<Domain>(?!.{256,})▒▒▒). The second of these starts with a negative lookahead: (?!.{256,}). Similar to the negative lookbehind for the local part, this prevents matches on domains longer than 255 characters.
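To see the two length checks in isolation, here is a small sketch that keeps the same lookbehind and lookahead but replaces the local part and domain with simplified stand-ins ([a-z]+ and [a-z.]+). The stand-ins and the name lengthLimited are assumptions made for the example only.

```javascript
// Same lookarounds as the full pattern, with simplified stand-ins for the
// local part and the domain so that only the length limits matter.
const lengthLimited =
  /^(?<localPart>[a-z]+)(?<!.{64,})@(?<Domain>(?!.{256,})[a-z.]+)$/iu;

const domain = 'example.org';

// The lookbehind fails as soon as 64 characters precede the at sign.
console.log(lengthLimited.test('a'.repeat(63) + '@' + domain)); // true
console.log(lengthLimited.test('a'.repeat(64) + '@' + domain)); // false

// The lookahead fails as soon as 256 characters follow the at sign.
console.log(lengthLimited.test('user@' + 'a'.repeat(255))); // true
console.log(lengthLimited.test('user@' + 'a'.repeat(256))); // false
```

Because the pattern is anchored with ^, everything before the at sign is the local part, so the lookbehind effectively measures the local part's length in code points.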
The local part is defined in the RFC spec as: Local-part = Dot-string / Quoted-string.
This is reflected in the regex as two nested capture groups joined by the | operator.
(?<localPart>(?<dotString>▒▒▒)|(?<quotedString>▒▒▒))
The important rules for dot strings are:
Dot-string = Atom *("." Atom)
Atom = 1*atext
atext = ALPHA / DIGIT / ; Printable US-ASCII
"!" / "#" / ; characters not including
"$" / "%" / ; specials. Used for atoms.
"&" / "'" /
"*" / "+" /
"-" / "/" /
"=" / "?" /
"^" / "_" /
"`" / "{" /
"|" / "}" /
"~"
atext =/ UTF8-non-ascii
These can be reflected in regex as follows:
(?<dotString>[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+(\.[0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+)*)
In this part of the regex, the subpattern [0-9a-z!#$%&'*+\-\/=?^_`\{|\}~\u{80}-\u{10FFFF}]+ appears twice. This represents the Atom. It is formed with a long character class in [ ] followed by a Kleene plus (+). This pattern will capture any of the characters in the original definition of atext plus the non-ASCII Unicode characters added in the internationalization update.
The second time this subpattern appears, it is inside a capture group with \. at the front of the capture group and with a Kleene star operator (*) attached to the capture group. That means that a period followed by an Atom can repeat any number of times, including zero.
This section will match the local part (everything before the @) of each of the following examples:
- user%example.com@example.org
- user-@example.org
- postmaster@[123.123.123.123]
- медведь@с-балалайкой.рф
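The Dot-string rule can be tried on its own. In this sketch the atext class is kept as a string so that the Atom can be reused in both places where it appears; the constant names and the added ^ and $ anchors are assumptions for the example.

```javascript
// atext as a character class, in string form so it can be reused.
const ATEXT = "[0-9a-z!#$%&'*+\\-\\/=?^_`\\{|\\}~\\u{80}-\\u{10FFFF}]";
// Atom = 1*atext
const ATOM = ATEXT + '+';
// Dot-string = Atom *("." Atom), anchored so the whole string must match.
const DOT_STRING = new RegExp('^' + ATOM + '(\\.' + ATOM + ')*$', 'iu');

for (const sample of ['user%example.com', 'user-', 'медведь', 'john..doe', '.john', 'john.']) {
  console.log(sample, DOT_STRING.test(sample));
}
```

The first three samples match. The last three do not, because every period must have an Atom of at least one character on both sides.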
Here are the relevant rules for quoted strings:
Quoted-string = DQUOTE *QcontentSMTP DQUOTE
QcontentSMTP = qtextSMTP / quoted-pairSMTP
quoted-pairSMTP = %d92 %d32-126
; i.e., backslash followed by any ASCII
; graphic (including itself) or SPace
qtextSMTP = %d32-33 / %d35-91 / %d93-126
qtextSMTP =/ UTF8-non-ascii
These are reflected in the regex as follows:
(?<quotedString>"([\x20-\x21\x23-\x5B\x5D-\x7E\u{80}-\u{10FFFF}]|\\[\x20-\x7E])*")
At the deepest level there are two parts joined by the | operator: [\x20-\x21\x23-\x5B\x5D-\x7E\u{80}-\u{10FFFF}], which represents qtextSMTP, and \\[\x20-\x7E], which represents quoted-pairSMTP. These are both inside a capture group with a Kleene star (*) after it, so that either type of QcontentSMTP can repeat any number of times inside the double quotes.
This section will match the quoted local part (everything before the @) of each of the following examples:
- " "@example.org
- "john..doe"@example.org
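A quick way to check the quoted string rule is to anchor the subpattern and try a few local parts. This is a sketch; the name QUOTED_STRING, the anchors, and the samples beyond the two listed above are my own additions.

```javascript
// Quoted-string = DQUOTE *QcontentSMTP DQUOTE, anchored for testing.
const QUOTED_STRING =
  /^"([\x20-\x21\x23-\x5B\x5D-\x7E\u{80}-\u{10FFFF}]|\\[\x20-\x7E])*"$/u;

const samples = [
  '" "',         // a single space
  '"john..doe"', // consecutive periods are fine inside quotes
  '"a\\"b"',     // the text "a\"b": a backslash-quoted double quote
  '"a"b"',       // an unquoted double quote in the middle
  '"a\\"',       // the text "a\": the last quote belongs to the quoted pair
];

for (const sample of samples) {
  console.log(sample, QUOTED_STRING.test(sample));
}
```

The first three samples match and the last two do not: a bare double quote or backslash is not qtextSMTP, so it can only appear as part of a quoted pair.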
There are multiple types of address literals, but they all appear inside of brackets ([]). This can be reflected by putting escaped bracket characters around a capture group that uses | to separate another level of nested capture groups, one for each type of address literal.
(?<addressLiteral>\[((?<IPv4>▒▒▒)|(?<IPv6Full>▒▒▒)|(?<IPv6Comp>▒▒▒)|(?<IPv6v4Full>▒▒▒)|(?<IPv6v4Comp>▒▒▒)|(?<generalAddressLiteral>▒▒▒))\])
IPv4 literals are made with 4 sequences of 1 to 3 digits, joined by periods. This can be reflected in the regex as follows:
(?<IPv4>\d{1,3}(\.\d{1,3}){3})
This will match the IPv4 address inside the brackets of the following email address.
- postmaster@[123.123.123.123]
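Note that \d{1,3} only checks the number of digits, not the Snum range of 0 through 255 from the RFC comment. If that range matters, one option is a separate check after the regex match. The sketch below assumes such a post-check; the function isIPv4InRange is hypothetical and not part of the pattern.

```javascript
// The IPv4 subpattern from the gist, anchored for testing.
const IPV4 = /^(?<IPv4>\d{1,3}(\.\d{1,3}){3})$/;

// Hypothetical extra check, not part of the gist's pattern: every Snum must
// be a decimal value in the range 0 through 255.
function isIPv4InRange(text) {
  if (!IPV4.test(text)) return false;
  return text.split('.').every((snum) => Number(snum) <= 255);
}

console.log(IPV4.test('123.123.123.123'));      // true
console.log(IPV4.test('999.999.999.999'));      // true, the regex checks digits only
console.log(isIPv4InRange('999.999.999.999'));  // false
console.log(isIPv4InRange('123.123.123.123'));  // true
```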
There are multiple forms of IPv6 literals. The first, an unabbreviated IPv6 address, can be matched by the following regex:
(?<IPv6Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){7})
The second form is an IPv6 address with sections of zeroes abbreviated to :: (per the rule IPv6-comp = [IPv6-hex *5(":" IPv6-hex)] "::" [IPv6-hex *5(":" IPv6-hex)]). This can be matched in regex as follows:
(?<IPv6Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?)
Then there are two forms of IPv6 addresses that end in IPv4 addresses. Those can be matched with the following two patterns.
(?<IPv6v4Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){5}:\d{1,3}(\.\d{1,3}){3})
(?<IPv6v4Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3}:)?\d{1,3}(\.\d{1,3}){3})
These subpatterns will match the contents of the brackets in the following email addresses, one per subpattern in the order given above.
- postmaster@[IPv6:2001:0db8:85a3:0000:0000:8a2e:0370:7334]
- postmaster@[IPv6:2001:0db8:85a3::8a2e:0370:7334]
- postmaster@[IPv6:2001:0db8:85a3:0000:0000:8a2e:123.123.123.123]
- postmaster@[IPv6:2001:0db8:85a3::8a2e:123.123.123.123]
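To check which of the four IPv6 subpatterns accepts a given literal, the alternatives can be joined into one anchored regex and the populated named group reported. This is a sketch; classifyIPv6 and IPV6_LITERAL are hypothetical names introduced for the example.

```javascript
// The four IPv6 subpatterns from the gist, joined with | and anchored.
const IPV6_LITERAL = new RegExp(
  '^(?:' + [
    String.raw`(?<IPv6Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){7})`,
    String.raw`(?<IPv6Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,5})?)`,
    String.raw`(?<IPv6v4Full>IPv6:[0-9a-f]{1,4}(:[0-9a-f]{1,4}){5}:\d{1,3}(\.\d{1,3}){3})`,
    String.raw`(?<IPv6v4Comp>IPv6:([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3})?::([0-9a-f]{1,4}(:[0-9a-f]{1,4}){0,3}:)?\d{1,3}(\.\d{1,3}){3})`,
  ].join('|') + ')$',
  'iu'
);

// Hypothetical helper: returns the name of the subpattern that matched the
// literal, or null when none of them matches.
function classifyIPv6(literal) {
  const match = IPV6_LITERAL.exec(literal);
  if (match === null) return null;
  const hit = Object.entries(match.groups).find(([, value]) => value !== undefined);
  return hit[0];
}

console.log(classifyIPv6('IPv6:2001:0db8:85a3:0000:0000:8a2e:0370:7334'));       // IPv6Full
console.log(classifyIPv6('IPv6:2001:0db8:85a3::8a2e:0370:7334'));                // IPv6Comp
console.log(classifyIPv6('IPv6:2001:0db8:85a3:0000:0000:8a2e:123.123.123.123')); // IPv6v4Full
console.log(classifyIPv6('IPv6:2001:0db8:85a3::8a2e:123.123.123.123'));          // IPv6v4Comp
```

One thing this makes visible: the subpatterns limit the number of groups on each side of :: separately, so the combined limit from the RFC comments ("No more than 6 groups in addition to the ::") is not enforced. For example, IPv6:1:2:3:4:5:6::1:2:3:4:5:6 is classified as IPv6Comp.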
General address literals are defined in the email RFC specifications as follows:
General-address-literal = Standardized-tag ":" 1*dcontent
Standardized-tag = Ldh-str
; Standardized-tag MUST be specified in a
; Standards-Track RFC and registered with IANA
dcontent = %d33-90 / ; Printable US-ASCII
%d94-126 ; excl. "[", "\", "]"
Let-dig = ALPHA / DIGIT
Ldh-str = *( ALPHA / DIGIT / "-" ) Let-dig
These grammatical rules can be reflected in regex as follows:
(?<generalAddressLiteral>[a-z0-9\-]*[a-z0-9]:[\x21-\x5A\x5E-\x7E]+)
Note that this does not enforce the constraint that "Standardized-tag MUST be specified in a Standards-Track RFC and registered with IANA".
This subpattern would match the abc:a portion of the following, even though the tag is not standardized or recognized by IANA.
- postmaster@[abc:a]
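If the registration constraint matters, one possible approach is to capture the tag and compare it against a list supplied by the caller. The sketch below assumes that approach. The tag and content group names, the function isRegisteredLiteral, and the contents of the allow-list are all hypothetical; the real set of registered tags would have to come from IANA.

```javascript
// General-address-literal = Standardized-tag ":" 1*dcontent
// Assumptions: the tag's last character uses the class [a-z0-9] (Let-dig),
// and the named groups "tag" and "content" are added for this example.
const GENERAL = /^(?<tag>[a-z0-9\-]*[a-z0-9]):(?<content>[\x21-\x5A\x5E-\x7E]+)$/iu;

// Hypothetical check against a caller-supplied set of lower-case tags.
function isRegisteredLiteral(literal, registeredTags) {
  const match = GENERAL.exec(literal);
  if (match === null) return false;
  return registeredTags.has(match.groups.tag.toLowerCase());
}

// Example allow-list; its contents are an assumption for illustration.
const tags = new Set(['ipv6']);

console.log(GENERAL.test('abc:a'));                           // true
console.log(isRegisteredLiteral('abc:a', tags));              // false
console.log(isRegisteredLiteral('IPv6:2001:db8::1', tags));   // true
```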
The rules for domains are:
Domain = sub-domain *("." sub-domain)
sub-domain = Let-dig [Ldh-str]
Let-dig = ALPHA / DIGIT
Ldh-str = *( ALPHA / DIGIT / "-" ) Let-dig
As I stated above, I will treat internationalization as allowing non-ASCII Unicode characters as extra alternatives where ALPHA and DIGIT appear above.
(?<Domain>(?!.{256,})(([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z\-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?))(\.([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z\-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?))*)
This pattern contains the subpattern ([0-9a-z\u{80}-\u{10FFFF}]([0-9a-z\-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?) twice. This subpattern represents the sub-domain from the RFC rules. Inside this subpattern, the class [0-9a-z\u{80}-\u{10FFFF}] matches any ASCII alphanumeric character or non-ASCII Unicode character.
This subpattern would match the domain (everything after the @) of each of the following email addresses.
- admin@mailserver1
- user%example.com@example.org
- медведь@с-балалайкой.рф
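The domain rule can also be tested on its own by building it from a single sub-domain string. This is a sketch; LABEL and DOMAIN are names chosen for the example, and the ^ and $ anchors are added so that the whole string must be a domain.

```javascript
// sub-domain: starts and ends with a letter, digit, or non-ASCII character,
// with hyphens allowed only in between.
const LABEL = String.raw`[0-9a-z\u{80}-\u{10FFFF}]([0-9a-z\-\u{80}-\u{10FFFF}]*[0-9a-z\u{80}-\u{10FFFF}])?`;
// Domain = sub-domain *("." sub-domain), limited to 255 characters.
const DOMAIN = new RegExp(String.raw`^(?!.{256,})${LABEL}(\.${LABEL})*$`, 'iu');

const samples = [
  'mailserver1', 'example.org', 'с-балалайкой.рф',
  '-example.org', 'example-.org', 'example..org',
];
for (const sample of samples) {
  console.log(sample, DOMAIN.test(sample));
}
```

The first three samples match. The last three fail because a sub-domain may not start or end with a hyphen and may not be empty.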
This regex is the most complex regex that I have ever written, and I have yet to use it in running code. I developed this regex mostly as an exercise to practice reading RFCs and following them as closely as possible. That said, I have tested this pattern on the examples from the Wikipedia article on email addresses. See my regex101.com post for those test results.
I am somewhat concerned that the complexity of this regex pattern could cause the parsing engine to slow down. It may be better to use a simpler regex like /(".+"|\S+)@\S+/ui.
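For comparison, here is a short sketch of what the simpler pattern accepts. It is much more permissive: it is not anchored and does not apply the Dot-string rules. The name SIMPLE is used only for this example.

```javascript
// The simpler alternative mentioned above.
const SIMPLE = /(".+"|\S+)@\S+/ui;

console.log(SIMPLE.test('user@example.org'));          // true
console.log(SIMPLE.test('john..doe@example.org'));     // true, although john..doe is not a valid Dot-string
console.log(SIMPLE.test('mail user@example.org now')); // true, because the pattern is not anchored
console.log(SIMPLE.test('no-at-sign.example.org'));    // false
```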
If you do use this regex in a project, please let me know how it performs for you.
This was written by Brian Baker. Feel free to use the pattern or subpatterns in this gist. If you have any comments, please leave them below.
There is a bug in the expression. The fragments "+-/" are interpreted as "a character between '+' and '/'". The dot '.' is between those two characters in ASCII, so the regex validates '[email protected]' as valid, but that's not an RFC 6531 valid email. The hyphen must be escaped. Also, for clarity, I recommend escaping the hyphen in other places: "a-z0-9-" and "0-9a-z-".