This concept has been covered on Hacker News so many times before. :(
During a similar conversation 70 days ago, I left a detailed comment that is also relevant now regarding how RFC822 is actually totally irrelevant for the concept of e-mail addresses: what it specifies is how to escape the field values in MIME headers, and thereby has a bunch of rules for how to format an e-mail address that are really "how to embed an e-mail address in a MIME document".
RFC821, the SMTP specification for how you actually send e-mail, is closer, but has different rules about what is allowed because SMTP isn't MIME. A couple things aren't allowed, and some other things now are allowed and don't need to be escaped. Why people think users should type e-mail addresses in RFC822 escaping and not RFC821 escaping makes no sense to me.
However, the real punchline is: why are you asking users to enter e-mail addresses escaped at all? If you have an HTML form, for example, you don't need to escape them, as there is no higher-level protocol in which they are being embedded: the box can contain any characters that are needed, and there are no concepts like MIME comments, etc.
Asking a user to escape their e-mail address in that box is as silly as asking them to escape their username or password according to HTML or URL or some other escaping rules. Or, imagine if they had to enter their full name, but escaped using MIME encoded words... =?iso-8859-1?Q?=A1Hola,_se=F1or!?= makes about as much sense as escaping your e-mail address.
My original comment, which contains many more details about which specific RFCs are involved and what they mean, along with specific examples where things can get different, and a discussion of the context, is here: http://news.ycombinator.com/item?id=4486872
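For anyone who hasn't met MIME encoded words, here is a small Python sketch (standard library only; the variable names are mine) that decodes the example above. It is only meant to show that the escaped form is a transport encoding for headers, not something a person should ever have to type.

```python
from email.header import decode_header, make_header

# The encoded word quoted above: charset iso-8859-1, "Q" encoding.
encoded = "=?iso-8859-1?Q?=A1Hola,_se=F1or!?="

# decode_header splits the value into (bytes, charset) chunks;
# make_header reassembles them into a Header that str() renders as text.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```

The same idea applies to e-mail addresses: the form field should hold the plain value, and escaping is the job of whatever code later embeds it in a header.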
As others have said, the only way to know if an email address is valid is to try and send an email. This doesn't mean that this is useless, as you may want to get users to double-check their input if it doesn't pass this.
That library is very good for telling you exactly what's wrong with an email address, but looking through the logic, you can see that handling all the edge cases involves significant effort.
It seems to me that this monomaniacal focus on RFC 822 / RFC 5??? compliance is missing the point a bit. That stuff is important if you're writing a mail client or a server, not so much for a signup form or web site use.
I am going to stick my neck out and say that for the purposes of a signup form, emails should be server-side validated as: anyCharButNullOrAt@valid-looking-domain.com
1. You split the email address into a local part and a domain, at the @.
2. The local part is allowed to contain anything but null, or @. This will keep all the people who go john.doe+furrylist@gmail.com happy.
3. The domain section is validated against length and dns charset (letters, numbers, hyphens, dots) and then checked against e.g. the mozilla public suffix list for a valid TLD. This prevents admin@[10.0.0.2] from receiving signup emails.
4. After that you can punt to your mailer, as long as you do not give any feedback to the users (aside from receipt or non-receipt of email) on the success of delivery or verification (to prevent spam / bruting). Just say "an email has been sent", or somesuch.
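Steps 1 to 3 could be sketched roughly like this in Python. The function name and `SAMPLE_SUFFIXES` are made up for illustration; the tiny suffix set stands in for the real public suffix list, and rejecting an empty local part and enforcing label lengths are my own assumptions, not part of the rules above.

```python
# Hypothetical stand-in for the Mozilla public suffix list; a real check
# would load the full list rather than this tiny sample.
SAMPLE_SUFFIXES = {"com", "org", "net", "co.uk"}

# Letters, numbers, hyphens and dots, as in step 3.
DNS_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789-.")


def looks_deliverable(address: str) -> bool:
    # Step 1: split at the @; more than one @ is rejected outright.
    if address.count("@") != 1:
        return False
    local, domain = address.split("@")

    # Step 2: the local part may contain anything but NUL or @
    # (assumption: it must also be non-empty).
    if not local or "\x00" in local:
        return False

    # Step 3a: length and DNS charset. This already rejects IP literals
    # such as [10.0.0.2], because brackets are not DNS characters.
    domain = domain.lower()
    if not domain or len(domain) > 253 or not set(domain) <= DNS_CHARS:
        return False
    labels = domain.split(".")
    for label in labels:
        if not label or len(label) > 63:
            return False
        if label.startswith("-") or label.endswith("-"):
            return False

    # Step 3b: require at least one label in front of a known suffix,
    # so a bare "com" or "localhost" does not pass.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in SAMPLE_SUFFIXES:
            return True
    return False
```

With this sample list, john.doe+furrylist@gmail.com passes, while admin@[10.0.0.2] and "foo@bar"@bar.com do not. Step 4 is deliberately not in the code: it is about what you tell the user, not about the check.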
Anyone with two @s in their email address ("foo@bar"@bar.com) or (foo\@bar@bar.com) is evil, and should not be encouraged. You have to draw the line somewhere, and that's where I draw it. One reason these people are not worth spending time on is because many common clients such as gmail won't let you send to such addresses in any case. Try it...
Let's thank our stars we don't have to handle Unicode addresses yet...
It performs regex validation, and if that passes, tries to get the SMTP server to validate the user's presence. Quite useful for when a user might fat-finger their username in the address.
So, for compliant mail servers, sending an email without verifying receipt via some confirmation token is no more reliable than this method (if it will falsely validate the user, it will probably falsely digest the message as well).
Be careful as validating email addresses like that will quickly get you grey or blacklisted by various spam cops. The method works well for a few addresses, but you can't use it to validate thousands of addresses at the same time. I learnt that the hard way. :)
@work we have done a lot of this kind of validation using a custom Tcl script, and luckily we have avoided being blacklisted so far. Maybe that's because the validation requests come from an IP address that is listed as an MX for the domain of the sender's email address.
Anyway, not all SMTP servers are RFC compliant: some respond with a 2XX code to a "RCPT TO" even if the user/mailbox doesn't exist. The same applies if the SMTP server is acting as a relay and has no immediate access to the delivering system.
Could you expand on this? Why will it get you blacklisted? What kind of situation are we talking about? It would be pretty annoying to get blacklisted.
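To make the RCPT TO caveat concrete, here is a rough Python sketch using the standard `smtplib`. The function names and verdict strings are hypothetical, and this is not the Tcl script or the library mentioned in this thread. The point is that a 2XX reply can only be read as "not rejected".

```python
import smtplib


def classify_rcpt_code(code: int) -> str:
    """Map an SMTP reply code for RCPT TO onto a cautious verdict."""
    if 200 <= code < 300:
        # Accept-all servers and relays answer 2XX for any mailbox,
        # so this means "not rejected", never "the user exists".
        return "not rejected"
    if 400 <= code < 500:
        return "try later"  # temporary failure, e.g. greylisting
    if 500 <= code < 600:
        return "rejected"
    return "unknown"


def probe_mailbox(mx_host: str, sender: str, recipient: str,
                  timeout: float = 10.0) -> str:
    """Ask mx_host whether it would accept mail for recipient.

    No message is sent: the session ends after RCPT TO.
    """
    with smtplib.SMTP(mx_host, 25, timeout=timeout) as server:
        server.helo()
        server.mail(sender)
        code, _message = server.rcpt(recipient)
    return classify_rcpt_code(code)
```

Given the blacklisting warning above, something like `probe_mailbox` is only reasonable for the occasional address, not for checking a list in bulk.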
I imagine a spammer could generate plausible emails and then check them against SMTP servers to discover the valid ones if they didn't get blacklisted.
I check if there's indeed a mail server in DNS to avoid the hassle of waiting for potentially slow SMTP servers to respond - for most cases, it works and I end up catching bogus email addresses from the domain names themselves.
Note that that library can also check for an MX server. It goes Valid string -> MX exists -> User exists (you can choose to only check for MX, or only for valid string).
Sending an email is the only way to know whether an address is working for a user. Validating whether it meets the standards that define how email addresses should look is a different problem, which is what the regex is going after.
RFC 822 (RFC 5322 in its more recent incarnation) refers to the From: header in the email; RFC 5321 refers to the address used in the 'MAIL FROM:' (and 'RCPT TO:') SMTP commands.
I am one. The regexp performs better than the parser, and works in my existing perl code, and has a lot of visibility now, so it's a no brainer for me to use it.
I'm building a simple offline webapp for collecting email addresses at an event on an iPad with no internet connection.
I'm using a simple regex (may use this one instead) to validate each email before sticking it in localStorage for later retrieval... if not with a regex, how should I validate the email addresses?
If you really feel you need some kind of validation, have two email fields so the user can enter it twice and double check it themselves. No matter how much effort you put into it, and how complicated your validation code is, if the users want to mess with you, they can always just enter a valid fake address, so there's no point wasting a lot of energy on it because it's easy to defeat anyway.
And since it's impossible to validate addresses with a regular expression, there's a small chance you'll reject a valid address and look dumb.
I see the two fields thing a lot. It's pointless since I can copy and paste into the second field. Unless they use JavaScript to disable pasting, which is obnoxious.
They hint that they use this huge regex instead of a parser for performance reasons. At any rate, the regex was not written by hand; it is a concatenation of simpler, easier to understand regexes.
Why not save everything they enter and then validate later? I'd rather get bad email addresses while letting everything in than lose valid email addresses but block bad input.
Same here, and personally I don't see the justification for spending all those CPU cycles going through a massive regular expression such as this one.
I'd rather put this on the client side (javascript), as a validation to make sure the user doesn't supply an invalid e-mail address by accident (i.e. for his own convenience and nothing else).
compared to pushing the response back out to the client, the cost of matching against that regex is going to be insignificant, even with it being as monstrous as it is.
(note that I'm not saying using that regex is a good idea!)
actually you may want to make sure they have at least four characters separated by a dot, e.g. .\@\\.[..]+ ... and i think this is how the regex begins ...
my point though is that you can't send mail to a TLD, you need a domain name. and i don't think we have any one character TLDs.
this is quickly turning into an exercise where you see how such a regex starts to happen. "well, then you have to consider this case ... and handle these exceptions ... and then enforce this ..."
Fair call, I'm all for fewer arbitrary rules. Especially if it's less code.
I still consider the "oh, but it's valid to have dotless on RHS!" to be one of those facts which is true, but irrelevant.
Those three hypothetical users can't receive email sent from most major web providers (e.g. gmail, who don't allow dotless To:), can't sign up to most web sites (who get their validation wrong), and are at the mercy of pitiless local dns resolver rules (pope@va will go to pope@va.com for US users, a lot of the time).
Not only is it possible, when I used to work for a company that administered a TLD, I did just that, sending and receiving email with the address t@TLD.
the goal of client-side validation is to ensure that you can actually make that network call to do a real validation. the rfc is so complicated it's not even worth getting into this business, as evidenced by op's regex.
And with the new personalized TLDs, wouldn't you be able to have something like ceo@nike? I just check for an @ and at least a character after and before it.
Personally, that pisses me off, as it requires that I fully qualify all my local email addresses: what happens if I have the hostname 'nike' on my local net?
I found your comment fairly cryptic. I had a fun twenty minutes trying to work out what you meant, and under what circumstances dotless RHS in email addresses might be legal.
I suppose from the RFC, sure the spec doesn't require dots.
For example, I can use http://mythic-beasts.com/~pdw/cgi-bin/emailvalidate and verify that sure, '1@2!3!4' is a valid RFC822 email address. But I think e.g. UUCP-style addresses are a pathological case, and we don't _really_ want users signing up with them.
Another option would be intranets, e.g. 'baker@internal', but again I think that's being a bit pedantic, since most people on HN are writing webapps for the public Internet, not mail clients.
So can we get an email with foo@<some-dotless-string> routed across the public Internet? Even a bounce would do :)
You might be able to do a riff on xyzzy123@[23.55.211.36] (e.g. xyzzy123@[389534500] or xyzzy123@[1737D324]). However, do you _really_ want your users to specify these?
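In case the two riffs look arbitrary: they are the same IPv4 address written as one 32-bit number, in decimal and in hex. A quick Python check with the standard `ipaddress` module (variable names are mine):

```python
import ipaddress

addr = ipaddress.IPv4Address("23.55.211.36")

# 23*2**24 + 55*2**16 + 211*2**8 + 36
as_decimal = int(addr)

# The same 32-bit value as eight hex digits: 17 37 D3 24.
as_hex = format(as_decimal, "08X")

print(as_decimal, as_hex)

# Going the other way: an integer converts back to dotted-quad form.
assert ipaddress.IPv4Address(as_decimal) == addr
```

Whether any given mail server accepts the non-dotted forms inside the brackets is a separate question that this does not answer.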
There are mx records for existing TLDs (e.g. com, org, au, mx) - but all the mx records I tried refused connections on port 25. So no mail for 'xyzzy123@com' :(
So gTLDs are another option, and there was a time when it looked like xyzzy123@xyzzycorp might route (as long as it didn't collide with anything on the local resolver's search list). But it seems that dotless use of gTLDs is seriously deprecated at this point, and that ICANN will treat it as a TOS violation: http://domainincite.com/10254-why-domain-names-need-punctuat...
Basically, ICANN's conclusion was that dotless TLDs are a terrible idea for many technical reasons.
I looked into IDNs too, but of course due to the way DNS works, you can't really get around the dots.
So the conclusion of all this is that:
1) Using an RFC822 regex is a terrible way to check emails. The things it thinks are valid are MUCH wider than what you actually want.
2) You should probably check the RHS against a public suffix list if you are e.g. accepting a user email address on a signup page. If you accept dotless TLDs or other constructions (e.g. ips on RHS) there is some (low, but nonzero) risk that a malicious user could cause your systems to route mail to your other systems internally.
Email validation is indeed a complex and occasionally surprising beast.
Clearly this regex is impractical, but any validation you invent yourself is likely incorrect. The best way to validate email addresses remains sending an email to them.
Is it impractical, though? It's already been written, and I've not seen any suggestion that it's not correct (other than the problem with comments, but that's intrinsic to regexes in general). As long as you're going to follow up by actually sending an email, I don't see a problem with this as a first-pass filter.
+1, seeing as you're probably going to be sending an activation email anyway. You can do some practical checks, like checking that there's a '@' in the email, and probably trimming spaces (I think leading/trailing whitespace isn't allowed, from memory).
This is from "Mastering Regular Expressions", by Jeffrey Friedl, O'Reilly 1997. The book presents it as a 'fun' example of how to write huge regex'es that are still understandable and maintainable (the version posted here is without all the comments that are in the book).
E-mail validation can be useful, but I would stay away from this thing. Look at what you are trying to do from a higher level.
Most likely the user wants something from you as well as you from them. If a user gives you a bad e-mail, despite a very basic e-mail regex, whatever, they won't get an e-mail, not my problem.
If it is to register on your website, just let them, send them a confirmation e-mail to their 'email', meanwhile allowing them to use the system (or not). Then if after x-time they haven't confirmed, just delete the user again. This will save you a lot of trouble.
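A minimal sketch of that confirm-or-delete flow in Python, assuming an in-memory dict where a real site would use its database. All names here are hypothetical, and the one-week window is an arbitrary pick for the "x-time".

```python
import secrets
import time

# The "x-time" after which unconfirmed signups are dropped (arbitrary).
CONFIRM_WINDOW_SECONDS = 7 * 24 * 3600

# token -> (email, created_at); stand-in for a database table.
pending = {}


def register(email, now=None):
    """Record an unconfirmed signup and return the token to mail out."""
    token = secrets.token_urlsafe(32)
    pending[token] = (email, time.time() if now is None else now)
    return token


def confirm(token, now=None):
    """Return the confirmed email, or None if the token is unknown or stale."""
    now = time.time() if now is None else now
    entry = pending.pop(token, None)
    if entry is None:
        return None
    email, created_at = entry
    if now - created_at > CONFIRM_WINDOW_SECONDS:
        return None
    return email


def purge_expired(now=None):
    """Delete signups that were never confirmed; returns how many were removed."""
    now = time.time() if now is None else now
    expired = [t for t, (_, created) in pending.items()
               if now - created > CONFIRM_WINDOW_SECONDS]
    for t in expired:
        del pending[t]
    return len(expired)
```

The `now` parameter is only there so the expiry logic can be exercised without waiting a week. Note that no address syntax check appears anywhere: the confirmation mail does the validating.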
If you want something more high-tech like checking a huge list of e-mails in a system you could go with a solution suggested below, just send them an e-mail.
I'm surprised there isn't a Perl6 version of Mail::RFC822. This is exactly the kind of thing that Perl6 rules[1] are supposed to excel at. It would be good publicity, especially now that rakudo has usable releases.
I find it pretty entertaining that this RegEx is so big that visual patterns emerged. In my browser there are clear diagonal lines of "@" symbols across the RegEx. If it looks like ASCII art, your RegEx is probably too big.
Validation of emails is pretty pointless since most errors will be typos that pass the regex anyway. You're better off trying to give warning messages based on common typos.
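One way to do the typo-warning idea, sketched in Python with the standard `difflib`. The domain list and the 0.8 cutoff are assumptions picked for illustration; a real list would come from your own signup data.

```python
import difflib

# Hypothetical list of domains your users commonly have.
COMMON_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"]


def suggest_address(address: str):
    """Return a 'did you mean ...?' address, or None if nothing looks close."""
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return None  # nothing sensible to compare
    domain = domain.lower()
    if domain in COMMON_DOMAINS:
        return None  # already a known domain, no warning needed
    # Pick the single closest known domain, if it is similar enough.
    matches = difflib.get_close_matches(domain, COMMON_DOMAINS, n=1, cutoff=0.8)
    return f"{local}@{matches[0]}" if matches else None
```

The result should only ever be shown as a suggestion the user can ignore, since an unusual but correct domain must still be accepted.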
Great, now I have something to strike fear into the hearts of new devs who ask me about email validation.
I'm not sure I'm comfortable using a regex like this in production. Sure, we can write lots of tests and ensure it performs correctly, and the rfc is unlikely to change so once proven solid it won't change... but using this just feels wrong. Like I'm using the dark side of the force.
I've been watching people deal with this problem for years and years... why? I can parse a CSV file far more easily.
You'd think there would be an RFC that specifies a simple email address format that everyone can follow. If you don't conform to that format, your email gets dropped on the floor until you get a better client.
It's also, unlike XHTML, not particularly easy to do it with a parser: most of the complexity of the regex is due to the litany of edge cases for what constitutes a valid email address, not due to it being a regex.
Email address validation should not be motivated by what some RFC considers a valid address, but by what you feel comfortable passing to your MTA, because you understand exactly what will happen. On the application side you probably don't want to store addresses with comments, real-name fields and other such human-readable-only data. My rules are: contains exactly one @, contains zero or more +, does not contain any other characters that are special cased by this (notably ',', ';' and '!').
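Those rules fit in a few lines of Python. This is just a sketch: `FORBIDDEN` holds only the three characters named above, and what else your particular MTA special-cases is something you would have to add yourself.

```python
# The characters named above; extend to suit whatever your MTA special-cases.
FORBIDDEN = set(",;!")


def safe_for_mta(address: str) -> bool:
    """Exactly one @ and none of the forbidden characters."""
    if address.count("@") != 1:
        return False
    # "Zero or more +" needs no check: a plus sign is simply allowed.
    return not (set(address) & FORBIDDEN)
```

So a+b@example.com passes, while the UUCP-style 1@2!3!4 and anything with a comma-separated list in it do not.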
Because the domain name doesn't need to be fully-qualified; it can just be a machine name on the local network.
To illustrate this: "user@localhost" is a valid email address.
All these overly complex regular expressions miss a major point: even if the e-mail address is valid according to the RFC it doesn't guarantee that:
* The domain name exists.
* The user exists at the specified domain.
* All of the SMTP servers between you and the recipient adhere exactly to the RFC.
* The user actually owns or has access to the e-mail account in question.
Whenever I need to validate an e-mail address, I just use something simple like ".+@.+" to ensure sanity and move on to more pressing matters. As a friend once pointed out to me: it's usually far more damaging to reject valid e-mail addresses than to accept invalid ones; be liberal in what you accept and verify the e-mail address by sending them a confirmation mail.
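For completeness, the ".+@.+" check as a Python sketch. Trimming surrounding whitespace first and using `fullmatch` with `re.DOTALL` are my own choices, not something stated above.

```python
import re

# At least one character, an @, and at least one more character.
SANITY = re.compile(r".+@.+", re.DOTALL)


def passes_sanity(raw: str) -> bool:
    """Liberal first-pass check; the confirmation mail does the real work."""
    return SANITY.fullmatch(raw.strip()) is not None
```

This accepts user@localhost and even oddities with several @ signs, which is the point: it only catches input that cannot possibly be an address.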
Yes, websites especially should accept more than [a-zA-Z0-9] for the user part. This would allow filtering emails. E.g. Gmail users can tag emails this way: john.doe+spam@gmail.com
if the second blaat via DNS is resolvable, it will work fine.
A company I consulted at had a mail server that was internal only, and via their DNS server, they had resolvable names for department1, department2 etc...
They used to send messages to addresses like user@department1, user@department2 etc., and as each name resolved fine, it worked very well.
And this is a problem I constantly run into with web services: some throw an error if I use a '.', others throw an error if I try to do something like example+note@example.com.
I use one or the other to help sort emails.
It's even worse when a sign up form accepts an email in the latter format, but the login form does not for some reason. So I have an account with a note added but I cannot log in. I had this problem with the Odeon website for a while, and eventually had to phone them up and ask them to change my account's email address to one without a note.
http://news.ycombinator.com/item?id=4486872