If you're going to allow dotted IPs, you should really allow 32-bit IPs too, e.g., <a href="http://0xadc229b7" rel="nofollow">http://0xadc229b7</a>, <a href="http://2915183031" rel="nofollow">http://2915183031</a> and <a href="http://025560424667" rel="nofollow">http://025560424667</a>. (The validity of this last one was news to me, I must admit.)
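A quick way to check that all three spellings name the same address (a Python sketch):<p><pre><code> import ipaddress

 for host in ("0xadc229b7", "2915183031", "025560424667"):
     # pick the base from the C-style prefix: 0x... is hex, a bare leading 0 is octal
     if host.startswith("0x"):
         n = int(host, 16)
     elif host.startswith("0"):
         n = int(host, 8)
     else:
         n = int(host)
     print(host, "->", ipaddress.IPv4Address(n))
 # all three decode to the same address (173.194.41.183)</code></pre>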
Why use a regex? Speaking as someone who has written a URL parser[1] and fixed a bug in PHP's,[2] it's much simpler to write a URL validator by hand.<p>Or, you know, use a robust existing validator or parser. Like PHP's, for instance.<p>[1] <a href="https://github.com/TazeTSchnitzel/Faucet-HTTP-Extension" rel="nofollow">https://github.com/TazeTSchnitzel/Faucet-HTTP-Extension</a> - granted, this deliberately limits the space of URLs it can parse, but it's not difficult to cover all valid cases if you need to<p>[2] <a href="https://github.com/php/php-src/commit/36b88d77f2a9d0ac74692a679f636ccb5d11589f" rel="nofollow">https://github.com/php/php-src/commit/36b88d77f2a9d0ac74692a...</a>
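The "use an existing parser, then check the pieces" route is only a few lines in most languages. A rough Python sketch (the accepted schemes here are just illustrative):<p><pre><code> from urllib.parse import urlsplit

 def looks_like_url(s):
     """Rough check: let the parser do the hard part, then inspect the pieces."""
     try:
         parts = urlsplit(s)
     except ValueError:            # e.g. a malformed IPv6 literal in the host
         return False
     if parts.scheme not in ("http", "https"):
         return False
     if not parts.hostname:        # no authority component at all
         return False
     return True</code></pre>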
Why are <a href="http://www.foo.bar./" rel="nofollow">http://www.foo.bar./</a> and <a href="http://a.b--c.de/" rel="nofollow">http://a.b--c.de/</a> supposed to fail?<p>The @stephenhay regex is just about perfect despite being the shortest. The subtleties of hyphen placement aren't very important, and this is a dumb place to filter out private IP addresses when a domain could always resolve to one. Checking whether an IP is valid should be a later step.
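If rejecting private addresses matters, doing it after name resolution might look roughly like this (a sketch, with the usual caveat that DNS answers can change between the check and the fetch):<p><pre><code> import socket
 import ipaddress

 def resolves_to_private(host):
     """True if any address the host resolves to is private or loopback."""
     for family, _, _, _, sockaddr in socket.getaddrinfo(host, None):
         addr = ipaddress.ip_address(sockaddr[0])
         if addr.is_private or addr.is_loopback:
             return True
     return False</code></pre>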
At best this lets you conclude that a URL <i>could</i> be valid. Is that really useful? Is the goal here to catch typos? Because you'd still miss an awful lot of typos.<p>If you really want your URL shortener to reject bad URLs, then you need to actually test fetching each URL (and even then...)<p>As an aside, I'd instantly fail any library that validates against a list of known TLDs. That was a bad idea when people were doing it a decade ago. It's completely impractical now.
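Something along these lines for the "actually try fetching it" step (using the third-party requests library; some servers reject HEAD, and a 200 today says nothing about tomorrow):<p><pre><code> import requests

 def fetch_ok(url, timeout=5.0):
     """Best-effort liveness check: issue a HEAD request and look at the status."""
     try:
         resp = requests.head(url, allow_redirects=True, timeout=timeout)
         return resp.status_code < 400
     except requests.RequestException:
         return False</code></pre>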
Another important dimension when evaluating these regexes is performance. The Gruber v2 regex shows what looks like exponential (catastrophic backtracking) behavior on certain pathological inputs, at least in Python's re module.<p>There are some examples of these pathological inputs at
<a href="https://github.com/tornadoweb/tornado/blob/master/tornado/test/escape_test.py#L20-29" rel="nofollow">https://github.com/tornadoweb/tornado/blob/master/tornado/te...</a>
Use a standard URI parser to break this problem into smaller parts. Let a modern URI library worry about arcane details like spaces, fragments, userinfo, IPv6 hosts, etc.<p><pre><code> require 'uri'
 uri = URI.parse(target).normalize
uri.absolute? or raise 'URI not absolute'
%w[ http https ftp ].include?(uri.scheme) or raise 'Unsupported URI scheme'
# Etc</code></pre>
John Gruber (of daringfireball.net) came up with a regex for extracting URLs from text (Twitter-like) years ago, and has improved it since. The current version is found at <a href="https://gist.github.com/gruber/249502" rel="nofollow">https://gist.github.com/gruber/249502</a>.<p>I haven't tested it myself, but it's worth looking at.<p>Original post: <a href="http://daringfireball.net/2009/11/liberal_regex_for_matching_urls" rel="nofollow">http://daringfireball.net/2009/11/liberal_regex_for_matching...</a><p>Updated version: <a href="http://daringfireball.net/2010/07/improved_regex_for_matching_urls" rel="nofollow">http://daringfireball.net/2010/07/improved_regex_for_matchin...</a><p>Most recent announcement, which contained the Gist URL: <a href="http://daringfireball.net/linked/2014/02/08/improved-improved-regex" rel="nofollow">http://daringfireball.net/linked/2014/02/08/improved-improve...</a>
Interestingly it seems <a href="http://✪df.ws" rel="nofollow">http://✪df.ws</a> isn't actually valid, even though it exists. ✪ isn't a letter[1], so it isn't allowed in international domain names. I was looking at the latest RFC from 2010 [2] so maybe it was allowed before that. The owner talks about all the compatibility trouble he had after he registered it [3]. The registrar that he used for it, Dynadot, won't let me register any name with that character, nor will Namecheap.<p>[1] <a href="http://www.fileformat.info/info/unicode/char/272a/index.htm" rel="nofollow">http://www.fileformat.info/info/unicode/char/272a/index.htm</a><p>[2] <a href="http://tools.ietf.org/html/rfc5892" rel="nofollow">http://tools.ietf.org/html/rfc5892</a><p>[3] <a href="http://daringfireball.net/2010/09/starstruck" rel="nofollow">http://daringfireball.net/2010/09/starstruck</a>
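You can see the 2003-vs-2008 split from Python: the built-in "idna" codec implements the older IDNA2003 mapping, while the third-party idna package follows RFC 5890-5892 and rejects the symbol. A sketch (behaviour may vary by library version):<p><pre><code> import idna   # third-party package implementing IDNA2008

 print('✪df.ws'.encode('idna'))   # built-in codec, IDNA2003-era rules (how it could be registered)
 try:
     idna.encode('✪df.ws')        # IDNA2008: U+272A is not a permitted codepoint
 except idna.IDNAError as e:
     print('rejected:', e)</code></pre>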
There is no perfect URL validation regex, because there are so many things you can do with URLs, and so many contexts to use them in. So, it might be perfect for the OP, but completely inappropriate for you.<p>That said, there is a regex in RFC 3986, but that's for parsing a URI, not validating it.<p>I converted 3986's ABNF to regex here:
<a href="https://gist.github.com/mnot/138549" rel="nofollow">https://gist.github.com/mnot/138549</a><p>However, some of the test cases in the original post (the list of URLs there aren't available separately any more :( ) are IRIs, not URIs, so they fail; they need to be converted to URIs first.<p>In the sense of the WHATWG's specs, what he's looking for <i>are</i> URLs, so this could be useful:
<a href="http://url.spec.whatwg.org" rel="nofollow">http://url.spec.whatwg.org</a><p>However, I don't know of a regex that implements that, and there isn't any ABNF to convert from there.
This is a good lesson in why you want to avoid writing your own regexes. Even something simple like an email address can be insane: <a href="http://ex-parrot.com/~pdw/Mail-RFC822-Address.html" rel="nofollow">http://ex-parrot.com/~pdw/Mail-RFC822-Address.html</a>
What's wrong with IP-address URLs? If they are invalid because it says so in some RFC, this is still not the ultimate regex. If you redirect a browser to <a href="http://192.168.1.1" rel="nofollow">http://192.168.1.1</a> it works perfectly fine.<p>And why must the root period after the domain be omitted from URLs? Not only does it work in a browser (and people end sentences with periods), the domain should actually end in a period all the time, but it's usually omitted for ease of use. Only some DNS applications still require domains to end with root dots.
I've put the test cases into a refiddle: <a href="http://refiddle.com/refiddles/53a736c175622d2770a70400" rel="nofollow">http://refiddle.com/refiddles/53a736c175622d2770a70400</a>
WTF? When will people finally learn to read the spec and implement things based on the spec and test things based on the spec instead of just making up themselves what a URL is or what HTML is or what an email address is or what a MIME body is or ...<p>There are supposed URIs in that list that aren't actually URIs, there are supposed non-URIs in that list that are actually URIs, and most of the candidate regexes obviously must have come from some creative minds and not from people who should be writing software. If you just make shit up instead of referring to what the spec says, you urgently should find yourself a new profession, this kind of crap has been hurting us long enough.<p>(Also, I do not just mean the numeric RFC1918 IPv4 URIs, which obviously are valid URIs but have been rejected intentionally nonetheless - even though that's idiotic as well, of course, given that (a) nothing prevents anyone from putting those addresses in the DNS and (b) those are actually perfectly fine URIs that people use, and I don't see why people should not want to shorten some class of the URIs that they use.)<p>By the way, the grammar in the RFC is machine readable, and it's regular. So you can just write a script that transforms that grammar into a regex that is guaranteed to reflect exactly what the spec says.
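To give a flavour of that mechanical translation, here are a couple of the RFC 3986 rules expanded by hand into regex fragments (a script walking the full ABNF would produce the rest the same way; Python used here just for illustration):<p><pre><code> import re

 # scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
 SCHEME      = r"[A-Za-z][A-Za-z0-9+\-.]*"
 # pct-encoded = "%" HEXDIG HEXDIG
 PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
 # unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
 UNRESERVED  = r"[A-Za-z0-9\-._~]"

 print(bool(re.fullmatch(SCHEME, "http")))    # True
 print(bool(re.fullmatch(SCHEME, "9http")))   # False: must start with ALPHA</code></pre>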
You do realize that RFC 3986 actually contains an official regular expression, right? <a href="http://tools.ietf.org/html/rfc3986#appendix-B" rel="nofollow">http://tools.ietf.org/html/rfc3986#appendix-B</a>
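For reference, that Appendix B expression dropped into Python, with the group numbering the RFC gives; note it parses a URI reference into components rather than validating them:<p><pre><code> import re

 RFC3986_APPENDIX_B = re.compile(
     r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
 )

 m = RFC3986_APPENDIX_B.match("http://www.ics.uci.edu/pub/ietf/uri/#Related")
 scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
 print(scheme, authority, path, query, fragment)
 # -> http www.ics.uci.edu /pub/ietf/uri/ None Related</code></pre>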