If you're going to allow dotted IPs, you should really allow 32-bit IPs too, e.g., <a href="http://0xadc229b7" rel="nofollow">http://0xadc229b7</a>, <a href="http://2915183031" rel="nofollow">http://2915183031</a> and <a href="http://025560424667" rel="nofollow">http://025560424667</a>. (The validity of this last one was news to me, I must admit.)
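A quick way to check that all three spellings name the same address (a Python sketch):<p><pre><code> import ipaddress

 for host in ("0xadc229b7", "2915183031", "025560424667"):
     # pick the base from the C-style prefix: 0x... is hex, a bare leading 0 is octal
     if host.startswith("0x"):
         n = int(host, 16)
     elif host.startswith("0"):
         n = int(host, 8)
     else:
         n = int(host)
     print(host, "->", ipaddress.IPv4Address(n))
 # all three decode to the same address (173.194.41.183)</code></pre>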
Why use a regex? Speaking as someone who has written a URL parser[1] and fixed a bug in PHP's,[2] it's much simpler to write a URL validator by hand.<p>Or, you know, use a robust existing validator or parser. Like PHP's, for instance.<p>[1] <a href="https://github.com/TazeTSchnitzel/Faucet-HTTP-Extension" rel="nofollow">https://github.com/TazeTSchnitzel/Faucet-HTTP-Extension</a> - granted, this deliberately limits the space of URLs it can parse, but it's not difficult to cover all valid cases if you need to<p>[2] <a href="https://github.com/php/php-src/commit/36b88d77f2a9d0ac74692a679f636ccb5d11589f" rel="nofollow">https://github.com/php/php-src/commit/36b88d77f2a9d0ac74692a...</a>
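The "use an existing parser, then check the pieces" route is only a few lines in most languages. A rough Python sketch (the accepted schemes here are just illustrative):<p><pre><code> from urllib.parse import urlsplit

 def looks_like_url(s):
     """Rough check: let the parser do the hard part, then inspect the pieces."""
     try:
         parts = urlsplit(s)
     except ValueError:            # e.g. a malformed IPv6 literal in the host
         return False
     if parts.scheme not in ("http", "https"):
         return False
     if not parts.hostname:        # no authority component at all
         return False
     return True</code></pre>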
Why are <a href="http://www.foo.bar./" rel="nofollow">http://www.foo.bar./</a> and <a href="http://a.b--c.de/" rel="nofollow">http://a.b--c.de/</a> supposed to fail?<p>The @stephenhay regex is just about perfect despite being the shortest. The subtleties of hyphen placement aren't very important, and this is a dumb place to filter out private IP addresses when a domain could always resolve to one. Checking whether an IP is valid should be a later step.
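If rejecting private addresses matters, doing it after name resolution might look roughly like this (a sketch, with the usual caveat that DNS answers can change between the check and the fetch):<p><pre><code> import socket
 import ipaddress

 def resolves_to_private(host):
     """True if any address the host resolves to is private or loopback."""
     for family, _, _, _, sockaddr in socket.getaddrinfo(host, None):
         addr = ipaddress.ip_address(sockaddr[0])
         if addr.is_private or addr.is_loopback:
             return True
     return False</code></pre>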
At best this lets you conclude that a URL <i>could</i> be valid. Is that really useful? Is the goal here to catch typos? Because you'd still miss an awful lot of typos.<p>If you really want your URL shortener to reject bad URLs, then you need to actually test fetching each URL (and even then...)<p>As an aside, I'd instantly fail any library that validates against a list of known TLDs. That was a bad idea when people were doing it a decade ago. It's completely impractical now.
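Something along these lines for the "actually try fetching it" step (using the third-party requests library; some servers reject HEAD, and a 200 today says nothing about tomorrow):<p><pre><code> import requests

 def fetch_ok(url, timeout=5.0):
     """Best-effort liveness check: issue a HEAD request and look at the status."""
     try:
         resp = requests.head(url, allow_redirects=True, timeout=timeout)
         return resp.status_code < 400
     except requests.RequestException:
         return False</code></pre>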
Another important dimension when evaluating these regexes is performance. The Gruber v2 regex shows what looks like exponential (catastrophic backtracking) behavior on certain pathological inputs, at least in Python's re module.<p>There are some examples of these pathological inputs at
<a href="https://github.com/tornadoweb/tornado/blob/master/tornado/test/escape_test.py#L20-29" rel="nofollow">https://github.com/tornadoweb/tornado/blob/master/tornado/te...</a>
Use a standard URI parser to break this problem into smaller parts. Let a modern URI library worry about arcane details like spaces, fragments, userinfo, IPv6 hosts, etc.<p><pre><code> require 'uri'
 uri = URI.parse(target).normalize
uri.absolute? or raise 'URI not absolute'
%w[ http https ftp ].include?(uri.scheme) or raise 'Unsupported URI scheme'
# Etc</code></pre>
John Gruber (of daringfireball.net) came up with a regex for extracting URLs from text (Twitter-like) years ago, and has improved it since. The current version is found at <a href="https://gist.github.com/gruber/249502" rel="nofollow">https://gist.github.com/gruber/249502</a>.<p>I haven't tested it myself, but it's worth looking at.<p>Original post: <a href="http://daringfireball.net/2009/11/liberal_regex_for_matching_urls" rel="nofollow">http://daringfireball.net/2009/11/liberal_regex_for_matching...</a><p>Updated version: <a href="http://daringfireball.net/2010/07/improved_regex_for_matching_urls" rel="nofollow">http://daringfireball.net/2010/07/improved_regex_for_matchin...</a><p>Most recent announcement, which contained the Gist URL: <a href="http://daringfireball.net/linked/2014/02/08/improved-improved-regex" rel="nofollow">http://daringfireball.net/linked/2014/02/08/improved-improve...</a>
Interestingly it seems <a href="http://✪df.ws" rel="nofollow">http://✪df.ws</a> isn't actually valid, even though it exists. ✪ isn't a letter[1], so it isn't allowed in international domain names. I was looking at the latest RFC from 2010 [2] so maybe it was allowed before that. The owner talks about all the compatibility trouble he had after he registered it [3]. The registrar that he used for it, Dynadot, won't let me register any name with that character, nor will Namecheap.<p>[1] <a href="http://www.fileformat.info/info/unicode/char/272a/index.htm" rel="nofollow">http://www.fileformat.info/info/unicode/char/272a/index.htm</a><p>[2] <a href="http://tools.ietf.org/html/rfc5892" rel="nofollow">http://tools.ietf.org/html/rfc5892</a><p>[3] <a href="http://daringfireball.net/2010/09/starstruck" rel="nofollow">http://daringfireball.net/2010/09/starstruck</a>
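You can see the 2003-vs-2008 split from Python: the built-in "idna" codec implements the older IDNA2003 mapping, while the third-party idna package follows RFC 5890-5892 and rejects the symbol. A sketch (behaviour may vary by library version):<p><pre><code> import idna   # third-party package implementing IDNA2008

 print('✪df.ws'.encode('idna'))   # built-in codec, IDNA2003-era rules (how it could be registered)
 try:
     idna.encode('✪df.ws')        # IDNA2008: U+272A is not a permitted codepoint
 except idna.IDNAError as e:
     print('rejected:', e)</code></pre>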
There is no perfect URL validation regex, because there are so many things you can do with URLs, and so many contexts to use them in. So, it might be perfect for the OP, but completely inappropriate for you.<p>That said, there is a regex in RFC 3986, but that's for parsing a URI, not validating it.<p>I converted 3986's ABNF to regex here:
<a href="https://gist.github.com/mnot/138549" rel="nofollow">https://gist.github.com/mnot/138549</a><p>However, some of the test cases in the original post (the list of URLs there aren't available separately any more :( ) are IRIs, not URIs, so they fail; they need to be converted to URIs first.<p>In the sense of the WHATWG's specs, what he's looking for <i>are</i> URLs, so this could be useful:
<a href="http://url.spec.whatwg.org" rel="nofollow">http://url.spec.whatwg.org</a><p>However, I don't know of a regex that implements that, and there isn't any ABNF to convert from there.
This is a good lesson in why you want to avoid writing your own regexes. Even something simple like an email address can be insane: <a href="http://ex-parrot.com/~pdw/Mail-RFC822-Address.html" rel="nofollow">http://ex-parrot.com/~pdw/Mail-RFC822-Address.html</a>
What's wrong with IP-address URLs? If they are invalid because it says so in some RFC, this is still not the ultimate regex. If you redirect a browser to <a href="http://192.168.1.1" rel="nofollow">http://192.168.1.1</a> it works perfectly fine.<p>And why must the root period after the domain be omitted from URLs? Not only does it work in a browser (and people end sentences with periods), the domain should actually end in a period all the time, but it's usually omitted for ease of use. Only some DNS applications still require domains to end with root dots.
I've put the test cases into a refiddle: <a href="http://refiddle.com/refiddles/53a736c175622d2770a70400" rel="nofollow">http://refiddle.com/refiddles/53a736c175622d2770a70400</a>
WTF? When will people finally learn to read the spec and implement things based on the spec and test things based on the spec instead of just making up themselves what a URL is or what HTML is or what an email address is or what a MIME body is or ...<p>There are supposed URIs in that list that aren't actually URIs, there are supposed non-URIs in that list that are actually URIs, and most of the candidate regexes obviously must have come from some creative minds and not from people who should be writing software. If you just make shit up instead of referring to what the spec says, you urgently should find yourself a new profession, this kind of crap has been hurting us long enough.<p>(Also, I do not just mean the numeric RFC1918 IPv4 URIs, which obviously are valid URIs but have been rejected intentionally nonetheless - even though that's idiotic as well, of course, given that (a) nothing prevents anyone from putting those addresses in the DNS and (b) those are actually perfectly fine URIs that people use, and I don't see why people should not want to shorten some class of the URIs that they use.)<p>By the way, the grammar in the RFC is machine readable, and it's regular. So you can just write a script that transforms that grammar into a regex that is guaranteed to reflect exactly what the spec says.
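To give a flavour of that mechanical translation, here are a couple of the RFC 3986 rules expanded by hand into regex fragments (a script walking the full ABNF would produce the rest the same way; Python used here just for illustration):<p><pre><code> import re

 # scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
 SCHEME      = r"[A-Za-z][A-Za-z0-9+\-.]*"
 # pct-encoded = "%" HEXDIG HEXDIG
 PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
 # unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
 UNRESERVED  = r"[A-Za-z0-9\-._~]"

 print(bool(re.fullmatch(SCHEME, "http")))    # True
 print(bool(re.fullmatch(SCHEME, "9http")))   # False: must start with ALPHA</code></pre>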
You do realize that RFC 3986 actually contains an official regular expression, right? <a href="http://tools.ietf.org/html/rfc3986#appendix-B" rel="nofollow">http://tools.ietf.org/html/rfc3986#appendix-B</a>
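For reference, that Appendix B expression dropped into Python, with the group numbering the RFC gives; note it parses a URI reference into components rather than validating them:<p><pre><code> import re

 RFC3986_APPENDIX_B = re.compile(
     r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
 )

 m = RFC3986_APPENDIX_B.match("http://www.ics.uci.edu/pub/ietf/uri/#Related")
 scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
 print(scheme, authority, path, query, fragment)
 # -> http www.ics.uci.edu /pub/ietf/uri/ None Related</code></pre>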