Even if you built a URL validation regex that follows RFC 3986[1] and RFC 3987[2], you would still get user bug reports, because web browsers follow a different standard.<p>For example, <<a href="http://example.com./" rel="nofollow">http://example.com./</a>>, <<a href="http:///example.com/" rel="nofollow">http:///example.com/</a>> and <<a href="https://en.wikipedia.org/wiki/Space (punctuation)" rel="nofollow">https://en.wikipedia.org/wiki/Space (punctuation)</a>> are classified as invalid URLs in the blog post, but they are accepted by browsers.<p>As the creator of cURL puts it, there is no single URL standard[3].<p>[1]: <a href="https://www.ietf.org/rfc/rfc3986.txt" rel="nofollow">https://www.ietf.org/rfc/rfc3986.txt</a><p>[2]: <a href="https://www.ietf.org/rfc/rfc3987.txt" rel="nofollow">https://www.ietf.org/rfc/rfc3987.txt</a><p>[3]: <a href="https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/" rel="nofollow">https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/</a>
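The "different standard" is easy to test directly: the WHATWG URL parser that browsers implement is also exposed as the global URL class in Node and Deno. A minimal sketch (TypeScript, no imports needed) showing that all three examples above parse cleanly:<p><pre><code> // None of these throw: the parser keeps the trailing dot, collapses
 // the extra slash, and percent-encodes the space instead of rejecting.
 for (const input of [
   "http://example.com./",
   "http:///example.com/",
   "https://en.wikipedia.org/wiki/Space (punctuation)",
 ]) {
   console.log(new URL(input).href);
 }
</code></pre>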
Using <a href="https://regex.help/" rel="nofollow">https://regex.help/</a>, I got this beauty, which passes everything that should pass. Obviously there's some room for improvement ;) But it works!<p><pre><code> ^(?:http(?:(?:://(?:(?:(?:code\.google\.com/events/#&product=browser|\-\.~_!\$&'\(\)\*\+,;=:%40:80%2f::::::@ex\.com|foo\.(?:bar/\?q=Test%20URL\-encoded%20stuff|com/(?:\(something\)\?after=parens|unicode_\(\)_in_parens|b_(?:\(wiki\)(?:_blah)?#cite\-1|b(?:_\(wiki\)_\(again\)|/))))|uid(?::password@ex\.com(?::8080)?/|@ex\.com(?::8080)?/)|www\.ex\.com/wpstyle/\?p=364|223\.255\.255\.254|उदाहरण\.परीक्षा|1(?:42\.42\.1\.1/|337\.net)|مثال\.إختبار|df\.ws/123|a\.b\-c\.de|\.ws/䨹|⌘\.ws/|例子\.测试|j\.mp)|142\.42\.1\.1:8080/)|\.damowmow\.com/)|s://(?:www\.ex\.com/foo/\?bar=baz&inga=42&quux|foo_bar\.ex\.com/))|://(?:uid(?::password@ex\.com(?::8080)?|@ex\.com(?::8080)?)|foo\.com/b_b(?:_\(wiki\))?|⌘\.ws))|ftp://foo\.bar/baz)$
</code></pre>
I had to replace some words with shorter ones to squeeze under the 1,000-character limit, and there's no way to provide negative examples right now. Something to fix!
Two past discussions, for the curious:<p><i>In search of the perfect URL validation regex</i> - <a href="https://news.ycombinator.com/item?id=10019795" rel="nofollow">https://news.ycombinator.com/item?id=10019795</a> - Aug 2015 (77 comments)<p><i>In search of the perfect URL validation regex</i> - <a href="https://news.ycombinator.com/item?id=7928968" rel="nofollow">https://news.ycombinator.com/item?id=7928968</a> - June 2014 (81 comments)
> Assume that this regex will be used for a public URL shortener written in PHP, so URLs like <a href="http://localhost/" rel="nofollow">http://localhost/</a>, //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and tel:+1234567890 shouldn’t pass (even though they’re technically valid)<p>At Transcend, we need to allow site owners to regulate arbitrary network traffic, so our data flow input UI¹ was designed to detect all valid hosts (including local hosts, IDNs, IPv6 literal addresses, etc.) and URLs (host-relative, protocol-relative, and absolute). If a site owner inputs content that is not a valid host or URL, we treat their input as a regex.<p>I came up with these simple utilities, built on top of the URL interface standard², to detect all valid hosts and URLs:<p>• isValidHost: <a href="https://gist.github.com/eligrey/6549ad0a635fa07749238911b42923da" rel="nofollow">https://gist.github.com/eligrey/6549ad0a635fa07749238911b429...</a><p>Example valid inputs:<p><pre><code> host.example
はじめよう.みんな (IDN domain; xn--p8j9a0d9c9a.xn--q9jyb4c)
[::1] (IPv6 address)
0xdeadbeef (IPv4 address; 222.173.190.239)
123.456 (IPv4 address; 123.0.1.200)
123456789 (IPv4 address; 7.91.205.21)
localhost
</code></pre>
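For a sense of how a check like this can sit on top of the URL interface (my own sketch of the approach below, not the code from the gist): parse the candidate as the authority of a synthetic http URL and reject anything that spills into other components.<p><pre><code> const isValidHost = (input: string): boolean => {
   try {
     // Let the WHATWG parser normalize IDNs, IPv6 literals, and the
     // hex/short IPv4 forms shown above.
     const url = new URL(`http://${input}`);
     // A bare host should leave every other component empty, so inputs
     // like "user@host" or "host/path" are rejected here.
     return (
       url.username === "" &&
       url.password === "" &&
       url.pathname === "/" &&
       url.search === "" &&
       url.hash === ""
     );
   } catch {
     return false; // the URL constructor throws on unparseable input
   }
 };
</code></pre>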
• isValidURL (and isValidAbsoluteURL): <a href="https://gist.github.com/eligrey/443d51fab55864005ffb3873204b877a" rel="nofollow">https://gist.github.com/eligrey/443d51fab55864005ffb3873204b...</a><p>Example valid inputs to isValidURL:<p><pre><code> https://absolute-url.example
//relative-protocol.example
/relative-path-example
</code></pre>
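The absolute-URL case (again my own sketch, not the gist's exact code) is the usual try/catch around the URL constructor; note that any scheme parses this way, including data: and tel:, which is exactly what we want here.<p><pre><code> const isValidAbsoluteURL = (input: string): boolean => {
   try {
     new URL(input); // throws on relative inputs like "/relative-path-example"
     return true;
   } catch {
     return false;
   }
 };
</code></pre>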
1. <a href="https://docs.transcend.io/docs/configuring-data-flows" rel="nofollow">https://docs.transcend.io/docs/configuring-data-flows</a><p>2. <a href="https://developer.mozilla.org/en-US/docs/Web/API/URL" rel="nofollow">https://developer.mozilla.org/en-US/docs/Web/API/URL</a>
> I also don’t want to allow every possible technically valid URL — quite the opposite.<p>Well, that should make things a lot easier. What does he mean here? The rest of the text doesn't make it clear to me, unless it's meant to be "every possibly valid HTTP, HTTPS, or FTP URL", which isn't exactly "the opposite".
Honest question: there is a famous and very funny Stack Exchange answer on the topic of parsing HTML with a regex [1] which states that the problem is in general impossible, and that if you find yourself doing this, something has gone wrong and you should re-evaluate your life choices / pray to Cthulhu.<p>So, does this apply to URLs? The fact that these regexes are... so huge... makes me think that something is fundamentally wrong. Are URLs describable in a Chomsky Type 3 grammar? Are they sufficiently regular that using a regex is sensible? What do actual browsers do?<p>[1] <a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/" rel="nofollow">https://stackoverflow.com/questions/1732348/regex-match-open...</a>
I was just struggling with this -- specifically, our users' "UX" expectation that entering "example.com" should work when asked for their website URL.<p>Most URL validation rules/regexes/libraries/etc. reject "example.com". However, if you head over to Stripe (for example), in the account settings, when asked for your company's URL, Stripe will accept "example.com" and assume "http://" as the prefix (which, yes, can have its own problems).<p>What's a good solution? I want to both validate URLs and let users enter "example.com". But if I simply do<p><pre><code> if(validateURL(url)) {
return true;
} else if(validateURL("http://" + url)) {
return true;
} else {
return false;
}
</code></pre>
i.e. validate the given URL and, as a fallback, try to validate "http://" + the given URL, which opens the door to weird non-URL strings being incorrectly validated...<p>Help :-)
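The least bad compromise I've found is to lean on the WHATWG URL constructor instead of a regex, retry with an assumed prefix, and then apply a heuristic guard. A sketch (validateURL above is hypothetical; the dot-in-hostname check below is my own assumption, not a rule from any spec):<p><pre><code> const normalizeURL = (input: string): string | null => {
   for (const candidate of [input, `http://${input}`]) {
     try {
       const url = new URL(candidate);
       // Heuristic guard against junk like "foo" or "http://bar" that
       // the parser happily accepts: require http(s) and a dotted host.
       if (/^https?:$/.test(url.protocol) && url.hostname.includes(".")) {
         return url.href;
       }
     } catch {
       // not parseable as-is; fall through to the prefixed form
     }
   }
   return null; // reject, or surface a "did you mean...?" prompt
 };
</code></pre>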
@stephenhay seems to be the winner here if you don't need IP addresses (or weird dashed URLs). It's only 38 characters long and easy to understand.<p><pre><code> @^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS
</code></pre>
The simpler the better, if you're going to use something that is not ideal.
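Transcribed out of its PHP delimiters (@...@ with the i and S flags) into a regex literal, a quick sanity check looks like this; note how permissive it stays, which is the trade-off for those 38 characters:<p><pre><code> const urlPattern = /^(https?|ftp):\/\/[^\s/$.?#].[^\s]*$/i;

 console.log(urlPattern.test("https://example.com/path")); // true
 console.log(urlPattern.test("http://localhost"));         // true (simple, not strict)
 console.log(urlPattern.test("not a url"));                // false
</code></pre>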
Tangentially related, but mentioning it to hopefully save someone time: if you ever find yourself wanting to check whether a version string is valid semver, don’t invent your own regex; there’s an official one provided.<p>I just discovered this yesterday and I’m glad I didn’t have to come up with it myself:<p><a href="https://semver.org/#is-there-a-suggested-regular-expression-regex-to-check-a-semver-string" rel="nofollow">https://semver.org/#is-there-a-suggested-regular-expression-...</a><p>My use case for it: <a href="https://github.com/typesense/typesense-website/blob/25562d0297cdb2444417deb29c321a6779c5d4ec/docs-site/content/.vuepress/plugins/enhanceApp.js#L15" rel="nofollow">https://github.com/typesense/typesense-website/blob/25562d02...</a>
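For reference, here's the numbered-capture-groups variant in use (my transcription of the pattern from that FAQ entry -- verify against the linked page before relying on it):<p><pre><code> const semverPattern =
   /^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$/;

 console.log(semverPattern.test("1.2.3"));              // true
 console.log(semverPattern.test("1.2.3-beta.1+build")); // true
 console.log(semverPattern.test("01.2.3"));             // false (leading zero)
</code></pre>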
The rules here don't make sense to me. <a href="http://223.255.255.254" rel="nofollow">http://223.255.255.254</a> must be allowed and <a href="http://10.1.1.1" rel="nofollow">http://10.1.1.1</a> must not. Presumably this is meant to provide security for the 10.0.0.0/8 range? It doesn't, because foo.com could resolve to 10.1.1.1.
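To make that concrete: the private-range check can only happen after DNS resolution, which no regex on the URL string can do. A sketch assuming Node's dns module ("foo.com" standing in for any attacker-controlled domain):<p><pre><code> import { promises as dns } from "node:dns";

 // Checks 10.0.0.0/8 only, for brevity; a real SSRF guard would cover
 // every private and link-local range, plus IPv6.
 const resolvesToPrivateRange = async (hostname: string): Promise<boolean> => {
   const addresses = await dns.resolve4(hostname);
   return addresses.some((ip) => ip.startsWith("10."));
 };

 console.log(await resolvesToPrivateRange("foo.com"));
</code></pre>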
I once failed a technical interview partly because, on the coding test, I was asked to write a URL parser "from scratch, the way a browser would do it". I explained that it would take way too long to account for every edge case in the URL RFC, but that I could do a quick-and-dirty approach for common URLs.<p>After I did this, the interviewer stopped me and told me, in a negative way, that he had expected me to use a regex, which kinda shows he had no idea how a web browser works.
Here's the least imperfect Regex with Unit Tests on Regex101: <a href="https://regex101.com/r/IqI7KW/2" rel="nofollow">https://regex101.com/r/IqI7KW/2</a>