TechEcho
In search of the perfect URL validation regex (2010)

152 points by Jonhoo over 3 years ago

16 comments

likium over 3 years ago
Even if you built a URL validation regex that follows RFC 3986 [1] and RFC 3987 [2], you would still get user bug reports, because web browsers follow a different standard.

For example, <http://example.com./>, <http:///example.com/> and <https://en.wikipedia.org/wiki/Space (punctuation)> are classified as invalid URLs in the blog post, but browsers accept them.

As the creator of cURL puts it, there is no URL standard [3].

[1]: https://www.ietf.org/rfc/rfc3986.txt

[2]: https://www.ietf.org/rfc/rfc3987.txt

[3]: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/
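To make the browser behaviour concrete, here is a minimal sketch using the WHATWG `URL` class, which is a global in browsers and in Node.js. The helper name `parses` is made up for illustration; the three inputs are the ones quoted in the comment above.

```javascript
// Hypothetical helper: true if the WHATWG URL parser accepts the input
// as an absolute URL, false if the constructor throws.
function parses(input) {
  try {
    new URL(input);
    return true;
  } catch {
    return false;
  }
}

const samples = [
  "http://example.com./",
  "http:///example.com/",
  "https://en.wikipedia.org/wiki/Space (punctuation)",
];

for (const sample of samples) {
  // href shows how the parser normalizes each accepted input.
  console.log(sample, "->", parses(sample) ? new URL(sample).href : "rejected");
}
```

All three parse: the trailing dot stays in the hostname, the extra slash is skipped, and the space in the path is percent-encoded.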
maciejgryka over 3 years ago
Using https://regex.help/, I got this beauty, which passes all the ones that should pass. Obviously some room for improvement ;) But it works!

    ^(?:http(?:(?:://(?:(?:(?:code\.google\.com/events/#&product=browser|\-\.~_!\$&'\(\)\*\+,;=:%40:80%2f::::::@ex\.com|foo\.(?:bar/\?q=Test%20URL\-encoded%20stuff|com/(?:\(something\)\?after=parens|unicode_\(\)_in_parens|b_(?:\(wiki\)(?:_blah)?#cite\-1|b(?:_\(wiki\)_\(again\)|/))))|uid(?::password@ex\.com(?::8080)?/|@ex\.com(?::8080)?/)|www\.ex\.com/wpstyle/\?p=364|223\.255\.255\.254|उदाहरण\.परीक्षा|1(?:42\.42\.1\.1/|337\.net)|مثال\.إختبار|df\.ws/123|a\.b\-c\.de|\.ws/䨹|⌘\.ws/|例子\.测试|j\.mp)|142\.42\.1\.1:8080/)|\.damowmow\.com/)|s://(?:www\.ex\.com/foo/\?bar=baz&inga=42&quux|foo_bar\.ex\.com/))|://(?:uid(?::password@ex\.com(?::8080)?|@ex\.com(?::8080)?)|foo\.com/b_b(?:_\(wiki\))?|⌘\.ws))|ftp://foo\.bar/baz)$

I had to replace some words with shorter ones to squeeze under the 1000-character limit, and there's no way to provide negative examples right now. Something to fix!
dang over 3 years ago
Two past discussions, for the curious:

In search of the perfect URL validation regex - https://news.ycombinator.com/item?id=10019795 - Aug 2015 (77 comments)

In search of the perfect URL validation regex - https://news.ycombinator.com/item?id=7928968 - June 2014 (81 comments)
Sephr over 3 years ago
> Assume that this regex will be used for a public URL shortener written in PHP, so URLs like http://localhost/, //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and tel:+1234567890 shouldn’t pass (even though they’re technically valid)

At Transcend, we need to allow site owners to regulate any arbitrary network traffic, so our data flow input UI¹ was designed to detect all valid hosts (including local hosts, IDN, IPv6 literal addresses, etc.) and URLs (host-relative, protocol-relative, and absolute). If the site owner inputs content that is not a valid host or URL, then we treat their input as a regex.

I came up with these simple utilities built on top of the URL interface standard² to detect all valid hosts & URLs:

• isValidHost: https://gist.github.com/eligrey/6549ad0a635fa07749238911b42923da

Example valid inputs:

    host.example
    はじめよう.みんな (IDN domain; xn--p8j9a0d9c9a.xn--q9jyb4c)
    [::1] (IPv6 address)
    0xdeadbeef (IPv4 address; 222.173.190.239)
    123.456 (IPv4 address; 123.0.1.200)
    123456789 (IPv4 address; 7.91.205.21)
    localhost

• isValidURL (and isValidAbsoluteURL): https://gist.github.com/eligrey/443d51fab55864005ffb3873204b877a

Example valid inputs to isValidURL:

    https://absolute-url.example
    //relative-protocol.example
    /relative-path-example

1. https://docs.transcend.io/docs/configuring-data-flows

2. https://developer.mozilla.org/en-US/docs/Web/API/URL
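For readers who do not want to open the gists, here is a rough sketch of how checks like these can be built on the `URL` constructor. This is not the code from the linked gists: the function bodies, the `PLACEHOLDER_BASE` constant and the `canonicalHost` helper are assumptions made for illustration, and only the names `isValidURL` and `isValidAbsoluteURL` are borrowed from the comment.

```javascript
// Assumed placeholder base, used only so that relative inputs can resolve.
const PLACEHOLDER_BASE = "https://base.invalid/";

// Absolute URLs must parse on their own, without a base.
function isValidAbsoluteURL(input) {
  try {
    new URL(input);
    return true;
  } catch {
    return false;
  }
}

// Host-relative and protocol-relative inputs parse once a base is supplied.
function isValidURL(input) {
  try {
    new URL(input, PLACEHOLDER_BASE);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: shows how the parser canonicalizes a host string.
// Throws if the input is not a host the parser accepts.
function canonicalHost(input) {
  return new URL(`http://${input}`).hostname;
}

console.log(canonicalHost("0xdeadbeef")); // 222.173.190.239
console.log(canonicalHost("123.456"));    // 123.0.1.200
console.log(canonicalHost("123456789"));  // 7.91.205.21
```

Note that with a base supplied, almost any string without a scheme resolves as a relative path, so `isValidURL` in this form is deliberately permissive.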
dmurray over 3 years ago
> I also don’t want to allow every possible technically valid URL - quite the opposite.

Well, that should make things a lot easier. What does he mean here? The rest of the text doesn't make it clear to me, unless it's meant to be "every possibly valid HTTP, HTTPS, or FTP URL", which isn't exactly "the opposite".
azalemeth over 3 years ago
Honest question: there is a famous and very funny Stack Overflow answer on the topic of parsing HTML with a regex [1]. It states that the problem is in general impossible, and that if you find yourself doing this, something has gone wrong and you should re-evaluate your life choices / pray to Cthulhu.

So, does this apply to URLs? The fact that these regexes are... so huge... makes me think that something is fundamentally wrong. Are URLs describable in a Chomsky Type 3 grammar? Are they sufficiently regular that using a regex is sensible? What do the actual browsers do?

[1] https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/
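One data point on the "is it regular" question: RFC 3986 itself (Appendix B) gives a regular expression for breaking a URI reference into its components, so at least the generic splitting step is regular. The sketch below ports that expression to JavaScript; the `splitURI` wrapper and its property names are made up for illustration. It only splits and does not validate anything. Browsers do not work this way: they follow the WHATWG URL Standard, which specifies the parser as a state machine.

```javascript
// Component-splitting regex from RFC 3986, Appendix B.
const URI_COMPONENTS =
  /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical wrapper: names the capture groups the way the RFC describes them.
// Absent components come back as undefined.
function splitURI(reference) {
  const match = URI_COMPONENTS.exec(reference);
  return {
    scheme: match[2],
    authority: match[4],
    path: match[5],
    query: match[7],
    fragment: match[9],
  };
}

// The example used in the RFC.
console.log(splitURI("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
```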
gregsadetsky over 3 years ago
I was just struggling with this: specifically, our users' "UX" expectation that entering "example.com" should work when asked for their website URL.

Most URL validation rules/regexes/libraries/etc. reject "example.com". However, if you head over to Stripe (for example), in the account settings, when asked for your company's URL, Stripe will accept "example.com" and assume "http://" as the prefix (which, yes, can have its own problems).

What's a good solution? I want to validate URLs, but also let users enter "example.com". But if I simply do

    if (validateURL(url)) {
      return true;
    } else if (validateURL("http://" + url)) {
      return true;
    } else {
      return false;
    }

i.e. validate the given URL and, as a fallback, try to validate "http://" + the given URL, that opens the door to weird non-URL strings being incorrectly validated...

Help :-)
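One possible shape for this, as a sketch only: prepend "http://" when no scheme is present, parse with the `URL` class, and then apply a few extra restrictions so that the fallback does not let arbitrary strings through. The function name `normalizeWebsite` and the specific restrictions (http/https only, no credentials, hostname must contain a dot) are assumptions chosen for illustration, not Stripe's actual rules.

```javascript
// Hypothetical helper: returns a normalized URL string, or null if the
// input does not look like a public website address.
function normalizeWebsite(input) {
  const trimmed = input.trim();

  // Only add a scheme when the input does not already start with "scheme://".
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed);
  const candidate = hasScheme ? trimmed : `http://${trimmed}`;

  let url;
  try {
    url = new URL(candidate);
  } catch {
    return null; // the parser rejected it outright
  }

  // Assumed policy: web schemes only, no embedded credentials.
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;
  if (url.username !== "" || url.password !== "") return null;

  // Heuristic: require a dot so bare words like "localhost" are rejected.
  if (!url.hostname.includes(".")) return null;

  return url.href;
}

console.log(normalizeWebsite("example.com")); // http://example.com/
console.log(normalizeWebsite("not a url"));   // null
```

Returning the normalized string instead of a boolean means the caller stores what was actually validated, rather than the raw input.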
dmix over 3 years ago
@stephenhay seems to be the winner here if you don't need IP addresses (or weird dashed URLs). It's only 38 characters long and easy to understand.

    @^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS

The simpler the better, if you're going to use something that is not ideal.
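The pattern above is written for PHP's PCRE, with `@` as the delimiter and the `S` modifier. A possible JavaScript port is sketched below, assuming the `S` modifier can simply be dropped because it is an optimization hint with no JavaScript equivalent. The names `SIMPLE_URL` and `looksLikeURL` are made up for illustration.

```javascript
// Port of the @stephenhay pattern; slashes are escaped for the JS literal.
const SIMPLE_URL = /^(https?|ftp):\/\/[^\s\/$.?#].[^\s]*$/i;

// Hypothetical wrapper around the pattern.
function looksLikeURL(input) {
  return SIMPLE_URL.test(input);
}

console.log(looksLikeURL("http://foo.com/blah_blah")); // true
console.log(looksLikeURL("http://"));                  // false
console.log(looksLikeURL("//foo.bar/"));               // false
console.log(looksLikeURL("http://10.1.1.1"));          // true: private IPs are not filtered
```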
jabo over 3 years ago
Tangentially related, but mentioning to hopefully save someone time: if you ever find yourself wanting to check whether a version string is semver or not, there is an official regex provided, so check that before inventing your own.

I just discovered this yesterday and I’m glad I didn’t have to come up with this:

https://semver.org/#is-there-a-suggested-regular-expression-regex-to-check-a-semver-string

My use case for it: https://github.com/typesense/typesense-website/blob/25562d0297cdb2444417deb29c321a6779c5d4ec/docs-site/content/.vuepress/plugins/enhanceApp.js#L15
Thorrez over 3 years ago
The rules here don't make sense to me. http://223.255.255.254 must be allowed and http://10.1.1.1 must not. Is this to provide security for the 10.0.0.0/8 range? It doesn't do that, because foo.com could resolve to 10.1.1.1.
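To illustrate the point: a private-range check only helps if it is applied to the address a hostname actually resolves to, not to the URL text. The sketch below covers just the address check for the RFC 1918 ranges; the function name `isRfc1918Address` is made up, and the DNS resolution step is deliberately left out, so treat this as one piece of a larger design and not a complete defence.

```javascript
// Hypothetical helper: true if a dotted-decimal IPv4 address falls in one of
// the RFC 1918 private ranges (10/8, 172.16/12, 192.168/16).
function isRfc1918Address(address) {
  const parts = address.split(".");
  // Require exactly four purely numeric parts.
  if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p))) {
    return false;
  }
  const octets = parts.map(Number);
  if (octets.some((n) => n > 255)) return false;

  const [a, b] = octets;
  return (
    a === 10 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

console.log(isRfc1918Address("10.1.1.1"));        // true
console.log(isRfc1918Address("223.255.255.254")); // false
console.log(isRfc1918Address("foo.com"));         // false: a name, must be resolved first
```

A hostname such as foo.com passes any string-level filter, which is why the check belongs after name resolution, at the point where the request is made.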
mpeg over 3 years ago
I once failed a technical interview, partly because on the coding test I was asked to write a URL parser "from scratch, the way a browser would do it". I explained that it would take way too long to account for every edge case in the URL RFC, but that I could do a quick and dirty approach for common URLs.

After I did this, the interviewer stopped me and told me, in a negative way, that he expected me to use a regex, which kind of shows he had no idea how a web browser works.
gibsonf1 over 3 years ago
I use this:

    // `u` is the author's own utility namespace.
    u.checkURL = function (string) {
      // Anything that is not a string cannot be a URL.
      if (typeof string !== "string") {
        return false;
      }
      // Accepts http, https and ftp URLs.
      return /^(https?|ftp):(\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*@)?((\[(|(v[\da-f]{1,}\.(([a-z]|\d|-|\.|_|~)|[!\$&'\(\)\*\+,;=]|:)+))\])|((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=])*)(:\d*)?)(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*|(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)?)|((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)|((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)){0})(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|\/|\?)*)?$/i.test(string);
    };
DoctorOW over 3 years ago
Here's the least imperfect regex, with unit tests, on Regex101: https://regex101.com/r/IqI7KW/2
ylee over 3 years ago
Can someone convert diegoperini's regex into a form compatible with Emacs Lisp? I freely admit to this being beyond my brainpower.
0xdeadb00f over 3 years ago
No validation of hex-encoded IPv4? Or did I just miss it on my quick scroll through?
queuebert over 3 years ago
Uh oh, Regex is approaching sentience.