Even if you built a URL validation regex that follows RFC 3986[1] and RFC 3987[2], you would still get user bug reports, because web browsers follow a different standard.<p>For example, <<a href="http://example.com./" rel="nofollow">http://example.com./</a>>, <<a href="http:///example.com/" rel="nofollow">http:///example.com/</a>> and <<a href="https://en.wikipedia.org/wiki/Space (punctuation)" rel="nofollow">https://en.wikipedia.org/wiki/Space (punctuation)</a>> are classified as invalid URLs in the blog post, but they are accepted by browsers.<p>As the creator of cURL puts it, there is no single URL standard[3].<p>[1]: <a href="https://www.ietf.org/rfc/rfc3986.txt" rel="nofollow">https://www.ietf.org/rfc/rfc3986.txt</a><p>[2]: <a href="https://www.ietf.org/rfc/rfc3987.txt" rel="nofollow">https://www.ietf.org/rfc/rfc3987.txt</a><p>[3]: <a href="https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/" rel="nofollow">https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/</a>
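The "different standard" is easy to test directly: the WHATWG URL parser that browsers implement is also exposed as the global URL class in Node and Deno. A minimal sketch (TypeScript, no imports needed) showing that all three examples above parse cleanly:<p><pre><code> // None of these throw: the parser keeps the trailing dot, collapses
 // the extra slash, and percent-encodes the space instead of rejecting.
 for (const input of [
   "http://example.com./",
   "http:///example.com/",
   "https://en.wikipedia.org/wiki/Space (punctuation)",
 ]) {
   console.log(new URL(input).href);
 }
</code></pre>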
Using <a href="https://regex.help/" rel="nofollow">https://regex.help/</a>, I got this beauty, which passes everything that should pass. Obviously there's some room for improvement ;) But it works!<p><pre><code> ^(?:http(?:(?:://(?:(?:(?:code\.google\.com/events/#&product=browser|\-\.~_!\$&'\(\)\*\+,;=:%40:80%2f::::::@ex\.com|foo\.(?:bar/\?q=Test%20URL\-encoded%20stuff|com/(?:\(something\)\?after=parens|unicode_\(\)_in_parens|b_(?:\(wiki\)(?:_blah)?#cite\-1|b(?:_\(wiki\)_\(again\)|/))))|uid(?::password@ex\.com(?::8080)?/|@ex\.com(?::8080)?/)|www\.ex\.com/wpstyle/\?p=364|223\.255\.255\.254|उदाहरण\.परीक्षा|1(?:42\.42\.1\.1/|337\.net)|مثال\.إختبار|df\.ws/123|a\.b\-c\.de|\.ws/䨹|⌘\.ws/|例子\.测试|j\.mp)|142\.42\.1\.1:8080/)|\.damowmow\.com/)|s://(?:www\.ex\.com/foo/\?bar=baz&inga=42&quux|foo_bar\.ex\.com/))|://(?:uid(?::password@ex\.com(?::8080)?|@ex\.com(?::8080)?)|foo\.com/b_b(?:_\(wiki\))?|⌘\.ws))|ftp://foo\.bar/baz)$
</code></pre>
I had to replace some words with shorter ones to squeeze under the 1,000-character limit, and there's no way to provide negative examples right now. Something to fix!
Two past discussions, for the curious:<p><i>In search of the perfect URL validation regex</i> - <a href="https://news.ycombinator.com/item?id=10019795" rel="nofollow">https://news.ycombinator.com/item?id=10019795</a> - Aug 2015 (77 comments)<p><i>In search of the perfect URL validation regex</i> - <a href="https://news.ycombinator.com/item?id=7928968" rel="nofollow">https://news.ycombinator.com/item?id=7928968</a> - June 2014 (81 comments)
> Assume that this regex will be used for a public URL shortener written in PHP, so URLs like <a href="http://localhost/" rel="nofollow">http://localhost/</a>, //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and tel:+1234567890 shouldn’t pass (even though they’re technically valid)<p>At Transcend, we need to allow site owners to regulate arbitrary network traffic, so our data flow input UI¹ was designed to detect all valid hosts (including local hosts, IDNs, IPv6 literal addresses, etc.) and URLs (host-relative, protocol-relative, and absolute). If a site owner inputs content that is not a valid host or URL, we treat their input as a regex.<p>I came up with these simple utilities, built on top of the URL interface standard², to detect all valid hosts and URLs:<p>• isValidHost: <a href="https://gist.github.com/eligrey/6549ad0a635fa07749238911b42923da" rel="nofollow">https://gist.github.com/eligrey/6549ad0a635fa07749238911b429...</a><p>Example valid inputs:<p><pre><code> host.example
はじめよう.みんな (IDN domain; xn--p8j9a0d9c9a.xn--q9jyb4c)
[::1] (IPv6 address)
0xdeadbeef (IPv4 address; 222.173.190.239)
123.456 (IPv4 address; 123.0.1.200)
123456789 (IPv4 address; 7.91.205.21)
localhost
</code></pre>
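For a sense of how a check like this can sit on top of the URL interface (my own sketch of the approach below, not the code from the gist): parse the candidate as the authority of a synthetic http URL and reject anything that spills into other components.<p><pre><code> const isValidHost = (input: string): boolean => {
   try {
     // Let the WHATWG parser normalize IDNs, IPv6 literals, and the
     // hex/short IPv4 forms shown above.
     const url = new URL(`http://${input}`);
     // A bare host should leave every other component empty, so inputs
     // like "user@host" or "host/path" are rejected here.
     return (
       url.username === "" &&
       url.password === "" &&
       url.pathname === "/" &&
       url.search === "" &&
       url.hash === ""
     );
   } catch {
     return false; // the URL constructor throws on unparseable input
   }
 };
</code></pre>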
• isValidURL (and isValidAbsoluteURL): <a href="https://gist.github.com/eligrey/443d51fab55864005ffb3873204b877a" rel="nofollow">https://gist.github.com/eligrey/443d51fab55864005ffb3873204b...</a><p>Example valid inputs to isValidURL:<p><pre><code> https://absolute-url.example
//relative-protocol.example
/relative-path-example
</code></pre>
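The absolute-URL case (again my own sketch, not the gist's exact code) is the usual try/catch around the URL constructor; note that any scheme parses this way, including data: and tel:, which is exactly what we want here.<p><pre><code> const isValidAbsoluteURL = (input: string): boolean => {
   try {
     new URL(input); // throws on relative inputs like "/relative-path-example"
     return true;
   } catch {
     return false;
   }
 };
</code></pre>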
1. <a href="https://docs.transcend.io/docs/configuring-data-flows" rel="nofollow">https://docs.transcend.io/docs/configuring-data-flows</a><p>2. <a href="https://developer.mozilla.org/en-US/docs/Web/API/URL" rel="nofollow">https://developer.mozilla.org/en-US/docs/Web/API/URL</a>
> I also don’t want to allow every possible technically valid URL — quite the opposite.<p>Well, that should make things a lot easier. What does he mean here? The rest of the text doesn't make it clear to me, unless it's meant to be "every possibly valid HTTP, HTTPS, or FTP URL", which isn't exactly "the opposite".
Honest question: there is a famous and very funny Stack Exchange answer on the topic of parsing HTML with a regex [1] which states that the problem is in general impossible, and that if you find yourself doing this, something has gone wrong and you should re-evaluate your life choices / pray to Cthulhu.<p>So, does this apply to URLs? The fact that these regexes are... so huge... makes me think that something is fundamentally wrong. Are URLs describable in a Chomsky Type 3 grammar? Are they sufficiently regular that using a regex is sensible? What do actual browsers do?<p>[1] <a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/" rel="nofollow">https://stackoverflow.com/questions/1732348/regex-match-open...</a>
I was just struggling with this -- specifically, our users' "UX" expectation that entering "example.com" should work when asked for their website URL.<p>Most URL validation rules/regexes/libraries/etc. reject "example.com". However, if you head over to Stripe (for example), in the account settings, when asked for your company's URL, Stripe will accept "example.com" and assume "http://" as the prefix (which, yes, can have its own problems).<p>What's a good solution? I want to both validate URLs and let users enter "example.com". But if I simply do<p><pre><code> if(validateURL(url)) {
return true;
} else if(validateURL("http://" + url)) {
return true;
} else {
return false;
}
</code></pre>
i.e. validate the given URL and, as a fallback, try to validate "http://" + the given URL, which opens the door to weird non-URL strings being incorrectly validated...<p>Help :-)
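The least bad compromise I've found is to lean on the WHATWG URL constructor instead of a regex, retry with an assumed prefix, and then apply a heuristic guard. A sketch (validateURL above is hypothetical; the dot-in-hostname check below is my own assumption, not a rule from any spec):<p><pre><code> const normalizeURL = (input: string): string | null => {
   for (const candidate of [input, `http://${input}`]) {
     try {
       const url = new URL(candidate);
       // Heuristic guard against junk like "foo" or "http://bar" that
       // the parser happily accepts: require http(s) and a dotted host.
       if (/^https?:$/.test(url.protocol) && url.hostname.includes(".")) {
         return url.href;
       }
     } catch {
       // not parseable as-is; fall through to the prefixed form
     }
   }
   return null; // reject, or surface a "did you mean...?" prompt
 };
</code></pre>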
@stephenhay seems to be the winner here if you don't need IP addresses (or weird dashed URLs). It's only 38 characters long and easy to understand.<p><pre><code> @^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS
</code></pre>
The simpler the better, if you're going to use something that is not ideal.
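Transcribed out of its PHP delimiters (@...@ with the i and S flags) into a regex literal, a quick sanity check looks like this; note how permissive it stays, which is the trade-off for those 38 characters:<p><pre><code> const urlPattern = /^(https?|ftp):\/\/[^\s/$.?#].[^\s]*$/i;

 console.log(urlPattern.test("https://example.com/path")); // true
 console.log(urlPattern.test("http://localhost"));         // true (simple, not strict)
 console.log(urlPattern.test("not a url"));                // false
</code></pre>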
Tangentially related, but mentioning it to hopefully save someone time: if you ever find yourself wanting to check whether a version string is valid semver, don’t invent your own regex; there’s an official one provided.<p>I just discovered this yesterday and I’m glad I didn’t have to come up with it myself:<p><a href="https://semver.org/#is-there-a-suggested-regular-expression-regex-to-check-a-semver-string" rel="nofollow">https://semver.org/#is-there-a-suggested-regular-expression-...</a><p>My use case for it: <a href="https://github.com/typesense/typesense-website/blob/25562d0297cdb2444417deb29c321a6779c5d4ec/docs-site/content/.vuepress/plugins/enhanceApp.js#L15" rel="nofollow">https://github.com/typesense/typesense-website/blob/25562d02...</a>
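For reference, here's the numbered-capture-groups variant in use (my transcription of the pattern from that FAQ entry -- verify against the linked page before relying on it):<p><pre><code> const semverPattern =
   /^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$/;

 console.log(semverPattern.test("1.2.3"));              // true
 console.log(semverPattern.test("1.2.3-beta.1+build")); // true
 console.log(semverPattern.test("01.2.3"));             // false (leading zero)
</code></pre>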
The rules here don't make sense to me. <a href="http://223.255.255.254" rel="nofollow">http://223.255.255.254</a> must be allowed and <a href="http://10.1.1.1" rel="nofollow">http://10.1.1.1</a> must not. Presumably this is meant to provide security for the 10.0.0.0/8 range? It doesn't, because foo.com could resolve to 10.1.1.1.
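To make that concrete: the private-range check can only happen after DNS resolution, which no regex on the URL string can do. A sketch assuming Node's dns module ("foo.com" standing in for any attacker-controlled domain):<p><pre><code> import { promises as dns } from "node:dns";

 // Checks 10.0.0.0/8 only, for brevity; a real SSRF guard would cover
 // every private and link-local range, plus IPv6.
 const resolvesToPrivateRange = async (hostname: string): Promise<boolean> => {
   const addresses = await dns.resolve4(hostname);
   return addresses.some((ip) => ip.startsWith("10."));
 };

 console.log(await resolvesToPrivateRange("foo.com"));
</code></pre>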
I once failed a technical interview partly because, on the coding test, I was asked to write a URL parser "from scratch, the way a browser would do it". I explained that it would take way too long to account for every edge case in the URL RFC, but that I could do a quick-and-dirty approach for common URLs.<p>After I did this, the interviewer stopped me and told me, in a negative way, that he had expected me to use a regex, which kinda shows he had no idea how a web browser works.
Here's the least imperfect Regex with Unit Tests on Regex101: <a href="https://regex101.com/r/IqI7KW/2" rel="nofollow">https://regex101.com/r/IqI7KW/2</a>