An Improved Liberal, Accurate Regex Pattern for Matching URLs

55 points by blazamos, almost 15 years ago

4 comments

silentbicycle, almost 15 years ago
This is the point at which it's worth learning *actual* parsing tools, rather than just winging it with REs. REs are fine for tokenizing, but cannot handle recursion, and quickly become clumsy for patterns made of distinct sub-elements.

Once you sink deeper into that Turing tarpit, you end up with monstrosities like this (http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html), an RE for matching valid email addresses.
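To illustrate the recursion point, here is a minimal sketch of a hand-written recursive-descent parser for nested parentheses, the kind of structure a classical regular expression cannot track to arbitrary depth. The function name `parse` and the nested-list output format are assumptions chosen for illustration, not something taken from the article or the comment.

```python
def parse(text):
    """Parse text into a nested list; each parenthesized group becomes a sub-list.

    Hypothetical helper for illustration. Raises ValueError on unbalanced input.
    """

    def group(pos):
        # Collect items until the end of input or a closing paren.
        items = []
        while pos < len(text):
            ch = text[pos]
            if ch == "(":
                # Recurse for the nested group, then require its closing paren.
                child, pos = group(pos + 1)
                if pos >= len(text) or text[pos] != ")":
                    raise ValueError("missing ')'")
                items.append(child)
                pos += 1
            elif ch == ")":
                # Leave the closing paren for the caller to consume.
                return items, pos
            else:
                items.append(ch)
                pos += 1
        return items, pos

    tree, pos = group(0)
    if pos != len(text):
        # A ')' was seen at the top level with no matching '('.
        raise ValueError("unexpected ')'")
    return tree
```

Each level of nesting is handled by one more call to `group`, so the depth is limited only by the call stack rather than by the shape of a pattern.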
njharman, almost 15 years ago
Why bother matching balanced parens? Just match everything until an unencoded space.

These are valid URLs, no?

    example.com/(
    example.com/dkjflkj)sdkfj(/.
    example.com/.
    example.com//////:
    example.com/anycharacters_in_any_order_as_long_as_certain_ones_are_encoded

There's no way to tell whether trailing punctuation is part of a valid URL or not. You can assume trailing punctuation is not, and chop it off, which should be correct 99.999% of the time. Similarly with surrounding braces, brackets, and parens: if you see one at the start, assume the one at the end is not part of the URL.
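The heuristic described in this comment could be sketched roughly as follows. This is only one reading of it: the name `find_urls`, the requirement of an `http://` or `https://` scheme to spot a candidate, and the exact set of trailing punctuation are all assumptions made for the example.

```python
import re

# Assumption: a candidate is an optional opening bracket followed by a
# scheme and everything up to the next whitespace character.
CANDIDATE = re.compile(r"([(\[{<]?)(https?://\S+)")
CLOSERS = {"(": ")", "[": "]", "{": "}", "<": ">"}
TRAILING = ".,:;!?"  # assumed set of punctuation to chop off


def find_urls(text):
    """Return URL-like strings found in text, trimmed by the heuristic."""
    urls = []
    for opener, url in CANDIDATE.findall(text):
        # Assume trailing punctuation is not part of the URL.
        url = url.rstrip(TRAILING)
        # If a bracket opened just before the URL, assume the matching
        # closer at the end belongs to the surrounding text.
        if opener and url.endswith(CLOSERS[opener]):
            url = url[:-1].rstrip(TRAILING)
        urls.append(url)
    return urls
```

Under these assumptions, `find_urls("See (http://example.com/foo_(bar)), or http://example.com/baz.")` yields the two URLs without the wrapping parenthesis, comma, or final period, while a bare `http://en.wikipedia.org/wiki/Foo_(bar)` keeps its closing paren because nothing opened before it. As the comment concedes, a URL that really does end in punctuation gets truncated.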
santry, almost 15 years ago
What's the purpose of matching "www.", "www1.", "www2." … "www999."? For purposes of this regex, wouldn't the domain matching that follows be sufficient?
eli, almost 15 years ago
But what about my .museum domains?