This is the point at which it's worth learning <i>actual</i> parsing tools, rather than just winging it with REs. REs are fine for tokenizing, but they cannot handle recursion, and they quickly become clumsy for patterns made of distinct sub-elements.<p>Once you sink deeper into that Turing tarpit, you end up with monstrosities like this (<a href="http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html" rel="nofollow">http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html</a>), an RE for matching valid email addresses.
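<p>To make the recursion point concrete, here is a minimal sketch of a hand-written recursive-descent check for nested parentheses. The function names <code>parse_group</code> and <code>is_balanced</code> are made up for illustration, and this is only one way to do it, not a full URL or email parser.

```python
def parse_group(text, pos=0):
    """Consume plain characters and nested (...) groups starting at pos.

    Returns the index just past the consumed input. Stops early (without
    consuming) at a ')' so that the caller can match it against its '('.
    """
    while pos < len(text):
        ch = text[pos]
        if ch == "(":
            # Recurse for the group body, then require the matching ')'.
            pos = parse_group(text, pos + 1)
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unclosed '('")
            pos += 1
        elif ch == ")":
            return pos
        else:
            pos += 1
    return pos


def is_balanced(text):
    """True if every '(' in text has a matching ')' and vice versa."""
    try:
        # A stray ')' at the top level makes parse_group stop early.
        return parse_group(text) == len(text)
    except ValueError:
        return False
```

<p>For example, <code>is_balanced("a(b(c)d)e")</code> is true, while <code>is_balanced("a(b")</code> and <code>is_balanced("a)b")</code> are false. The nesting depth is tracked by the call stack, which is exactly what a plain regular expression has no equivalent for.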
Why bother matching balanced parens? Just match everything up to the next unencoded space.<p>These are valid URLs, no?<p><pre><code> example.com/(
example.com/dkjflkj)sdkfj(/.
example.com/.
example.com//////:
example.com/anycharacters_in_any_order_as_long_as_certain_ones_are_encoded
</code></pre>
There's no way to tell whether trailing punctuation is part of a valid URL or not. You can assume trailing punctuation is not part of it and chop it off, which should be correct 99.999% of the time. The same goes for surrounding braces, brackets, and parens: if you see one at the start, assume the one at the end is not part of the URL.
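<p>A rough sketch of that heuristic, assuming you already have a whitespace-delimited candidate string. The name <code>trim_url_candidate</code> and the exact punctuation and bracket sets are my own choices for illustration, not anything taken from the regex under discussion.

```python
# Assumed sets; adjust to taste.
TRAILING_PUNCT = ".,;:!?'\""
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def trim_url_candidate(candidate):
    """Strip a wrapping bracket pair and trailing punctuation from a token."""
    if candidate and candidate[0] in PAIRS:
        closer = PAIRS[candidate[0]]
        candidate = candidate[1:]
        # Punctuation may follow the closer, as in "(example.com/a)."
        stripped = candidate.rstrip(TRAILING_PUNCT)
        if stripped.endswith(closer):
            candidate = stripped[:-1]
    # Whatever punctuation is left at the end is assumed not to be URL.
    return candidate.rstrip(TRAILING_PUNCT)
```

<p>With these assumptions, <code>trim_url_candidate("(example.com/a).")</code> gives <code>example.com/a</code>, and <code>example.com/(</code> is left alone because it does not start with a bracket. It also turns <code>example.com//////:</code> into <code>example.com//////</code>, which is the rare case where the guess is wrong.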
What's the purpose of matching "www.", "www1.", "www2." … "www999."? For purposes of this regex, wouldn't the domain matching that follows be sufficient?