This is by the same guys as the Million Short search engine (Google minus the top million sites). Probably good to use the search engine in combination with the DNS servers, to find things that are not just broken links.<p><a href="http://www.millionshort.com/" rel="nofollow">http://www.millionshort.com/</a><p>The search engine was discussed on HN before:
<a href="http://news.ycombinator.com/item?id=3910304" rel="nofollow">http://news.ycombinator.com/item?id=3910304</a>
A great refinement would be: on the error page, suggest alternate sites with similar content that are still reachable.<p>Or even: for the exact URL visited, suggest the one page in the remaining long tail that's most like (by some text/semantic measure) the originally-requested page. (Or even: redirect automatically to that page.)
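One way to read "most like (by some text/semantic measure)" is plain bag-of-words cosine similarity. The sketch below is only an illustration under that assumption; `most_similar_page`, the candidate dictionary, and the `.test` URLs are hypothetical, and a real service would need a crawl of the long tail plus a better text measure.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    # a and b are Counters mapping token -> count.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_page(requested_text, candidates):
    """Return (url, score) for the candidate closest to requested_text.

    candidates is a dict of url -> page text (hypothetical input, e.g.
    pages that are still reachable). Returns None if it is empty.
    """
    target = Counter(tokenize(requested_text))
    best = None
    for url, text in candidates.items():
        score = cosine_similarity(target, Counter(tokenize(text)))
        if best is None or score > best[1]:
            best = (url, score)
    return best


reachable = {
    "http://a.test/dns": "python dns resolver tutorial",
    "http://b.test/food": "cooking pasta recipes",
}
print(most_similar_page("dns resolver in python", reachable))
```

The error page could then link to (or redirect to) the returned URL, ideally only when the score clears some threshold.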
If you use these servers you may see a lot less advertising. That's because more than a few of the top 100/1000/10000/1000000 sites are actually just ad servers, assuming Million Short is using Alexa as the source. And because they appear in the top Alexa list, one might guess those particular ad servers serve a significant share of the internet's advertising.<p>Another thought: you could potentially use these as general-purpose DNS servers. They are all on Amazon EC2, I believe, so with respect to the DNS-based geolocation efforts of many websites, you'd be treated as if coming from whatever region the datacenter is in. Just add the top 100/1000/10000/1000000 sites to your HOSTS file so they still resolve.
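A rough sketch of the HOSTS file idea: resolve each blocked domain once through an ordinary resolver and emit `IP<TAB>hostname` lines. `build_hosts_lines` and its `resolve` parameter are names made up for this example, and the entries are static, so sites that rotate IPs or sit behind a CDN would go stale.

```python
import socket


def build_hosts_lines(domains, resolve=socket.gethostbyname):
    """Return hosts-file lines ("IP<TAB>hostname") for the given domains.

    resolve is any callable mapping a hostname to an IPv4 string. The
    default uses the system resolver, so run this while still on normal DNS.
    """
    lines = []
    for domain in domains:
        try:
            ip = resolve(domain)
        except OSError:
            # socket.gaierror is a subclass of OSError; skip names
            # that do not resolve.
            continue
        lines.append(f"{ip}\t{domain}")
    return lines


if __name__ == "__main__":
    # Hypothetical short list; the real one would be the top-N sites.
    for line in build_hosts_lines(["example.com", "example.org"]):
        print(line)
```

The output can be appended to `/etc/hosts` (or the Windows equivalent) by hand.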
You can get into the top million on Alexa with a minuscule amount of traffic so you'd be extremely limited. Losing the top 1000 would probably be a more interesting experiment for mid/long term purposes.
What I think would be more interesting is a proxy that only allows the top 1k, 100k, or 1m sites.<p>I might be wrong, but it could be an easy way to keep users on the "bright streets" of the Internet instead of wandering down malware-ridden alleys.
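The core of such a proxy would be an allowlist check on the requested hostname. A minimal sketch, assuming the top-N list is already loaded into a set of registered domains; `is_allowed` is a hypothetical helper, and subdomains are accepted by walking up the labels.

```python
def is_allowed(hostname, allowed_domains):
    """Return True if hostname or one of its parent domains is allowed.

    allowed_domains is a set of lowercase domains such as {"example.com"}.
    The bare last label (e.g. "com") is never checked on its own.
    """
    labels = hostname.lower().rstrip(".").split(".")
    for i in range(len(labels) - 1):
        if ".".join(labels[i:]) in allowed_domains:
            return True
    return False


top_sites = {"example.com", "example.org"}  # stand-in for a top-N list
print(is_allowed("www.example.com", top_sites))
print(is_allowed("shady.example.net", top_sites))
```

A proxy would call this on the `Host` header (or the CONNECT target) and refuse anything that returns False.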
I can only imagine malicious uses for this. "Sorry, you're no longer allowed to access Google, Facebook, Twitter, or Wikipedia." Not that that is entirely a bad thing.
How did they do their ranking? Is it based on a web crawl, DNS stats, or something else? Is their list of the top million domains public? I would love to see the data.