I'm really curious to find out how much it would cost to crawl a billion pages. It doesn't really matter whether you used a SaaS solution or built your own crawler; any info would be really useful.
There's a discussion about a 2 billion page crawl on the frontpage right now. <a href="https://news.ycombinator.com/item?id=12486631" rel="nofollow">https://news.ycombinator.com/item?id=12486631</a><p>Here's the author's comment on hardware
<a href="https://news.ycombinator.com/item?id=12487003" rel="nofollow">https://news.ycombinator.com/item?id=12487003</a> and later he says it costs 300 Euro/month to run the service.
I've crawled over a billion pages over a stretch of 3 years or so. Crawling is the easy part, and just crawling a billion pages wouldn't cost more than a few thousand dollars a month. Add a couple more thousand for storing these pages in a search index and database.
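For anyone wanting to sanity-check the "few thousand a month" figure, here's a back-of-envelope sketch. All the numbers (average page size, $/TB bandwidth and storage rates, compression ratio) are my own assumptions, not from this thread:

```python
# Back-of-envelope cost estimate for crawling 1B pages in one month.
# Every constant below is an assumption for illustration only.
PAGES = 1_000_000_000
AVG_PAGE_KB = 100        # assumed average HTML page size
DAYS = 30                # target: finish the crawl in a month

total_tb = PAGES * AVG_PAGE_KB / 1024**3          # KB -> TB transferred
pages_per_sec = PAGES / (DAYS * 24 * 3600)        # sustained fetch rate

BANDWIDTH_COST_PER_TB = 10.0   # assumed $/TB on a budget host
STORAGE_COST_PER_TB = 20.0     # assumed $/TB-month, compressed at rest
COMPRESSION_RATIO = 5          # HTML compresses well with gzip/zstd

bandwidth_cost = total_tb * BANDWIDTH_COST_PER_TB
storage_cost = (total_tb / COMPRESSION_RATIO) * STORAGE_COST_PER_TB

print(f"transfer: {total_tb:.0f} TB, rate needed: {pages_per_sec:.0f} pages/s")
print(f"bandwidth ~${bandwidth_cost:.0f}, storage ~${storage_cost:.0f}/month")
```

Under these assumptions you move roughly 93 TB and need to sustain a few hundred pages per second, which is why the dominant costs end up being engineering time and the index, not raw bandwidth.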
I think it would be valuable to have an open dataset of a raw crawl index. It could be distributed via academic torrents or through a partnership with a hosting provider.<p>The real innovation won't be in crawling but in working on the index: filtering it, organizing it, experimenting with ranking algorithms, and learning.<p>If this were available and gained popularity, I could see competition in search again.
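To make "working on the index" concrete, here's a toy sketch of what experimenting on an open crawl index could look like: filter out likely spam, then re-rank by inbound links. The record fields and scoring are invented for illustration; a real index would be far richer:

```python
# Toy example of filtering and re-ranking crawl-index records.
# The Page fields and the spam_score threshold are hypothetical.
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    inlinks: int       # count of pages linking here
    spam_score: float  # 0.0 (clean) .. 1.0 (likely spam)

def rank(index: list[Page], spam_cutoff: float = 0.5) -> list[Page]:
    """Drop likely spam, then order by inbound-link count."""
    kept = [p for p in index if p.spam_score < spam_cutoff]
    return sorted(kept, key=lambda p: p.inlinks, reverse=True)

index = [
    Page("http://example.com/a", inlinks=120, spam_score=0.1),
    Page("http://example.com/b", inlinks=900, spam_score=0.9),
    Page("http://example.com/c", inlinks=40, spam_score=0.2),
]
print([p.url for p in rank(index)])  # page b is dropped as spam
```

The point is that once the raw index exists as a shared dataset, this kind of experimentation needs only modest compute, which is what would open search up to competition.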