
Ask HN: Has anyone ever crawled over a billion pages? How much did it cost?

10 points by outpan, over 8 years ago
I'm really curious to find out how much it'll cost to crawl a billion pages. Doesn't really matter if you used a SaaS solution or built your own crawler, any info would be really useful.

3 comments

mtmail, over 8 years ago
There's a discussion about a 2 billion page crawl on the frontpage right now. https://news.ycombinator.com/item?id=12486631

Here's the author's comment on hardware https://news.ycombinator.com/item?id=12487003 and later he says it costs 300 Euro/month to run the service.
AznHisoka, over 8 years ago
I've crawled over a billion pages over a stretch of 3 years or so. Crawling is the easy task and just crawling a billion pages wouldn't cost more than a few thousand a month. Add a couple more thousand for storing these pages in a search index and database.
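
To put the figures in that comment in perspective, here is a rough back-of-envelope sketch. It takes "a billion pages over 3 years" at face value and assumes 3000 per month as a stand-in for "a few thousand" (the comment gives neither an exact amount nor a currency, so both are illustrative). The function names are made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def sustained_rate(pages: int, years: float) -> float:
    """Average pages per second needed to fetch `pages` within `years`."""
    return pages / (years * SECONDS_PER_YEAR)


def cost_per_million(monthly_cost: float, pages: int, years: float) -> float:
    """Total spend over the period divided by pages fetched, per million pages."""
    total_spend = monthly_cost * 12 * years
    return total_spend / (pages / 1_000_000)


# Assumed inputs: 1 billion pages, 3 years, 3000 per month (currency unspecified).
rate = sustained_rate(1_000_000_000, 3)
cost = cost_per_million(3000, 1_000_000_000, 3)
print(f"{rate:.1f} pages/s, {cost:.0f} per million pages")
```

Under these assumptions the crawl averages only about 10 to 11 pages per second, which fits the claim that fetching is the easy part. Compressing the same crawl into a shorter window raises the required rate proportionally.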
cdnsteve, over 8 years ago
I think it would be valuable to have an open dataset of a raw crawl index. It could be distributed via academic torrents or partner with a hosting provider.

The real innovation won't be in crawling but in working on the index, filtering it, organizing it, trying sort algorithms and learning.

If this was available and gained popularity I could see competition in search again.