How to crawl a quarter billion webpages in 40 hours (2012)

136 points by _ao789 over 9 years ago

8 comments

secondtimeuse over 9 years ago
This was written in 2012. It's even easier these days using SQS and CloudFormation. 250 million is a small number; you are better off first going through Common Crawl and then using data from those crawls to build a better seed list.

Common Crawl now contains repeated crawls conducted every few months, and also URLs donated by blekko.

https://groups.google.com/forum/m/#!msg/common-crawl/zexccXgwg4w/oV8qeJnawJUJ
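
A minimal sketch of the "go through Common Crawl first" idea, using the public CDX index API. The crawl ID, query parameters, and record fields here are assumptions based on the pywb CDX server; current crawl IDs are listed at https://index.commoncrawl.org/collinfo.json.

```python
import json
import requests

# Illustrative crawl ID -- pick a current one from collinfo.json.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2015-48-index"

def seed_urls(domain, limit=1000):
    """Yield URLs Common Crawl already knows about for a domain."""
    resp = requests.get(
        INDEX,
        params={"url": f"{domain}/*", "output": "json", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    for line in resp.text.splitlines():  # one JSON record per line
        yield json.loads(line)["url"]

for url in seed_urls("example.com", limit=10):
    print(url)
```
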
worried_citizen over 9 years ago
Web crawling is just like most things: 80% of the results for 20% of the work. It's always the last mile that takes the most significant cost and engineering effort.

As your index and scale grow, you bump into the really difficult problems:

1. How do you handle so many DNS requests/sec without overloading upstream servers?

2. How do you discover and determine the quality of new links? It's only a matter of time until your crawler hits a treasure trove of spam domains.

3. How do you store, update, and access an index that's exponentially growing?

Just some ideas.
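
On point 1, the usual first step is a local DNS cache plus a cap on concurrent lookups. A minimal sketch (the TTL and concurrency numbers are illustrative, not tuned):

```python
import socket
import threading
import time

_cache = {}                          # hostname -> (ip, expires_at)
_lock = threading.Lock()
_inflight = threading.Semaphore(50)  # at most 50 lookups in flight
TTL = 3600                           # cache entries for an hour

def resolve(hostname):
    """Resolve a hostname, reusing cached answers where possible."""
    now = time.time()
    with _lock:
        hit = _cache.get(hostname)
        if hit and hit[1] > now:
            return hit[0]
    with _inflight:                  # bound pressure on upstream servers
        ip = socket.gethostbyname(hostname)
    with _lock:
        _cache[hostname] = (ip, now + TTL)
    return ip
```

A production crawler would respect record TTLs and run its own recursive resolver, but the shape is the same: answer repeats locally, throttle what escapes upstream.
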
supername over 9 years ago
No one ever talks about one particular topic when it comes to web crawling: how do you avoid all the "bad" sites, as in really bad shit? The stuff that your ISP could use as evidence against you, when in fact it was just your code running and it happened to come across one of those sorts of sites. How do you deal with all that? That is the only thing stopping me from experimenting with web crawling.
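
One partial mitigation (a sketch, not a complete answer): filter candidate URLs against curated domain blocklists before fetching. The one-domain-per-line file format here is an assumption.

```python
from urllib.parse import urlparse

def load_blocklist(path):
    """Read a blocklist file, assumed to hold one domain per line."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_allowed(url, blocked):
    """Reject URLs on a blocked domain or any of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return not any(".".join(parts[i:]) in blocked for i in range(len(parts)))
```
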
pella over 9 years ago
old HN comments (3 years ago): https://news.ycombinator.com/item?id=4367933
tegansnyder over 9 years ago
I feel like more companies are building their businesses around web crawling and parsing data. There are lots of players in the eCommerce space that monitor pricing, search relevance, and product integrity. Each of these companies has to build some sort of templating system for defining crawl jobs, a set of parsing rules to extract the data, and a monitoring system to alert when the underlying HTML of a site has changed out from under the predefined rules. I'm interested in these aspects. Building a distributed crawler is easier than ever.
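
A toy version of such a template-plus-monitoring setup might look like the following (the field names and CSS selectors are made up; a real system would page a human rather than print):

```python
from dataclasses import dataclass, field
from bs4 import BeautifulSoup  # pip install beautifulsoup4

@dataclass
class Template:
    fields: dict = field(default_factory=dict)  # field name -> CSS selector

def extract(html, template):
    """Apply a template's selectors; flag fields whose selectors broke."""
    soup = BeautifulSoup(html, "html.parser")
    out, missing = {}, []
    for name, selector in template.fields.items():
        node = soup.select_one(selector)
        if node is None:
            missing.append(name)  # site markup drifted from the rules
        else:
            out[name] = node.get_text(strip=True)
    if missing:
        print(f"ALERT: selectors broke for fields: {missing}")
    return out

tpl = Template(fields={"title": "h1.product-title", "price": "span.price"})
print(extract("<h1 class='product-title'>Widget</h1>", tpl))
```
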
jdrock over 9 years ago
This isn't particularly difficult anymore. The most interesting challenges in web crawling are around turning a diaspora of web content into usable data. E.g., how do you get prices from 10 million product listings from 1,000 different e-retailers?
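
One generic angle on that problem, sketched under the assumption that a retailer embeds schema.org Product markup as JSON-LD (many do, which covers a long tail before hand-written per-site rules become necessary):

```python
import json
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def jsonld_price(html):
    """Return (price, currency) from schema.org Product JSON-LD, if any."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # malformed blobs are common in the wild
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict) or item.get("@type") != "Product":
                continue
            offer = item.get("offers") or {}
            if isinstance(offer, list):
                offer = offer[0] if offer else {}
            if isinstance(offer, dict) and offer.get("price") is not None:
                return offer.get("price"), offer.get("priceCurrency")
    return None
```
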
packersville over 9 years ago
I don't understand his hesitancy in releasing his crawler code. I imagine there are plenty of crawlers out there for people to access and alter for malicious use if they desired, so why is releasing his such a big deal?
pbreit over 9 years ago
Is "quarter billion" used to make it sound like a bigger number? Even "half" is aggressive, imo.