Crawling Billions of Pages: Building Large Scale Crawling Cluster, part 1

71 points by warrenmar almost 10 years ago

6 comments

rb2k_ almost 10 years ago
I guess this fits in here:

Once upon a time I wrote my thesis on building a web crawler. The (tiny) blog post with an embedded preview:

http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blocks-of-a-scalable-webcrawler/

The PDF itself:

http://blog.marc-seeger.de/assets/papers/thesis_seeger-building_blocks_of_a_scalable_webcrawler.pdf

It's mostly a "this is what I learned and the things I had to take into consideration" with a few "this is how you identify a CMS" bits sprinkled into it. These days I would probably change a thing or two, but people told me it's still an entertaining read. (Not a native speaker though, so the English might have some stylistic kinks)
jordiburgos almost 10 years ago
Part 2 is already up: http://engineering.bloomreach.com/crawling-billions-of-pages-building-large-scale-crawling-cluster-part-2/
viraptor almost 10 years ago
> The Windows operating system can dispatch different events to different window handlers so you can handle all asynchronous HTTP calls efficiently. For a very long time, people weren’t able to do this on Linux-based operating systems since the underlying socket library contained a potential bottleneck.

What? select()'s biggest issue shows up when you have lots of idle connections, which shouldn't be a problem when crawling (you can send more requests while waiting for responses). epoll() has been available since 2003. What bottlenecks?
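A minimal sketch of the readiness-based multiplexing this comment refers to, using Python's standard `selectors` module (which picks epoll on Linux when it is available). The helper name `read_ready` and the local socket pairs standing in for in-flight HTTP connections are assumptions made for illustration; nothing here comes from the article.

```python
import selectors
import socket


def read_ready(socks, timeout=1.0):
    """Return {index: data} for every socket that becomes readable within the timeout."""
    # DefaultSelector uses the most efficient mechanism on the platform:
    # epoll on Linux, kqueue on BSD/macOS, select() as a fallback.
    sel = selectors.DefaultSelector()
    for i, s in enumerate(socks):
        s.setblocking(False)
        sel.register(s, selectors.EVENT_READ, data=i)
    results = {}
    try:
        # One call waits on all registered sockets at once.
        for key, _events in sel.select(timeout):
            results[key.data] = key.fileobj.recv(4096)
    finally:
        sel.close()
    return results


# Hypothetical stand-in for many open connections: three local socket pairs,
# of which only the first and third have a "response" waiting.
pairs = [socket.socketpair() for _ in range(3)]
pairs[0][1].sendall(b"response A")
pairs[2][1].sendall(b"response C")
ready = read_ready([local for local, _remote in pairs])
```

Only the sockets with pending data are reported, so a crawler built this way spends its time on connections that have something to read while the idle ones simply stay registered.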
krokoo almost 10 years ago
The challenges of crawling at large scale persist, as is evident from BloomReach and many other companies building custom solutions because the available open source tools cannot handle the scale of such products. SQLBot aims to solve this problem. The product is a few weeks from launch. If anyone is interested: http://www.amisalabs.com/AmisaSQLBot.html
exacube almost 10 years ago
From part 2 of their article:

> Currently, more than 60 percent of global internet traffic consists of requests from crawlers or some type of automated Web discovery system.

Where is this number from and how accurate can you make it?
kaivi almost 10 years ago
I wish there were more articles about determining the frequency at which one page should be crawled. Some pages never change, some change multiple times per minute, and we do not want to crawl them all equally often.
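One simple way to approach this, sketched here as an assumption and not as anything the article describes, is to adapt each page's revisit interval to what the last fetch found: shorten it when the content changed and lengthen it when it did not. The function name `next_interval`, the halving/doubling factors, and the bounds are all hypothetical choices.

```python
import hashlib

MIN_INTERVAL = 60            # seconds; hypothetical lower bound (one minute)
MAX_INTERVAL = 30 * 86400    # seconds; hypothetical upper bound (thirty days)


def next_interval(current, old_hash, new_body):
    """Return (new revisit interval in seconds, content hash of the fetched body).

    The interval is halved when the page changed since the last crawl and
    doubled when it did not, then clamped to [MIN_INTERVAL, MAX_INTERVAL].
    """
    new_hash = hashlib.sha256(new_body).hexdigest()
    changed = new_hash != old_hash
    factor = 0.5 if changed else 2.0
    interval = min(MAX_INTERVAL, max(MIN_INTERVAL, current * factor))
    return interval, new_hash


# A page last seen as b"v1" and currently revisited hourly.
seen = hashlib.sha256(b"v1").hexdigest()
unchanged_interval, _ = next_interval(3600, seen, b"v1")   # page did not change
changed_interval, _ = next_interval(3600, seen, b"v2")     # page changed
```

Pages that never change drift toward the upper bound, while pages that change on every visit settle at the lower bound. A raw content hash treats any byte difference as a change, so pages with timestamps or rotating ads would need some normalization first.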