I guess this fits in here:

Once upon a time I wrote my thesis on building a web crawler.
The (tiny) blog post with an embedded preview:

http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blocks-of-a-scalable-webcrawler/

The PDF itself:

http://blog.marc-seeger.de/assets/papers/thesis_seeger-building_blocks_of_a_scalable_webcrawler.pdf

It's mostly a "this is what I learned and the things I had to take into consideration" write-up, with a few "this is how you identify a CMS" bits sprinkled in. These days I would probably change a thing or two, but people have told me it's still an entertaining read.
(Not a native speaker though, so the English might have some stylistic kinks)
Part 2 is already there:
http://engineering.bloomreach.com/crawling-billions-of-pages-building-large-scale-crawling-cluster-part-2/
> The Windows operating system can dispatch different events to different window handlers so you can handle all asynchronous HTTP calls efficiently. For a very long time, people weren't able to do this on Linux-based operating systems since the underlying socket library contained a potential bottleneck.

What? select()'s biggest problem is that it scales poorly when you have lots of idle connections, which shouldn't be an issue when crawling (you can keep issuing new requests while waiting for responses). And epoll has been available since 2003. What bottleneck is this referring to?
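For context, here is a minimal sketch of the epoll-style approach on Linux, using Python's selectors module (which picks epoll when available). The hostnames and the request are placeholders, and error handling is omitted; it's only meant to show that many in-flight HTTP requests can be multiplexed on one thread without select()'s scaling problem.

    # Sketch: event-driven HTTP fetches multiplexed with selectors (epoll on Linux).
    import selectors
    import socket

    HOSTS = ["example.com", "example.org"]  # placeholder crawl targets
    sel = selectors.DefaultSelector()       # epoll-backed on Linux

    def start_request(host):
        sock = socket.socket()
        sock.setblocking(False)
        sock.connect_ex((host, 80))         # non-blocking connect
        req = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        # Register for writability first: the socket becomes writable once connected.
        sel.register(sock, selectors.EVENT_WRITE,
                     {"host": host, "req": req.encode(), "buf": b""})

    for h in HOSTS:
        start_request(h)

    while sel.get_map():
        events = sel.select(timeout=5)
        if not events:
            break                           # give up on anything still pending
        for key, mask in events:
            sock, data = key.fileobj, key.data
            if mask & selectors.EVENT_WRITE:
                sock.send(data["req"])      # assumes the small request fits in one send
                sel.modify(sock, selectors.EVENT_READ, data)
            elif mask & selectors.EVENT_READ:
                chunk = sock.recv(4096)
                if chunk:
                    data["buf"] += chunk
                else:                       # peer closed: response is complete
                    print(data["host"], len(data["buf"]), "bytes")
                    sel.unregister(sock)
                    sock.close()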
The challenges of crawling at large scale still persist, as is evident from BloomReach and many other companies building custom solutions because the available open source tools cannot handle the scale their products need. SQLBot aims to solve this problem.
The product is a few weeks from launch. If anyone is interested: http://www.amisalabs.com/AmisaSQLBot.html
From part 2 of their article:

> Currently, more than 60 percent of global internet traffic consists of requests from crawlers or some type of automated Web discovery system.

Where does this number come from, and how accurate can such an estimate be?
I wish there were more articles about determining how often a given page should be re-crawled. Some pages never change, some change multiple times per minute, and we do not want to crawl them all equally often.
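One common heuristic (not from the article, just a sketch) is to adapt each page's revisit interval multiplicatively, shrinking it when the content changed since the last fetch and growing it when it didn't. The bounds, factors, and starting interval below are arbitrary placeholders.

    # Sketch of an adaptive revisit scheduler based on content-change detection.
    import hashlib
    from dataclasses import dataclass

    MIN_INTERVAL = 60            # seconds; placeholder lower bound
    MAX_INTERVAL = 30 * 86400    # placeholder upper bound (30 days)

    @dataclass
    class PageState:
        interval: float = 3600.0   # start by revisiting hourly
        last_hash: str = ""

    def update_schedule(state: PageState, body: bytes) -> float:
        """Return the next revisit interval after fetching `body`."""
        digest = hashlib.sha256(body).hexdigest()
        if digest != state.last_hash:
            # Page changed: poll more aggressively.
            state.interval = max(MIN_INTERVAL, state.interval / 2)
        else:
            # Page unchanged: back off.
            state.interval = min(MAX_INTERVAL, state.interval * 1.5)
        state.last_hash = digest
        return state.interval

    state = PageState()
    print(update_schedule(state, b"<html>v1</html>"))  # changed   -> 1800.0
    print(update_schedule(state, b"<html>v1</html>"))  # unchanged -> 2700.0

More sophisticated schemes estimate a per-page change rate (e.g. a Poisson model over observed changes) instead of a simple multiplicative rule, but the backoff sketch above already captures the basic idea.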