Building blocks of a scalable webcrawler.

80 points by 0x44 over 14 years ago

6 comments

shrikant over 14 years ago
IIRC, sriramk from around here (http://news.ycombinator.com/user?id=sriramk) had also 'rolled his own' web crawler as a project in college about 5-6 (?) years back. He blogged about it fairly actively back then, and I really enjoyed following his journey (esp. when, after months of dev and testing, he finally 'slipped it into the wild'). I tried to dredge up those posts, but he seems to have taken them down :( A shame really; they were quite a fascinating look at the early-stage evolution of a programmer!

Sriram, you around? ;)
rb2k_ over 14 years ago
Uh, look what the cat dragged in: my thesis :)

Hope some of you enjoy the read. I'm open to comments and criticism.
yesno over 14 years ago
I like Ted Dziuba's solution:

http://teddziuba.com/2010/10/taco-bell-programming.html

Full-stack programmer at work!
inovica over 14 years ago
A good read, and very timely from my perspective. We created a crawler in Python a couple of years ago for RSS feeds, but we ran into a number of issues with it, so we put it on hold while we concentrated on work that made money :) We picked the project back up last week and have been weighing rolling our own against using a framework like Scrapy. The main thing for us is being able to scale. If anyone has experience building a distributed crawler in Python, I'd welcome some advice.

Thanks again. Really good post.
richcollins over 14 years ago
I'm having good luck using node.js's httpClient and vertex.js for crawl state / persistence.
nl over 14 years ago
Can someone please explain what FPGA-aware garbage collection is?