
Golang or Python for a large-scale distributed crawler?

1 point by gerenuk about 8 years ago
Which one would you opt for to deal better with concurrency? As for task queues, I have checked that Golang has the "Machinery" library, and for Python it's Celery (which we are currently using). The bottleneck we are seeing is that Celery is very hard to debug when issues come up with gevent.

Sometimes it works fine with 100 threads, and sometimes it completely stalls and goes idle without giving any errors.

I was thinking about this approach:

- Golang to take care of the requests/HTTP work and store the results in the DB.
- Python for the NLP tasks etc.

Any insights/suggestions would be helpful.
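To make that split concrete, here is a minimal Go sketch of the "Golang handles the HTTP work" half: a small pool of goroutines fetches URLs and hands each body to a store step. The urls list and the storePage function are hypothetical placeholders, not anything from the post; a real setup would write into whatever DB the Python/NLP side later reads from.

    // Minimal sketch: bounded worker pool of goroutines fetching URLs
    // and passing the bodies to a store step (stand-in for a DB write).
    package main

    import (
        "fmt"
        "io"
        "net/http"
        "sync"
        "time"
    )

    // storePage is a hypothetical stand-in for inserting the fetched page
    // into the database that the Python/NLP side would consume.
    func storePage(url string, body []byte) {
        fmt.Printf("stored %s (%d bytes)\n", url, len(body))
    }

    func fetch(client *http.Client, url string) ([]byte, error) {
        resp, err := client.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        return io.ReadAll(resp.Body)
    }

    func main() {
        urls := []string{ // hypothetical crawl targets
            "https://example.com/",
            "https://example.org/",
        }

        client := &http.Client{Timeout: 10 * time.Second}
        jobs := make(chan string)
        var wg sync.WaitGroup

        // A fixed pool of goroutines bounds concurrency, instead of
        // spawning one heavyweight thread per request.
        for i := 0; i < 8; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for url := range jobs {
                    body, err := fetch(client, url)
                    if err != nil {
                        fmt.Printf("fetch %s failed: %v\n", url, err)
                        continue
                    }
                    storePage(url, body)
                }
            }()
        }

        for _, u := range urls {
            jobs <- u
        }
        close(jobs)
        wg.Wait()
    }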

1 comment

tuxlinuxien about 8 years ago
Are you using 100 threads or coroutines? Because 100 threads on a single machine will not give you good performance at all.

I have started working on a crawler in Go where each website to crawl can be configured via YAML files. The amount of code is a bit larger than the Python version I wrote before, but a single micro instance on Amazon is more than enough to "read" a few hundred websites every hour. Personally, I think the goroutine system is much easier to deal with than the Python one.

So if I were you, I would definitely give the IO processing to Golang and let Python handle the data processing.
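A rough sketch of the per-site YAML configuration idea described above, assuming a made-up config shape (name, URL, crawl interval) since the commenter's actual format isn't shown; parsing uses the gopkg.in/yaml.v3 package. Each site gets its own lightweight goroutine, which is why a few hundred sites fit comfortably on one small instance.

    // Sketch: per-website configs parsed from YAML, one goroutine per site.
    // The config fields and the inline YAML are illustrative assumptions.
    package main

    import (
        "fmt"
        "net/http"
        "sync"
        "time"

        "gopkg.in/yaml.v3"
    )

    // SiteConfig is a hypothetical per-website config, normally loaded
    // from a YAML file on disk rather than an inline string.
    type SiteConfig struct {
        Name            string `yaml:"name"`
        URL             string `yaml:"url"`
        IntervalMinutes int    `yaml:"interval_minutes"`
    }

    const sampleYAML = `
    - name: example
      url: https://example.com/
      interval_minutes: 60
    - name: example-org
      url: https://example.org/
      interval_minutes: 30
    `

    func crawlOnce(client *http.Client, cfg SiteConfig) {
        resp, err := client.Get(cfg.URL)
        if err != nil {
            fmt.Printf("%s: %v\n", cfg.Name, err)
            return
        }
        resp.Body.Close()
        fmt.Printf("%s: HTTP %d\n", cfg.Name, resp.StatusCode)
    }

    func main() {
        var sites []SiteConfig
        if err := yaml.Unmarshal([]byte(sampleYAML), &sites); err != nil {
            panic(err)
        }

        client := &http.Client{Timeout: 10 * time.Second}
        var wg sync.WaitGroup

        // One goroutine per site; a real crawler would loop on a ticker
        // using IntervalMinutes instead of crawling once.
        for _, site := range sites {
            wg.Add(1)
            go func(cfg SiteConfig) {
                defer wg.Done()
                crawlOnce(client, cfg)
            }(site)
        }
        wg.Wait()
    }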