I do a significant amount of scraping for hobby projects, albeit mostly open websites. As a result, I've gotten pretty good at circumventing rate limiting and most other controls.

I suspect I'm one of those bad people your parents tell you to avoid - by that I mean I completely ignore robots.txt.

At this point, my architecture has settled on a distributed RPC system with a rotating swarm of clients. I use RabbitMQ as message-passing middleware, SaltStack for automated VM provisioning, and Python everywhere for everything else. Using some randomization and a list of the top N user agents, I can generate roughly 800K unique but valid-looking UAs. Selenium + PhantomJS gets you through non-CAPTCHA Cloudflare. Backing storage is Postgres.

Database triggers do row versioning, so I wind up with what is basically a mini Internet Archive of my own, with periodic snapshots of each site over time. Additionally, I have a readability-like processing layer that rewrites the page content in hopes of making the resulting layout actually pleasant to read, with pluggable rulesets that determine how page elements are decomposed.

At this point, I have a system that is, as far as I can tell, definitionally a botnet. The only thing is, I actually pay for the hosts.

---

Scaling something like this up to high volume is a really interesting challenge. My hosts are physically distributed, and just maintaining the RabbitMQ socket links is hard. I've actually had to do some hacking on the RabbitMQ library to let it handle the various ways I've seen a socket get wedged, and I still have some reliability issues in the SaltStack-DigitalOcean interface where VM creation gets stuck in an infinite loop, leading to me bleeding all my hosts. I also had to implement my own message fragmentation on top of RabbitMQ, because literally no AMQP library I found could *reliably* handle large (>100 KB) messages without eventually wedging.

There are other fun problems too, like the fact that I have a Postgres database that's ~700 GB in size, which means I have to spend time considering DB design and doing query optimization as well. I apparently have big-data problems in my bedroom (my home servers are in my bedroom closet).

---

It's all on GitHub, FWIW:

Manager: https://github.com/fake-name/ReadableWebProxy

Agent and salt scheduler: https://github.com/fake-name/AutoTriever
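
---

For anyone curious about the UA generation, the idea is just templating: take a handful of real, common user-agent strings and randomize the version and platform tokens so each generated string is unique but still looks plausible. A minimal sketch of that idea (the templates and version ranges below are made up for illustration, not the actual list I use):

    import random

    # A couple of realistic UA templates; the real list is much longer and
    # built from a ranking of the most common user agents in the wild.
    BASE_UAS = [
        "Mozilla/5.0 (Windows NT {win}; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/{major}.0.{build}.{patch} Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_{osx}) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/{major}.0.{build}.{patch} Safari/537.36",
    ]

    def random_ua():
        # Pick a template, then fill in plausible platform/version tokens.
        # Unused keyword arguments are simply ignored by str.format().
        template = random.choice(BASE_UAS)
        return template.format(
            win=random.choice(["6.1", "6.3", "10.0"]),
            osx=random.choice(["11_6", "12_5", "13_2"]),
            major=random.randint(45, 60),
            build=random.randint(2000, 3200),
            patch=random.randint(50, 140),
        )

    print(random_ua())

A small set of templates multiplied by the version/platform combinations gets you into the hundreds of thousands of distinct strings pretty quickly, which is where the ~800K figure comes from.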
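The message fragmentation is also simple conceptually: split a large payload into chunks under some safe size, tag each chunk with a message ID, an index, and a total count, and have the consumer buffer chunks until it has the full set. A rough sketch of that scheme (the chunk size, field names, and in-memory reassembly buffer are assumptions for illustration, not the actual AutoTriever code):

    import json
    import math
    import uuid

    # Stay well under the message sizes where AMQP libraries started wedging.
    CHUNK_SIZE = 64 * 1024

    def fragment(payload: bytes):
        # Producer side: yield JSON chunks that can each be published as a
        # normal, small AMQP message.
        msg_id = str(uuid.uuid4())
        total = math.ceil(len(payload) / CHUNK_SIZE)
        for idx in range(total):
            chunk = payload[idx * CHUNK_SIZE:(idx + 1) * CHUNK_SIZE]
            yield json.dumps({
                "msg_id": msg_id,
                "index": idx,
                "total": total,
                # latin-1 round-trips arbitrary bytes; base64 would also work.
                "data": chunk.decode("latin-1"),
            })

    # Consumer side: buffer fragments per msg_id until all of them arrive.
    _pending = {}

    def reassemble(raw_chunk: str):
        chunk = json.loads(raw_chunk)
        parts = _pending.setdefault(chunk["msg_id"], {})
        parts[chunk["index"]] = chunk["data"]
        if len(parts) == chunk["total"]:
            del _pending[chunk["msg_id"]]
            return "".join(parts[i] for i in range(chunk["total"])).encode("latin-1")
        return None  # still waiting on more fragments

The annoying parts in practice are the things this sketch glosses over: retransmitting lost fragments, expiring half-assembled messages, and keeping the reassembly buffer from growing without bound.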