> The ex-Googler reflected that he missed the possibility of pages that link back to each other, causing "infinite recursion."

Although tangential to the billing issue, this is reckless. If you're building a crawler of any kind, please, please, *please* prioritize making sure this doesn't happen, so I don't have to wake up at 3 AM.

I run the infrastructure for a moderately sized site with around a hundred million pages. We can handle the HN hug of death just fine. But poorly made crawlers that recurse like this? They're an increasing problem.

If your fix for a broken crawler is "throw more concurrency at it and ignore the recursion," and suddenly your requests start *timing out*, that's a pretty damn strong hint that you're ruining someone's day.

From my perspective, this looks like an attack. I see thousands of IP addresses repeatedly requesting the same pages, usually with generic user agent headers. Which ones are actual attacks, and which are just poorly made crawlers? Well, if your generic user agent string doesn't link to a contact page, you're circumventing rate limiting by rotating IP addresses, and you had the bright idea to let your test code run overnight, I'm going to treat it as an attack. At 3 AM, I'm not inclined to distinguish negligence from malice.

This is happening more and more often, and I partly blame the ease of "accidentally" obtaining a ridiculous quantity of cloud resources. People deploy shoddy test code and go to bed, then turn it off in the morning when they see the bill.

It's become so prevalent that our company has an internal term for crawlers that spin up a new thread or container for every page: snowballing crawlers.

Save a sysadmin: don't snowball.

Oh, and include a useful user agent header so we can contact you instead of your cloud provider.
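
For anyone writing a crawler: both the cycle problem and the snowballing problem have cheap fixes. Here's a minimal sketch (Python, using the requests library; the bot name, contact URL, and limits are placeholders, not anything from the original story) of a crawl loop that keeps a visited set so mutually linking pages can't cause infinite recursion, caps the total number of pages, paces its requests instead of piling on concurrency, and sends a user agent that tells the site operator how to reach you.

    import re
    import time
    from collections import deque
    from urllib.parse import urljoin, urldefrag, urlparse

    import requests

    # An identifying user agent with a contact URL, so the operator can
    # reach you instead of your cloud provider. (Placeholder values.)
    USER_AGENT = "examplebot/0.1 (+https://example.com/bot-contact)"

    def crawl(seed_url, max_pages=1000, delay_seconds=1.0):
        """Single-threaded breadth-first crawl that can't snowball.

        A visited set breaks link cycles (pages that link back to each
        other), max_pages bounds total work, and a fixed delay keeps the
        request rate polite.
        """
        seen = {seed_url}          # every URL ever enqueued; nothing is fetched twice
        queue = deque([seed_url])
        session = requests.Session()
        session.headers["User-Agent"] = USER_AGENT

        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            try:
                resp = session.get(url, timeout=10)
            except requests.RequestException:
                continue           # skip unreachable pages instead of hammering them
            if resp.status_code != 200 or "html" not in resp.headers.get("Content-Type", ""):
                continue

            yield url, resp.text

            # Naive link extraction; real code should use an HTML parser.
            for href in re.findall(r'href="([^"#]+)"', resp.text):
                link, _ = urldefrag(urljoin(url, href))
                # Stay on the seed's host and never enqueue a URL twice --
                # this is what prevents infinite recursion between pages.
                if urlparse(link).netloc == urlparse(seed_url).netloc and link not in seen:
                    seen.add(link)
                    queue.append(link)

            time.sleep(delay_seconds)   # fixed pacing instead of "more concurrency"

Usage is just "for url, html in crawl("https://example.com/"): ...". The deduplication is the important part: even with more workers, a shared visited set means a cycle of pages costs you a handful of requests, not an unbounded stream of them.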