> This isn’t an astronomical number, but it’s large enough that we (at least, as a bootstrapped company) have to care about cost efficiency.<p>... by externalizing the costs to a third party.<p>In general, I'm really surprised that they published this article. It's like they described exactly the data that somebody working on preventing scraping would need to block this traffic, in a totally unnecessary level of detail. (E.g. telling exactly which ASN this traffic would be arriving from, describing the very specific timing of their traffic spikes, the kind of multi-city searches that probably see almost no organic traffic.)<p>I just don't get it. It's like they're intentionally trying to get blocked so that they can write a follow-up "how Google blocked our bootstrapped business" blog post.
> The crawl function reads a URL from the SQS queue, then Pyppeteer tells Chrome to navigate to that page behind a rotating residential proxy. The residential proxy is necessary to prevent Google from blocking the IP Lambda makes requests from.<p>I am very interested in what a 'rotating residential proxy' is. Are they routing requests through random people's internet connections? Are these people willing participants? Where do they come from?
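To make the "rotating" part concrete: the idea is simply that each outgoing request exits through a different IP drawn from a pool (commercial providers do the rotation server-side behind a single endpoint). A minimal sketch of the rotation logic, with made-up TEST-NET addresses standing in for real residential exits:

```python
# Hedged sketch of proxy rotation: each request uses the next exit IP
# from a pool. The addresses below are placeholders (TEST-NET range),
# not real residential proxies.
import itertools

class RotatingProxyPool:
    """Cycle through a pool of proxy endpoints, one per request."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

pool = RotatingProxyPool([
    "http://203.0.113.10:8000",
    "http://203.0.113.11:8000",
    "http://203.0.113.12:8000",
])

# Three consecutive requests each get a different exit;
# the fourth wraps back around to the first.
exits = [pool.next_proxy() for _ in range(3)]
```

In the setup the article describes, the chosen proxy would then be handed to headless Chrome (e.g. via Chrome's `--proxy-server` launch argument) so that Google sees a different consumer-ISP address per session.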
To those lamenting that they're scraping... Google is the biggest scraper of them all. Facebook, Amazon, Google, Microsoft: all the big boys scrape voraciously, yet try their best to block themselves from being scraped. Scraping is vital for the functionality of the internet. The narrative that scraping is evil is what big companies want you to think.<p>When you block small scrapers from your site but permit giants like Googlebot and Bing, all you're doing is locking in a monopoly that's bad for everyone.
It's ironic to write an article like that while their ToS states:<p>> As a user of the Site, you agree not to:<p>> 1. Systematically retrieve data or other content from the Site to create or compile, directly or indirectly, a collection, compilation, database, or directory without written permission from us.
It's strange they write about this so openly. Aren't they wary that someone at Google Flights will read it and try to block them? (E.g. by scrambling the page's code)
Interesting. A scraper scraping a scraper. I don't get what the value add is over clients just searching Google Flights directly. Not trying to be mean, just trying to understand.
Flights isn't really the best way of getting cheap flights. They pepper the results, especially if they think you're scraping (which they probably do). Matrix is more accurate. Using a GDS is even more accurate but that costs money.
Hey Gus, you might be interested in <a href="https://pricelinepartnernetwork.com/" rel="nofollow">https://pricelinepartnernetwork.com/</a> (take a look at the API part for example)<p>(Disclaimer: I work for priceline).
The way I read it, they scrape 25k pages per day?<p>I wonder if that could already bring them on Google's radar. If so, Google would probably send a cease and desist letter and this startup would simply give up.<p>I wonder if Google would also demand their legal expenses? Probably a couple thousand dollars?<p>I know, nobody would go to court against Google - but what would happen if this <i>did</i> go to court? Which laws would Google cite to deem this illegal?
Reader mode, in case you'd rather not use Medium: <a href="https://baitblock.app/read/medium.com/brisk-voyage/how-we-scrape-300k-flight-prices-per-day-from-google-flights-79f5ddbdc4c0" rel="nofollow">https://baitblock.app/read/medium.com/brisk-voyage/how-we-sc...</a>
All the (AWS) technologies used are totally unnecessary. SQS/DynamoDB/Lambda. I can buy a laptop in Walmart for $500 and I can do all the scraping on Starbucks wifi.
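To illustrate the point: the queue-plus-workers shape of the AWS pipeline (SQS feeding Lambda crawlers, results landing in DynamoDB) can be sketched with the Python standard library on one machine. This is a toy sketch of the commenter's claim, not the article's code; `fetch()` is a stub standing in for the real browser-driven crawl.

```python
# Toy single-machine version of a queue-fed crawl pipeline:
# queue.Queue plays the role of SQS, a plain dict plays DynamoDB,
# and threads play the Lambda workers. fetch() is a stub.
import queue
import threading

def fetch(url: str) -> str:
    # Stand-in for the real crawl (headless browser, proxy, etc.)
    return f"<html>prices for {url}</html>"

def run_crawl(urls, num_workers=4):
    work = queue.Queue()   # plays the role of SQS
    results = {}           # plays the role of DynamoDB
    lock = threading.Lock()

    for u in urls:
        work.put(u)

    def worker():
        while True:
            try:
                url = work.get_nowait()
            except queue.Empty:
                return
            html = fetch(url)
            with lock:
                results[url] = html

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

prices = run_crawl(["BOS-SFO", "BOS-LAX"])
```

Whether this is actually a good trade is debatable: the managed services buy retries, durability, and parallel scale-out across IPs, which a laptop on coffee-shop wifi does not.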