How we scrape 300k prices per day from Google Flights

47 points by gusgordon almost 5 years ago

14 comments

jsnell almost 5 years ago
> This isn’t an astronomical number, but it’s large enough that we (at least, as a bootstrapped company) have to care about cost efficiency.
... by externalizing the costs to a third party.
In general, I'm really surprised that they published this article. It's like they described exactly the data that somebody working on preventing scraping would need to block this traffic, in a totally unnecessary level of detail. (E.g. telling exactly which ASN this traffic would be arriving from, describing the very specific timing of their traffic spikes, the kind of multi-city searches that probably see almost no organic traffic.)
I just don't get it. It's like they're intentionally trying to get blocked so that they can write a follow-up "how Google blocked our bootstrapped business" blog post.
cortesoft almost 5 years ago
> The crawl function reads a URL from the SQS queue, then Pyppeteer tells Chrome to navigate to that page behind a rotating residential proxy. The residential proxy is necessary to prevent Google from blocking the IP Lambda makes requests from.
I am very interested in what a 'rotating residential proxy' is. Are they routing requests through random people's internet connections? Are these people willing participants? Where do they come from?
randombytes6869 almost 5 years ago
To those lamenting that they're scraping... Google is the biggest scraper of them all. Facebook, Amazon, Google, Microsoft. All the big boys scrape voraciously, yet try their best to block themselves from being scraped. Scraping is vital for the functionality of the internet. The narrative that scraping is evil is what big companies want you to think.
When you block small scrapers from your site but permit giants like Googlebot and Bing, all you're doing is locking in a monopoly that's bad for everyone.
cleansy almost 5 years ago
It's ironic writing an article like that, while their ToS states:
> As a user of the Site, you agree not to:
> 1. Systematically retrieve data or other content from the Site to create or compile, directly or indirectly, a collection, compilation, database, or directory without written permission from us.
dmortin almost 5 years ago
It's strange they write about this so openly. Aren't they wary that someone at Google Flights will read it and try to block them? (E.g. by scrambling the page's code.)
dlhavema almost 5 years ago
Interesting. A scraper scraping a scraper. I don't get what the value add is over clients just searching Google Flights directly. Not trying to be mean, just trying to understand.
nunez almost 5 years ago
Flights isn't really the best way of getting cheap flights. They pepper the results, especially if they think you're scraping (which they probably do). Matrix is more accurate. Using a GDS is even more accurate but that costs money.
dandanio almost 5 years ago
Hey Gus, you might be interested in https://pricelinepartnernetwork.com/ (take a look at the API part for example)
(Disclaimer: I work for priceline).
founderling almost 5 years ago
The way I read it, they scrape 25k pages per day?
I wonder if that could already put them on Google's radar. If so, Google would probably send a cease and desist letter and this startup would simply give up.
I wonder if Google would also demand their legal expenses? Probably a couple thousand dollars?
I know, nobody would go to court against Google, but what would happen if this did go to court? Which laws would Google cite to deem this illegal?
BaitBlock almost 5 years ago
Reader mode in case you don't prefer Medium: https://baitblock.app/read/medium.com/brisk-voyage/how-we-scrape-300k-flight-prices-per-day-from-google-flights-79f5ddbdc4c0
mongodbhater almost 5 years ago
All the (AWS) technologies used are totally unnecessary. SQS/DynamoDB/Lambda. I can buy a laptop at Walmart for $500 and do all the scraping on Starbucks wifi.
nojito almost 5 years ago
You state that you care about costs but you end up using some of the most expensive cloud offerings out there?
ykevinator almost 5 years ago
This is awesome
tpmx almost 5 years ago
The Internet is not a series of tubes. It's a series of leeches...