We survived 10k requests/second: Switching to signed asset URLs in an emergency

147 pointsby dyogenez9 months ago

34 comments

Sytten9 months ago

I really hope this is not the whole of your code otherwise you have a nice open redirect vulnerability on your hand and possibly a private bucket leak if you don't check which bucket you are signing the request for. Never for the love of security take an URL as input from a user without doing a whole lot of checks and sanitization. And don't expect your language parser to be perfect, Orange Tsai demonstrated they can get confused [1].[1] <a href="https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf" rel="nofollow">https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-Ne...</a>

评论 #41261241 未加载

paxys9 months ago

Quick feedback – you've used the term "signed URL" over 50 times in the post without once explaining what it is or how it works.

评论 #41260650 未加载

评论 #41260636 未加载

评论 #41260823 未加载

dyogenez9 months ago

Earlier this week someone started hitting our Google Cloud Storage bucket with 10k requests a second... for 7 hours. I realized this while working from a coffee shop and spent the rest of the day putting in place a fix.This post goes over what happened, how we put an a solution in place in hours and how we landed on the route we took.I'm curious to hear how others have solved this same problem – generating authenticated URLs when you have a public API.

评论 #41260960 未加载

评论 #41262833 未加载

评论 #41260652 未加载

评论 #41260764 未加载

flockonus9 months ago

Have you considered putting cloudflare or similar CDN with unlimited egress in front of your bucket?Reading your blogpost I don't fully get how the current signing implementation can halt massive downloads, or the "attacker"(?) would just adapt their methods to get the signed URLs first and then proceed to download what they are after anyway?

评论 #41260644 未加载

评论 #41260580 未加载

评论 #41260628 未加载

评论 #41260596 未加载

评论 #41260878 未加载

arcfour9 months ago

I immediately groaned when I read "public bucket."On AWS you'd put CloudFront in front of the (now-private) bucket as a CDN, then use WAF for rate limiting, bot control, etc. In my experience GCP's services work similarly to AWS, so...is this not possible with GCP, or why wasn't this the setup from the get-go? That's the proper way to do things IMO.Signed URLs I only think of when I think of like, paid content or other "semi-public" content.

评论 #41260963 未加载

评论 #41260856 未加载

评论 #41261030 未加载

upon_drumhead9 months ago

Given that you want to be good stewards of book data, have you considered publishing bulk snapshots to archive.org on a set cadence? It would strongly reduce any needs to do any sort of bulk scraping and also ensure that should something happen to your service, the data isn't lost forever.

评论 #41262261 未加载

inopinatus9 months ago

I recall reports of cases like this nearly every day at AWS, and that was a decade ago.It wasn't unusual, for first-time victims at least, that we'd a) waive the fees and b) schedule a solution architect to talk them through using signed URLs or some other mitigation. I have no visibility into current practice either at AWS or GCP but I'd encourage OP to seek billing relief nevertheless, it can't hurt to ask. Sustainable customer growth is the public cloud business model, of which billing surprises are the antithesis.

评论 #41267643 未加载

languagehacker9 months ago

Did this guy just write a blog post about how he completely rewrote a functional feature to save $800?In all seriousness, the devil is in the details around this kind of stuff, but I do worry that doing something not even clever, but just nonstandard, introduces a larger maintenance effort than necessary.Interesting problem, and an interesting solution, but I'd probably rather just throw money at it until it gets to a scale that merits further bot prevention measures.

评论 #41261138 未加载

评论 #41260729 未加载

qaq9 months ago

Beauty of cloud :) This could be easily served by a $100/month DO droplet with 0 worries about $.

评论 #41260619 未加载

评论 #41260639 未加载

评论 #41260593 未加载

评论 #41260759 未加载

评论 #41260826 未加载

评论 #41263605 未加载

1a527dd59 months ago

I don't understand, why wasn't there a CDN in front of the public GCS bucket resources?

评论 #41260974 未加载

sakopov9 months ago

I must be missing something obvious, but what do signed URLs have to do with requests going directly to resources in a bucket instead of a CDN of some sort like Cloudflare? Signed URLs are typically used to provide secure access to a resource in a private bucket. But it seems like it's used as a cache of sorts?

评论 #41267560 未加载

elliot079 months ago

One suggestion to speed up perf. Use bucket#signed_url instead of file#signed_url, otherwise it's doing an HTTP request to Google every generation.

评论 #41267594 未加载

Waterluvian9 months ago

Do any cloud providers have a sensible default or easy-to-enable mode for “you literally cannot spend one penny until you set specific quotas/limits for each resource you’re allocating”?

评论 #41260569 未加载

rcarmo9 months ago

I had to do a similar thing a decade ago when someone started scraping my site by brute force. At the time I was using CoralCDN already, but my server was getting hammered, so I just started serving up assets with hashed URLs and changing the key every 24h--their scraper was dumb enough to not start again from scratch.I ended up using the exact same code for sharding, and later to move to a static site with Azure Storage (which lets me use SAS tokens for timed expiry if I want to).

评论 #41263961 未加载

andrewstuart9 months ago

I'm always surprised to read how much money companies are willing to spend on things that can be done for essentially nothing.I had a look at the site - why does this need to run on a major cloud provider at all? Why use VERY expensive cloud storage at 9 cents per gigabyte? Why use very expensive image conversion at $50/month when you can run sharp on a Linux server?I shouldn't be surprised - the world is all in on very expensive cloud computing.There's another way though assuming you are running something fairly "normal" (whatever that means) - run your own Linux servers. Serve data from those Linux computers. I use CloudFlare R2 to serve your files - its free. You probably don't need most of your fancy architecture - run a fast server on Ionos or Hetzner or something and stop angsting about budget alerts from Google for things that should be free and runnong on your own computers - simple,. straightforward and without IAM spaghetti and all that garbage.EDIT: I just had a look at the architecture diagram - this is overarchitected. This is a single server application that almost has no architecture - Caddy as a web server - a local queue - serve images from R2 - should be running on a single machine on a host that charges nothing or trivial amount for data.

评论 #41261116 未加载

评论 #41261044 未加载

评论 #41261034 未加载

评论 #41260601 未加载

评论 #41260683 未加载

评论 #41261514 未加载

评论 #41260785 未加载

hansvm9 months ago

What I just read is that for the cost of a single 16TB hard drive, they were able to rent a hard drive for 7 hours to stream 16TB, and they still had to devote meaningful engineering resources to avoid the cost overrun.Does anybody here have a success story where AWS was either much cheaper to operate or to develop for (ideally both) than the normal alternatives?

评论 #41263597 未加载

评论 #41264067 未加载

intelVISA9 months ago

10k/s... is that a lot? Computers are insanely fast nowadays..!

评论 #41263637 未加载

评论 #41265263 未加载

0xbadcafebee9 months ago

Rate limiting (and its important cousin, back-off retries) is an important feature of any service being consumed by an "outside entity". There are many different reasons you'll want rate limiting at every layer of your stack, for every request you have: brute-force resistance, [accidental] DDoS protection, resiliency, performance testing, service quality, billing/quotas, and more.Every important service always eventually gets rate limiting. The more of it you have, the more problems you can solve. Put in the rate limits you think you need (based on performance testing) and only raise them when you need to. It's one of those features nobody adds until it's too late. If you're designing a system from scratch, add rate limiting early on. (you'll want to control the limit per session/identity, as well as in bulk)

评论 #41265146 未加载

austin-cheney9 months ago

10k requests per second has historically been a lower challenge to overcome than 10k concurrent sessions on a single box. 10k concurrent sessions was the historic design goal for standing up Node.js 15 years ago.For everything high traffic and/or concurrency related my go to solution is dedicated sockets. Sockets are inherently session-oriented which makes everything related to security and routing more simple. If there is something about a request you don’t like then just destroy the socket. If you believe there is a DOS flood attack then keep the socket open and discard its messaging. If there are too many simultaneous sockets then jitter traffic processing via load balancer as resources become available.

paulddraper9 months ago

Remember kids, CDNs are your friend.You can roll/host your own anything. Except CDN, if you care about uptime.

solatic9 months ago

Did you try sticking your bucket behind Cloud CDN?Google's documentation is inconsistent, but you do not need to make your bucket public, you can instead grant read access only to Cloud CDN: <a href="https://cloud.google.com/cdn/docs/using-signed-cookies#configure_permissions" rel="nofollow">https://cloud.google.com/cdn/docs/using-signed-cookies#confi...</a>Dangerously incorrect documentation claiming the bucket must be public: <a href="https://cloud.google.com/cdn/docs/setting-up-cdn-with-bucket#make_your_bucket_public" rel="nofollow">https://cloud.google.com/cdn/docs/setting-up-cdn-with-bucket...</a>

评论 #41267633 未加载

twothamendment9 months ago

We recently had a bot from Taiwan downloading all of our images, over and over and over - similar to the author. By the time we noticed they had downloaded them many times over and showed no signs of stopping!Bots these days are our of control and have lost their mind!

评论 #41264287 未加载

feurio9 months ago

Maybe it's just me, but isn't ~10K r/s pretty much just, well, normal?

评论 #41262161 未加载

nirui9 months ago

In addition to "signing" the URL, you may also require users to login to view the original image, and serve visitors a compressed version. This could give you the benefit of gaining users (good for VC) while respecting the guests, as well as protecting your investments.Back in the old days where everyone operates their own server, another thing you could do is to just setup a per-IP traffic throttling with iptables (`-m recent` or `-m hashlimit`). Just something to consider in case one day you might grow tired of Google Cloud Storage too ;)

hypeatei9 months ago

So your fix was to move the responsibility to the web server and Redis instance? I guess that works but introduces a whole lot more complexity (you mentioned adding rate limiting) and potential for complete outage in the event a lot of requests for images come in again.

评论 #41260976 未加载

taeric9 months ago

I'm confused, isn't this literally the use case for a CDN?Edit: I see this is discussed in other threads.

评论 #41261129 未加载

aftbit9 months ago

I solved this problem for free with storage on B2 and a Cloudflare worker which offers free egress from B2. I don't know if they'd still offer it for free at 10k rps though!

Alifatisk9 months ago

I can't describe the surprise when I saw RoR being mentioned, that was unexpected but made the article way more exciting to read.Wouldn't this be solved by using Cloudflare R2 though?

评论 #41261004 未加载

quectophoton9 months ago

Thank you for saying it as 10k requests/second. It makes it way more clear than if you had instead said requests/minute, or worse, requests/day.

mannyv9 months ago

We put assets in backblase and use fastly to cdn, because the cost is super low. It's a bit more work but super cheap.

评论 #41262522 未加载

EGreg9 months ago

We've designed our system for this very use case. Whether it's on commodity hardware or in the cloud, whether or not it's using a CDN and edge servers, there are ways to "nip things in the bud", as it were, by rejecting requests without a proper signed payload.For example, the value of session ID cookies should actually be signed with an HMAC, and checked at the edge by the CDN. Session cookies that represent a authenticated session should also look different than unauthenticated ones. The checks should all happen at the edge, at your reverse proxy, without doing any I/O or calling your "fastcgi" process manager.But let's get to the juicy part... hosting files. Ideally, you shouldn't have "secret URLs" for files, because then they can be shared and even (gasp) hotlinked from websites. Instead, you should use features like X-Accel-Redirect in NGINX to let your app server determine access to these gated resources. Apache has similar things.Anyway, here is a write-up which goes into much more detail: <a href="https://community.qbix.com/t/files-and-storage/286" rel="nofollow">https://community.qbix.com/t/files-and-storage/286</a>

评论 #41263434 未加载

评论 #41261047 未加载

the84729 months ago

The dreaded C10k problem, remaining unsolved to this day.

评论 #41261010 未加载

lfmunoz49 months ago

Don't understand why hosting providers charge for egress. Why isn't it free? Doesn't that mean that we don't have an open internet, isn't that against net neutrality?

busymom09 months ago

> The previous day I was experimenting with Google Cloud Run, trying to migrate our Next.js staging environment from Vercel to there to save some money. I assumed I misconfigured that service and turned it off and went about my day.I am sorry but who sees a $100 sudden charge, assumes misconfiguration and just goes about their day without digging deeper right away?

评论 #41261833 未加载