This is more just "missed optimization opportunities in EC2" than a statement about mistakes in AWS as a whole.<p>If you want to talk systemic AWS mistakes you can make, we accidentally created an infinite event loop between two Lambdas. Racked up a several-hundred-thousand-dollar bill in a couple of hours. You can accidentally create this issue across lots of different AWS services if you don't verify you haven't created any loops between resources and don't configure scaling limitations where available. "Infinite" scaling is great until you do it when you didn't mean to.<p>That being said, I think AWS (can't speak for other big providers) does offer a lot of value compared to bare-metal and self-hosting. Their paradigms for things like VPCs, load balancing, and permissions management are something you end up recreating in almost every project anyway, so you might as well railroad that configuration process. I've seen companies that ran their own infrastructure make things like DB backups and upgrades so painful that it would be hard to give up a managed DB service like RDS for anything other than a personal project.<p>After so many years using AWS at work, I'd never consider anything besides Fargate or Lambda for compute solutions, except maybe Batch if you can't fit scheduled processes into Lambda's time/resource limitations. If you're just going to run VMs on EC2, you're better off with other providers that focus on simple VM hosting.
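One cheap defense against that kind of Lambda-to-Lambda loop is a hop counter carried in the event itself. This is a minimal sketch, not an AWS feature: the `hop_count` field and `MAX_HOPS` limit are things you'd add to your own event schema.

```python
# Hypothetical guard against accidental Lambda-to-Lambda event loops.
# Assumes every event your functions forward carries a "hop_count"
# field that each hop increments; this is your own convention, not
# anything AWS provides out of the box.

MAX_HOPS = 5  # deeper than this is almost certainly a cycle


def check_hops(event: dict) -> int:
    """Return the incremented hop count, or raise if the event has
    bounced between functions too many times."""
    hops = event.get("hop_count", 0) + 1
    if hops > MAX_HOPS:
        raise RuntimeError(f"possible event loop: {hops} hops")
    return hops


def handler(event, context=None):
    event["hop_count"] = check_hops(event)
    # ... do the real work, then forward `event` to the next resource ...
    return event
```

Setting a reserved concurrency limit on each function caps the blast radius too, but the counter lets you fail loudly instead of just throttling the loop.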
AWS is complexity-as-a-service. This is why, as a one-man company, I went baremetal[1]. One flat price, screaming fast performance, and massive scalability if you get a beefy enough machine[2]. I don't have time to fiddle with k8s, try to figure out AWS billing/performance tradeoffs, or deal with untraceable performance issues due to noisy neighbours and VM overhead. My disaster recovery plan is a simple DB dump script to S3, and I know I can get another baremetal server up and running in less than 20 minutes.<p>[1] with IBM Cloud 1 year free startup credits<p>[2] Let's Encrypt and StackOverflow run their entire databases on a single beefy baremetal machine. <a href="https://letsencrypt.org/2021/01/21/next-gen-database-servers.html" rel="nofollow">https://letsencrypt.org/2021/01/21/next-gen-database-servers...</a>
Biggest mistake I’ve made:<p>Lifting any non-trivial infrastructure into AWS verbatim is always more expensive than running it yourself. You need to rearchitect it carefully around the PaaS services to make a cost saving or even break even.<p>An extreme example of this is my cousin, who works for a small dev company doing LOB stuff. They moved their SQL box into AWS, and that single RDS instance now costs more to run than their entire legacy infra did per year.<p>I’d still rather use AWS though. The biggest gain is not the technology but not having to argue with several vendor sales teams or file a PO and wait for finance to approve it. All I do is click a button and the thing’s there.
I've made it a habit to absolutely avoid any and all AWS services for any side projects, unless it's on the employer's dime. I'd rather pay a bit more per month for a flat-fee Digital Ocean droplet. Maybe I'll end up paying a few dollars more than I would with the equivalent AWS setup, but I'll rest easy knowing I won't get a surprise bill thanks to the opaque and byzantine billing. I mean, there are consultancies whose entire premise is expertise on AWS billing, so the chance of AWS newbie-me running up many thousands because I forgot to switch off service A or had the wrong setting for service B is non-zero.<p>And the general advice is "don't worry, call their customer support and they'll refund you". Um, seriously? If I want to spend a morning on hold to deal with a huge unplanned bill I'll call my local tax office, thank you.<p>Which sucks as I learn best by building things in my spare time, but AWS makes that learning process a bit more stressful than I'd prefer.
I nearly built myself a very nice footgun not long ago.<p>The setup: MediaConvert (video transcoding), direct upload to an S3 bucket, the bucket fires an event to my application, my application builds the job and submits it to MediaConvert with the output bucket as the destination.<p>Straightforward enough, unless you happen to be copying a config while tired and put your input/output buckets as the same bucket...<p>Fortunately previous-me was paranoid enough to have put in an if check that dies if they're the same, but otherwise that could have cost a lot of money.
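That paranoia check is worth spelling out, since the failure mode (output write re-fires the upload event, which submits another job, forever) is so expensive. A sketch, with the function and bucket names purely illustrative; the `Settings` shape loosely mirrors a MediaConvert job but is trimmed way down:

```python
# Refuse to submit a transcode job whose output bucket is the same as
# its input bucket - otherwise writing the output re-triggers the S3
# upload event and the job submits itself in a loop.

def validate_buckets(input_bucket: str, output_bucket: str) -> None:
    if input_bucket == output_bucket:
        raise ValueError(
            "input and output bucket are identical - this would "
            "re-trigger the upload event and loop forever"
        )


def build_job(input_bucket: str, key: str, output_bucket: str) -> dict:
    """Build a (heavily simplified) MediaConvert-style job dict."""
    validate_buckets(input_bucket, output_bucket)
    return {
        "Settings": {
            "Inputs": [{"FileInput": f"s3://{input_bucket}/{key}"}],
            "OutputGroups": [{
                "OutputGroupSettings": {
                    "FileGroupSettings": {
                        "Destination": f"s3://{output_bucket}/"
                    }
                }
            }],
        }
    }
```

An S3 event filter on a key prefix (only fire on `uploads/`, write outputs to `transcoded/`) is another belt-and-braces option even within one bucket.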
Nothing for me compares to the time I purchased 2 reserved EC2 instances for about $5K on my personal account rather than the company's. I can still remember that sinking feeling as I realized what I'd done.<p>Amazon refunded it the next day.
In summary: either overprovisioning, or not realising every extra CPU cycle or I/O operation costs extra money.<p>This is, of course, the real way "the cloud" makes money. Carefully tuned, it can no doubt be cheaper than do-it-yourself; however, it is also quite easy to rack up a lot of cost.
My favorite billing mistake was forgetting to delete an unused elastic IP address and then realizing I was being charged $34 / month for 2 months just to have it exist while doing nothing.<p>Edit: It's exactly $33.62 and I was mistaken on what caused it. It came from having a NAT Gateway just idling which is $0.045 per hour x 747 hours = $33.62 on us-east-1.<p>I know it's not the biggest mistake ever, but these things creep up on you when you use CloudFormation and it continuously fails to delete resources so you're left having to manually trace through a bunch of resources. It's easy to leave things hanging.
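For the record, the arithmetic checks out; an idle NAT Gateway's hourly existence charge adds up fast (the $0.045/hour figure is the us-east-1 price quoted above, and varies by region):

```python
# An idle NAT Gateway bills per hour just for existing, before any
# data-processing charges. us-east-1 hourly rate as quoted above.

NAT_HOURLY_USD = 0.045


def idle_nat_cost(hours: float) -> float:
    """Dollars billed for the gateway merely existing for `hours`."""
    return NAT_HOURLY_USD * hours

# 747 hours comes to roughly $33.62, matching the bill above.
```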
A few easy ones as well:<p>1) Terminating instances that had ephemeral disks with stuff you needed, while thinking the EBS volumes would remain<p>2) Leaving NAT gateways lying around, or ELBs that do nothing and have no instances attached<p>3) Public S3 buckets - arguably the most common one that can lead to security incidents<p>4) Debugging security groups/Network ACLs and straight up breaking networking for something without knowing it. The reverse of that: you want to fix something quickly, open 0.0.0.0/0 to everyone, and never get around to tightening up the firewall later on.
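Item 4 is easy to audit for after the fact. A sketch of a check for world-open ingress rules; the dict shape loosely follows what boto3's `describe_security_groups` returns, but the function only walks plain dicts, so it runs against any dump of your groups:

```python
# Flag security-group ingress rules left open to the whole internet.
# The input dict shape mirrors an EC2 DescribeSecurityGroups entry
# (IpPermissions / IpRanges / CidrIp), but nothing here calls AWS.

def world_open_rules(security_group: dict):
    """Yield (from_port, to_port, protocol) for each ingress rule
    that allows 0.0.0.0/0."""
    for perm in security_group.get("IpPermissions", []):
        for ip_range in perm.get("IpRanges", []):
            if ip_range.get("CidrIp") == "0.0.0.0/0":
                yield (perm.get("FromPort"),
                       perm.get("ToPort"),
                       perm.get("IpProtocol"))
```

Run it over everything a `describe_security_groups` call returns and you get a quick list of the "I'll tighten it later" holes.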
One of the biggest mistakes I made is not exploring spot instances and reserved instances earlier.<p>I cut my bill by 70-80% after paying full price for years...<p>If you have an active web server or backend workers with fairly short jobs, spot instances will work for you.
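The short-jobs caveat works because spot gives you a two-minute warning: the instance-metadata endpoint `/latest/meta-data/spot/instance-action` starts returning JSON like `{"action": "terminate", "time": "..."}` when a reclaim is scheduled (and 404 otherwise). A sketch of the drain decision a worker loop could make; the HTTP polling is left out so this stays runnable anywhere:

```python
# Decide whether a spot worker should stop pulling new jobs, given the
# body of the spot/instance-action metadata response. The two-minute
# threshold matches the notice window AWS gives before reclamation.

import json
from datetime import datetime, timezone


def should_drain(instance_action_body: str, now: datetime) -> bool:
    """True if the scheduled reclaim is within the two-minute window,
    i.e. finish the current short job and take no new ones."""
    notice = json.loads(instance_action_body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return (when - now).total_seconds() <= 120
```

A worker with jobs well under two minutes can almost always exit cleanly, which is why spot fits that workload so well.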
I view AWS as a study in doing everything the "bare hands" way. Here are some examples of the old sysadmin ways of doing things vs the modern "web" way:<p>* regions -> self-balancing algorithms like RAFT<p>* roles/permissions -> tokens<p>* IP address filtering -> tokens<p>* CPU clusters -> multicore/containerization/Actor model<p>* S3 -> IPFS or similar content-addressable filesystems<p>It's not just AWS having to deal with this stuff either:<p>* CORS -> Subresource Integrity (SRI)<p>* server languages (CGI) -> Server-Side Includes (SSI)<p>* Javascript -> functional reactive, declarative and data-driven components within static HTML<p>* async -> sandbox processes, fork/join, auto-parallelization (seen mostly in vector languages but extendable to higher-level functions)<p>* CSS -> a formal inheritance spec (analogous to knowing set theory vs working around SQL errata)<p>I could go on forever but I'll stop there. We are living at a very interesting time in the evolution of the web. I think that web dev has reached the point where desktop dev was in the mid-1990s and is ripe for disruption. No disruption will come from the big companies though, so this is your chance to do it from your parents' basement!
Ok, I'm going to admit to a mistake revolving around NAT gateways and Lambdas.
So, I basically wanted to connect a Lambda to a Postgres / RDS database; for that I had to put it into a private VPC, but the Lambdas still had to talk to the world (a lot), so I just put a NAT gateway in front of it, no biggie.
Well, end of the story: one day I produced 2000 Euro in NAT gateway costs, haha
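The sneaky part is that NAT gateways bill per GB of data processed on top of the hourly charge (around $0.045/GB in us-east-1; ignoring the EUR/USD difference for the sketch), so a chatty Lambda can move a bill-sized amount of traffic without anything looking broken:

```python
# Rough sanity check: how much traffic does it take to hit a given
# NAT-gateway data-processing bill? Per-GB rate is the approximate
# us-east-1 price; currency conversion is hand-waved away here.

NAT_PER_GB_USD = 0.045


def gb_needed_for_bill(bill_usd: float) -> float:
    """GB of processed traffic that produces `bill_usd` in charges."""
    return bill_usd / NAT_PER_GB_USD

# A ~2000 bill corresponds to roughly 44,000 GB through the gateway.
```

VPC endpoints for the AWS services the Lambda talks to would have routed a lot of that traffic around the gateway entirely.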
My biggest mistake: years ago I ended up pushing personal credentials to GitHub at night and woke up to a several-thousand-dollar bill in the morning.<p>I changed the credentials and cancelled all the running instances, only to find that I’d missed some.<p>It was resolved by the afternoon.
But what mistakes did he make?
Did he screw up the bill? Did he fail to keep services available? I only read facts about the ins and outs of AWS' billing and credits system.
Burst CPU and IOPS has bitten me a couple times over the years. In fact, it’s basically the sole cause of nearly all our downtime in recent history. That’s frustrating. I get that it’s a technical solution to the problem of resource utilization at scale, but they could’ve spent some time making it easier to observe — for example, rescale the CPU or IOPS graphs so that 100% is your max sustained budget, and anything over 100% eats into your quota.
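The rescaling described above is trivial to do yourself once you know your instance's documented baseline (e.g. roughly 10% for a t3.micro per AWS's burstable-instance docs; check the table for your type). A sketch of the transform you'd apply to the CloudWatch `CPUUtilization` series before plotting:

```python
# Replot raw CPUUtilization so that 100 means "at the sustained
# baseline" for a burstable (t2/t3) instance; anything above 100 is
# visibly spending CPU credits. The baseline percentage comes from
# AWS's docs for your instance type and is passed in, not looked up.

def rescale_to_baseline(cpu_utilization_pct: float,
                        baseline_pct: float) -> float:
    """Map raw utilization onto a scale where 100 = max sustainable."""
    return 100.0 * cpu_utilization_pct / baseline_pct
```

On that scale a graph sitting at 180 screams "credit burn" in a way that a raw 18% never does.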
Slightly OT: I love Forge but recently I've started using it for my non-PHP projects which feels... wrong. Are there any similar services that are more agnostic?
On billing: they will never do it, but on smaller accounts they could build trust by offering some sort of "prepaid" mode like cell phone services do at the low end.<p>That is, you deposit $X in your account, and AWS nukes your live services if you breach it. The worst that ever happens is you're out the $X you had already deposited.
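You can approximate the prepaid idea yourself with billing alarms plus your own kill switch. A sketch of just the decision logic; fetching month-to-date spend (e.g. from Cost Explorer) and the actual teardown are left abstract because they're account-specific, and the 80% warn threshold is an arbitrary choice:

```python
# DIY "prepaid mode": compare month-to-date spend against a fixed
# deposit and decide whether to warn or tear everything down. The
# spend figure would come from a billing API; here it's just a number.

def budget_action(month_to_date_usd: float,
                  deposit_usd: float,
                  warn_fraction: float = 0.8) -> str:
    """Return 'ok', 'warn', or 'nuke' for the current spend level."""
    if month_to_date_usd >= deposit_usd:
        return "nuke"
    if month_to_date_usd >= warn_fraction * deposit_usd:
        return "warn"
    return "ok"
```

The catch, and why it's only an approximation: AWS billing data lags by hours, so a runaway service can blow well past the "deposit" before your nuke ever fires.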
<i>"Technically they are a smidgen slower than Intel for certain workloads."</i><p>In my experience, after migrating several servers with quite varying workloads, they're <i>faster</i> than Intel - and more than a smidgen. Just as is the general case with current AMD Ryzen vs Intel.
[Disclosure] I'm Co-Founder and CEO of <a href="http://vantage.sh/" rel="nofollow">http://vantage.sh/</a>, a cloud cost platform for AWS. Previously I was a product manager at AWS and DigitalOcean.<p>Since the author and so many people are commenting about AWS costs (and in particular, choosing cheaper EC2 instances and EBS volumes), I thought I'd mention that Vantage has recommendations for exactly these things so you don't get tripped up / spend more than you have to.<p>If you have "antiquated" EC2 instances or EBS volumes, Vantage will give you a recommendation for which instance to switch to and how much money you'll save.<p>The first $2,500/month in AWS costs are also tracked for free, so people get a lot of value out of the free tier and can save significant parts of their bills when developing on AWS.
On a price-sensitive project I almost exclusively used spot instances at a <i>dramatically</i> reduced price compared to on-demand. It forced me to build high-availability elements into the design at the outset, though ultimately spot instances were shut down no more frequently than on-demand instances were (through maintenance and individual machine outages) in my experience.<p>Obviously mileage will vary, but going in I was under the impression that spot instances were on a knife's edge, when with a decent pricing strategy they're as robust as on-demand at a fraction of the cost.