Incident report for February 21st, 2024

73 points by RyeCombinator over 1 year ago

16 comments

Man, I use Resend and I really want to like them. It is really simple to get started, which is great, but I think they need to slow down a bit and put some processes in place that intentionally slow things down.

This is the second incident that can be characterized by a very hazy delineation between development and production environments. The first incident had to do with an attacker gaining access to private credentials because devs left keys set in NEXT_PUBLIC environment variables on their site.
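
For context on that first incident: Next.js inlines any variable whose name starts with NEXT_PUBLIC_ into the browser bundle, so a key stored under that prefix is effectively public. A minimal sketch of a pre-deploy check follows. The function name and the list of "secret-looking" name fragments are assumptions for illustration, not anything Resend has described.

```python
import re

# Name fragments that usually indicate a secret. This list is an assumption;
# tune it for your own project.
SECRET_HINTS = re.compile(
    r"(SECRET|TOKEN|PASSWORD|PRIVATE|API_KEY|SERVICE_ROLE)", re.IGNORECASE
)


def exposed_secret_names(env: dict[str, str]) -> list[str]:
    """Return NEXT_PUBLIC_ variable names that look like they hold a secret."""
    return sorted(
        name
        for name in env
        if name.startswith("NEXT_PUBLIC_") and SECRET_HINTS.search(name)
    )
```

In CI this could be called with `dict(os.environ)` and fail the build when the returned list is not empty. It only inspects names, so it will miss a secret stored under an innocent-looking name.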

kgeist over 1 year ago

Reminds me of this classic thread: https://www.reddit.com/r/cscareerquestions/s/cqrama0L1z

> accidentally destroyed production database on first day of job

Same symptoms: while developing a feature locally, they accidentally pointed to the production DB, which was destroyed when running tests.

globular-toast over 1 year ago

Sometimes I think the infra at the small company I work at isn't great, but we've not had direct dev access to prod DBs from day one. They're locked down at the IP level. You'd have to go through some serious hoops to accidentally connect to prod even if you had the keys.

I remember as a child thinking adults had everything under control. That they know what they're doing. I guess I assumed a day would come when I too would know. That day never came. It's easy to think when you look at shiny websites and the first paragraph of this comment that other adults do know. But I'm often reminded of the truth: nobody knows. Everyone is always operating at least slightly outside their comfort zone or, in the case of the article, wildly.

yatish27 over 1 year ago

"While building a feature, we performed a database migration command locally, but it incorrectly pointed to the production environment instead, which dropped all tables in production."

This was scary.
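
One cheap guard against this failure mode is to have the migration script refuse to run unless the connection string points at a local host. The sketch below is only an illustration under assumptions: the function name, the exception class, and the `allow_remote` override are hypothetical, and nothing here reflects how Resend's tooling actually works.

```python
from urllib.parse import urlparse

# Hosts treated as "local". Anything else needs an explicit override.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


class UnsafeTargetError(RuntimeError):
    """Raised when a destructive command targets a non-local database."""


def assert_safe_migration_target(database_url: str, allow_remote: bool = False) -> str:
    """Return the target host, or raise if it is remote and not explicitly allowed."""
    host = urlparse(database_url).hostname
    if host is None:
        raise UnsafeTargetError("could not determine the database host")
    if host not in LOCAL_HOSTS and not allow_remote:
        raise UnsafeTargetError(
            f"refusing to run a destructive migration against {host!r}"
        )
    return host
```

Calling this at the top of a local migration script means a stray production URL in the environment fails loudly before any table is touched. It does not replace network-level restrictions, since it only protects code paths that remember to call it.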

doctor_eval over 1 year ago

If they’re small then you can see it happening where someone was logged into prod using some environment variables to sort out some issue (probably didn’t even update production, just ran a few queries) and then went back to work.

Hours later they run some script that does DROP DATABASE from that same shell they used to troubleshoot, which takes a little longer than usual…

Anyway, I can totally see it happening to me in my little one-man shop, but now I think I might look at removing some privileges from my prod account :)

_andrei_ over 1 year ago

I hate it that some companies focus more now on how their product's website looks than on the product itself, its quality and stability. I also hate it that users are so used to having issues, errors, and their data leaked, that a startup can cut all the corners and do things completely the wrong way, and still have success. I hate that because I couldn't do it. Last month Resend leaked a database API key [0] from their env, how is that even possible? Now a developer tried to run a migration locally and did it on production, again, WHY is that possible? How can you talk about enterprise plans, SLAs, ACL features, when you can't do the basics right? What's it gonna be next month?

[0] https://resend.com/blog/incident-report-for-january-10-2024

walrus01 over 1 year ago

From the company's homepage: "deliver marketing emails at scale"

Maybe this company doesn't need to exist, and shouldn't.

anonzzzies over 1 year ago

This sounds like one of those horrible tools like Prisma which drop everything if something is not in sync on dev. We removed this type of stupid in favour of our own tool which, you know, actually fixes this instead of lazily dropping everything when it cannot resolve some trivial thing, for instance, a new required field without a default when there are already rows, and other crap which they call ‘opinionated’. No idea why we ever used that stuff as it caused so much grief (never on prod though); after Prisma/Drizzle and some other ‘modern’ horrors, we mistrust everything that’s ‘hip and new’ so ‘everyone uses it’. One of those hip and new things I mistrusted was Resend even though ‘suddenly everyone uses it’. I’ll wait a few years before even contemplating it.

tedchs over 1 year ago

I did this, around 2005, but I dropped ALL the prod tables. I was using a SQL GUI called Toad (awesome) and had separate windows open, for both "dev" and "prod". I was trying to reset the dev database, and used the wrong window. Thankfully, the "real" DBA at the time had 15-minute backups, and was able to restore it, and then I replayed a few transactions from logs. Lesson learned!

> While building a feature, we performed a database migration command locally, but it incorrectly pointed to the production environment instead, which dropped all tables in production.

dpcx over 1 year ago

I did something like this nearly 15 years ago at a job. I was trying to use something like MySQL Workbench to export the database and generate a map of connections through foreign keys. Apparently I selected some option backwards and it deleted the entire production database.

Luckily for me I'd been working on some other things related to it and had taken a backup not long prior, but it was pretty nerve-wracking to hear the CTO/CEO nearly running through the halls to find out what had happened.

Pretty sure they never implemented more stringent access controls at that company, either.

nbittich over 1 year ago

Incident reports / post-mortems could be the best way to promote a dev tool. I didn't know what Resend was before someone deleted the prod database. Now I wonder if I need it / if I want it.

wrftaylor over 1 year ago

I love what Resend are doing and am a customer. I can also absolutely empathise as our lead engineer did exactly the same thing at a startup I was running a decade ago. It's a horrible situation.

But yeah, both the incident and the report are really tough to read. It would be great if they can do a follow-up with further actions they're taking.

There's a neo-bank called Revolut that allegedly at one point had just two teams: "go fast" and "don't screw it up". I feel like an infrastructure play needs some dedicated hires in camp 2.

PeterZaitsev over 1 year ago

You should always be ready for your database to get trashed: application bugs, operator error, hacker intervention...

What strikes me in the incident report is that they focus on the failed migration, when the real issue is not planning or testing for recovery if a migration goes very wrong.

Even if backup recovery took just 6 hours, would that be acceptable?
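
"Testing for recovery" can be as simple as regularly restoring a backup into a scratch database and checking that the data is really there. Here is a toy sketch of such a drill using SQLite from the Python standard library. The `restore_drill` name and the row-count check are assumptions for illustration; a real drill would restore from the actual backup files of the production engine and also time how long it takes.

```python
import sqlite3


def restore_drill(source: sqlite3.Connection, tables: list[str]) -> dict[str, int]:
    """Copy `source` into a scratch database and verify per-table row counts."""
    scratch = sqlite3.connect(":memory:")
    try:
        # Stand-in for "restore the latest backup somewhere safe".
        source.backup(scratch)
        counts = {}
        for table in tables:
            # Table names come from a trusted, hard-coded list, never user input.
            query = f'SELECT COUNT(*) FROM "{table}"'
            expected = source.execute(query).fetchone()[0]
            restored = scratch.execute(query).fetchone()[0]
            if restored != expected:
                raise RuntimeError(
                    f"{table}: restored {restored} rows, expected {expected}"
                )
            counts[table] = restored
        return counts
    finally:
        scratch.close()
```

Row counts are a weak check, but even this much would catch an empty or missing backup long before an incident does.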

hw over 1 year ago

Production database access must always be locked down from external traffic, and should only allow traffic from the production application or from within the production environment.

Aside from stopping a local dev setup from accidentally pointing to the prod DB, this matters because a DB that is accessible externally is susceptible to network attacks and password attacks.
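
In practice this rule lives in a firewall, a cloud security group, or the database's own host-based access config rather than in application code. The logic it enforces is just an allowlist of networks, which can be sketched as below. The `10.0.0.0/16` range is a made-up example of a production network, and `is_allowed` is a hypothetical name.

```python
import ipaddress

# Hypothetical production network range; replace with your own.
ALLOWED_NETWORKS = [ipaddress.ip_network("10.0.0.0/16")]


def is_allowed(client_ip: str) -> bool:
    """Return True only if the client address falls inside an allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in ALLOWED_NETWORKS)
```

With a rule like this in front of the database, a laptop holding valid production credentials still cannot connect, which is the property described above.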

rnts08 over 1 year ago

This will keep happening as long as people are unable to learn from the past. Yes, it's expensive to have a good, experienced infrastructure engineer on the team, but at least you know there's someone testing your backups and procedures for when your eager dev team screws up.

pistoriusp over 1 year ago

Unfortunately these sorts of mistakes are seen as a "rite of passage" for many developers. I ran `DELETE FROM users;` without a WHERE clause against production in my first year on the job. I felt absolutely terrible. I thought I was connected to a development machine.

Fortunately we had backups available.

Often this isn't a problem with the individual developer, but points to a problem with the organization. Frankly, most developers shouldn't have access to a production database, let alone mutable access.

One major concern is loss of data, but another is privacy.

It's so frustrating to see this happening when there are tools that solve this, like Snaplet (I'm a founder) and Replibyte, which allow you to generate or obfuscate data for use in dev environments, and Neon, which allows you to branch your database.
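
For the missing-WHERE case specifically, one small safety net is to run the statement inside a transaction and roll back if it touches more rows than expected. The sketch below uses SQLite from the Python standard library purely for illustration. `guarded_delete` and its `max_rows` parameter are hypothetical names, and this is no substitute for restricting production access in the first place.

```python
import sqlite3


def guarded_delete(conn: sqlite3.Connection, sql: str, params=(), max_rows: int = 1) -> int:
    """Run a DELETE, but roll it back if it affects more than `max_rows` rows."""
    # With the module's default settings, sqlite3 opens a transaction
    # implicitly before a DELETE, so the rollback below undoes it.
    cur = conn.execute(sql, params)
    if cur.rowcount > max_rows:
        conn.rollback()
        raise RuntimeError(
            f"DELETE would remove {cur.rowcount} rows (limit {max_rows}); rolled back"
        )
    conn.commit()
    return cur.rowcount
```

A bare `DELETE FROM users` against a table with many rows raises and leaves the data intact, while a targeted `DELETE ... WHERE id = ?` goes through.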
