Postmortem of database outage of January 31

377 points by mbrain, over 8 years ago

35 comments

ky738 over 8 years ago
RIP the engineer
illumin8 over 8 years ago
I have to say - if they were using a managed relational database service, like Amazon's RDS Postgres, this likely would never have happened. RDS fully automates nightly database snapshots and ships archive logs to S3 every 5 minutes, which gives you the ability to restore your database to any point in time within the last 35 days, down to the second.

Also, RDS gives you a synchronously replicated standby database and automates failover, including updating the DNS CNAME that the clients connect to during a failover (so it is seamless to the clients, other than requiring a reconnect), and ensuring that you don't lose a single transaction during a failover (the magic of synchronous replication over a low-latency link between datacenters).

For a company like GitLab, which is public about wanting to exit the cloud, I feel like they could have really benefited from a fully managed relational database service. This entire tragic situation would never have happened if they had been willing to acknowledge the obvious: managing relational databases is hard, and to let someone with better operational automation, like AWS, do it for them.
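For illustration, a minimal boto3 sketch of the point-in-time restore described in the comment above. The region, instance identifiers, and timestamp are hypothetical placeholders, not anything from the postmortem.

```python
# Minimal sketch of an RDS point-in-time restore with boto3.
# Instance identifiers, region, and timestamp are hypothetical placeholders.
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Create a new instance containing the source instance's data as of a given second.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="prod-postgres",           # existing instance
    TargetDBInstanceIdentifier="prod-postgres-restored",   # new instance to create
    RestoreTime=datetime(2017, 1, 31, 23, 0, 0, tzinfo=timezone.utc),
    # Alternatively, instead of RestoreTime:
    # UseLatestRestorableTime=True,
)

# Wait until the restored instance is available before pointing clients at it.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="prod-postgres-restored")
```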
KayEss about 8 years ago
The engineers still seem to have a physical-server mindset rather than a cloud mindset. Deleting data is always extremely dangerous, and there was no need for it in this situation.

They should have spun up a new server to act as the secondary the moment replication failed. That new server is the one you run all of these commands on, and if you make a mistake you spin up another one.

Only when replication is back in good order do you go through and kill the servers you no longer need.

The procedure for setting up these new servers should be based on the same scripts that spin up new UAT servers for each release. You spin up a server that is a near copy of production and then do the upgrade to the new software on that. Only when you've got a successful deployment do you kill the old UAT server. This way all of these processes are tested time and time again, you know exactly how long they'll take, and you iron out problems in the automation.
meowface over 8 years ago
> Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.

I could feel the sweat drops just from reading this.

I'd bet every one of us has experienced the panicked Ctrl+C of Death at some point or another.
atmosx about 8 years ago
Great to have a full-featured, professional post-mortem. Incidentally, I work at a company that suffered data loss because of this outage, and we're looking for ways to move out of GL.

My 2 cents... I might be the only one, but I don't like the way GL handled this case. I understand transparency as a core value and all, but they've gone a bit too far.

IMHO this level of exposure has far-reaching privacy implications for the people who work there. Implications that cannot be assessed now.

The engineer in question might not have suffered PTSD, but some other engineer might have. Who knows how a bad public experience might play out? It's a fairly small circle; I'm not sure I would like to be part of a company that would expose me in a similar fashion if I happened to screw up.

On the corporate side of things, there is a saying in Greek: "Τα εν οίκω μη εν δήμω", meaning don't wash your dirty linen in public. Although they're getting praised by bloggers and other small startups, at the end of the day, exposing your six-layer broken backup policy and other internal flaws in between, while being funded to the tune of $25.62M over 4 rounds, does not look good.
gr2020 over 8 years ago
Reading this, the thing that stuck out to me was how remarkably lucky they were to have the two snapshots. The one from 6 hours earlier was there seemingly by chance, as an engineer had created it for unrelated reasons. And for both the 6- and 24-hour snapshots, it seems just lucky that neither had any breaking changes made to them by pre-production code (they _were_ dev/staging snapshots, after all).

I'm glad it all worked out in the end!
greenrd over 8 years ago
GitHub also lost a bunch of PRs and issues sitewide early in their history. They claimed to have restored all the PRs from backup, but I was pretty sure I had opened a PR and it never came back. I emailed support and they basically told me tough luck.
ancarda about 8 years ago
> Unfortunately DMARC was not enabled for the cronjob emails, resulting in them being rejected by the receiver. This means we were never aware of the backups failing, until it was too late.

At my day job, we gradually stopped using email for almost all alerts; instead we have several Slack channels, like #database-log where MySQL errors go. Any cron jobs that fail post in #general-log. Uptime monitoring tools post in #status. And so on...

Email has so much anti-spam machinery, like DMARC, that it is less certain your mail will be delivered. Something like a failing backup or database query is too important to risk never reaching someone who can make sure it gets fixed.

My 2 cents.
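A dead-simple version of the Slack-based alerting this comment describes, as a hedged sketch: wrap the cron command and post any failure to a Slack incoming webhook instead of relying on email. The webhook URL is a placeholder, not a real endpoint.

```python
# Sketch: wrap a cron job and post failures to a Slack incoming webhook
# instead of relying on email delivery. The webhook URL is a placeholder.
import subprocess
import sys

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def run_and_alert(cmd: list[str]) -> int:
    """Run a command; on non-zero exit, post the error output to Slack."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        cmd_str = " ".join(cmd)
        requests.post(
            SLACK_WEBHOOK_URL,
            json={
                "text": f"cron job failed: {cmd_str} (exit {result.returncode})\n"
                        f"{result.stderr[-1000:]}"
            },
            timeout=10,
        )
    return result.returncode

if __name__ == "__main__":
    # Usage: python cron_alert.py /usr/local/bin/backup.sh --full
    sys.exit(run_and_alert(sys.argv[1:]))
```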
matt_wulfeck about 8 years ago
> Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.

I can only imagine this engineer's poor old heart after the realization of removing that directory on the master. A sinking, awful feeling of dread.

I've had a few close calls in my career. Each time it's made me pause and thank my luck it wasn't prod.
nowarninglabel over 8 years ago
Thanks so much for the post and transparency, GitLab! We had just finished recovering from our own outage (stemming from a power loss and subsequent cascading failures) and were scheduled to do our post-mortem on 2/1, so the original document was a refreshing and reassuring read.
aabajian about 8 years ago
This is an outstanding writeup, but I wonder if it glosses over the real problem:

>> The standby (secondary) is only used for failover purposes.

>> One of the engineers went to the secondary and wiped the data directory, then ran pg_basebackup.

IMO, secondaries should be treated exactly as their primaries. No operation should be done on a secondary unless you'd be OK doing that same operation on the primary. You can always create another instance for these operations.
voidlogic over 8 years ago
> When we went to look for the pg_dump backups we found out they were not there. The S3 bucket was empty, and there was no recent backup to be found anywhere. Upon closer inspection we found out that the backup procedure was using pg_dump 9.2, while our database is running PostgreSQL 9.6 (for Postgres, 9.x releases are considered major). A difference in major versions results in pg_dump producing an error, terminating the backup procedure.

Yikes. One common practice that would have avoided this is using the just-taken backup to populate staging. If the restore fails, pages go out. If the integration tests that run after a successful restore/populate fail, pages go out.

Live and learn, I guess.
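A minimal sketch of the "restore the backup into staging and page if it breaks" practice this comment describes. The host, database, user, dump path, table name, and paging hook are all hypothetical placeholders.

```python
# Sketch: verify last night's pg_dump by restoring it into a throwaway
# staging database and running a sanity query. All names are placeholders.
import subprocess
import sys

import psycopg2

DUMP_FILE = "/backups/latest.dump"   # custom-format pg_dump output (placeholder)
STAGING_HOST = "staging-db"
STAGING_DB = "staging_restore"
STAGING_USER = "restore_check"

def page_oncall(message: str) -> None:
    # Placeholder: hook into PagerDuty/Slack/etc. here.
    print(f"PAGE: {message}", file=sys.stderr)

def verify_backup() -> bool:
    # Use the same major version of pg_restore as the server -- the mismatch
    # that silently broke the pg_dump backups in the postmortem.
    restore = subprocess.run(
        ["pg_restore", "--clean", "--no-owner",
         "-h", STAGING_HOST, "-U", STAGING_USER, "-d", STAGING_DB, DUMP_FILE],
        capture_output=True, text=True,
    )
    if restore.returncode != 0:
        page_oncall(f"pg_restore failed: {restore.stderr[-500:]}")
        return False
    # Basic integration check: the restored data actually contains rows.
    with psycopg2.connect(host=STAGING_HOST, dbname=STAGING_DB, user=STAGING_USER) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM projects;")  # hypothetical table
            if cur.fetchone()[0] == 0:
                page_oncall("Restore completed but key table is empty")
                return False
    return True

if __name__ == "__main__":
    sys.exit(0 if verify_backup() else 1)
```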
_Marak_ over 8 years ago
I've noticed a lot of other positive activity and press for GitLab in the past month.

It's unfortunate they had this technical issue, but it's good to see others (besides GitHub) operating in this space. I should give GitLab a try sometime.
pradeepchhetri about 8 years ago
Just want to add here that using tools like safe-rm [1] across your infrastructure would help prevent data loss from running rm on unintended directories.

[1]: https://launchpad.net/safe-rm
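To illustrate the idea behind safe-rm (which itself is a wrapper configured via a system-wide blacklist), here is a hedged sketch of a guard script; the protected paths are illustrative, not safe-rm's actual defaults.

```python
# Sketch of the safe-rm idea: refuse to delete protected paths, otherwise
# hand off to the real rm. The protected list is purely illustrative.
import os
import sys

PROTECTED = {"/", "/etc", "/usr", "/var", "/var/opt/gitlab/postgresql/data"}

def guarded_rm(paths: list[str]) -> int:
    for p in paths:
        if os.path.realpath(p) in PROTECTED:
            print(f"refusing to delete protected path: {p}", file=sys.stderr)
            return 1
    # No protected path involved: exec the real rm in place of this process.
    os.execvp("rm", ["rm", "-rf", "--"] + list(paths))

if __name__ == "__main__":
    sys.exit(guarded_rm(sys.argv[1:]))
```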
jsperson over 8 years ago
> An ideal environment is one in which you can make mistakes but easily and quickly recover from them with minimal to no impact.

This is a great attitude. Too often opportunity cost isn't considered when making rules to protect folks from doing something stupid.
yarper over 8 years ago
It's amazing how quickly it descends into "one of the engineers" did x or y. Who was steering this ship exactly?

It's really simple to point the finger and try to find a single cause of failure - but it's a fool's errand - comparable to finding the single source behind a great success.
isoos over 8 years ago
sytse and GitLab folks: thank you for the transparency.
samat over 8 years ago
Am I missing something, or didn't they mention 'test recovery, not backups'?
XorNot about 8 years ago
The backup situation stands out to me as a problem no one has really adequately solved. Verifying that a task has happened, in a way where the notifications are noticed, is actually a really hard problem that it feels like we collectively ignore in this business.

How do you reliably check that something didn't happen? Is the backup server alive? Did the script work? Did the backup work? Is the email server working? Is the dashboard working? Is the user checking their emails (think: a wildcard mail-sorting rule dumping a slight change in failure messages to the wrong folder)?

And the converse answer isn't much better: send a success notification... but if it mostly succeeds, how do you keep people paying attention to when it doesn't (i.e. no failure message, but no success message either)?

The best answer I've got, personally, is to use positive notifications combined with visibility - dashboard your really important tasks with big, distinctive colors - use time-based detection and put a clock on your dashboard (because dashboards that mostly don't change might hang and no one notices).
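One concrete version of the "positive notification" idea in the comment above is a dead man's switch: the backup job records a success heartbeat, and an independent watcher alarms when that heartbeat goes stale, so silence itself becomes the alert. A minimal sketch; the heartbeat path and staleness threshold are placeholders.

```python
# Sketch of a dead man's switch: alert when a "success" heartbeat goes stale,
# so a backup that silently never runs still produces an alarm.
# The heartbeat path and threshold are placeholders.
import time
from pathlib import Path

HEARTBEAT = Path("/var/run/backup.last_success")
MAX_AGE_SECONDS = 26 * 3600  # expect a daily backup, allow some slack

def record_success() -> None:
    """Called by the backup job only after it has fully succeeded."""
    HEARTBEAT.write_text(str(int(time.time())))

def check_heartbeat() -> bool:
    """Run from an independent host/scheduler; returns False if stale or missing."""
    if not HEARTBEAT.exists():
        return False
    age = time.time() - int(HEARTBEAT.read_text().strip())
    return age < MAX_AGE_SECONDS

if __name__ == "__main__":
    if not check_heartbeat():
        # Placeholder: page the on-call here rather than sending an email
        # that might be silently dropped.
        raise SystemExit("backup heartbeat is stale or missing")
```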
nodesocket over 8 years ago
My main question is still:

>> Why did replication stop? - A spike in database load caused the database replication process to stop. This was due to the primary removing WAL segments before the secondary could replicate them.

Is this a bug/defect in PostgreSQL then? Incorrect PostgreSQL configuration? Insufficient hardware? What was the root cause of the Postgres primary removing the WAL segments?
dancryer about 8 years ago
Can't help but notice that the new backup monitoring tool suggests that the latest PGSQL backup is almost six days old...

Is that correct? http://monitor.gitlab.net/dashboard/db/backups?from=1485941950082&to=1486978750114
nierman over 8 years ago
Yes, WAL archiving would have helped (archive_command = rsync standby ...), but it's also very easy in Postgres 9.4+ to add a replication slot on the master so that WAL is kept until it is no longer needed by the standby. Simply reference the slot in the standby's recovery.conf file.

Definitely monitor your replication lag - or at least disk usage on the master - with this approach (in case WAL starts piling up there).
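A hedged sketch of the replication-slot approach and the lag monitoring caveat from the comment above. Connection details and the slot name are placeholders; the queries use the PostgreSQL 9.4-9.6 catalog names (pg_xlog_location_diff, replay_location), which were renamed in later releases.

```python
# Sketch: create a physical replication slot on the primary so WAL is retained
# until the standby has consumed it, and report how far behind each standby is.
# DSN and slot name are placeholders.
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=postgres user=replication_admin"

with psycopg2.connect(PRIMARY_DSN) as conn:
    conn.autocommit = True  # run each statement outside an explicit transaction
    with conn.cursor() as cur:
        # Create the slot once; the standby references it in recovery.conf
        # (primary_slot_name = 'standby_1').
        cur.execute(
            "SELECT pg_create_physical_replication_slot(%s) "
            "WHERE NOT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = %s);",
            ("standby_1", "standby_1"),
        )

        # Bytes of WAL each standby still has to replay (9.4-9.6 function names).
        cur.execute(
            "SELECT application_name, "
            "       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS lag_bytes "
            "FROM pg_stat_replication;"
        )
        for name, lag_bytes in cur.fetchall():
            print(f"{name}: {lag_bytes} bytes behind")
            # Alert here if lag_bytes exceeds a threshold, and also watch disk
            # usage on the primary in case WAL piles up behind a dead slot.
```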
nstj over 8 years ago
@sytse were you in contact with MS/Azure during the restore? If so, did they offer any assistance, e.g. in speeding up restore disk throughput, etc.?
Achshar about 8 years ago
Does anyone have a link to the YouTube stream they're talking about? Can't seem to find it on their channel. And the link in the doc is redirecting to the live link [1], which doesn't list the stream.

[1] - https://www.youtube.com/c/Gitlab/live
jsingleton about 8 years ago
TIL GitLab runs on Azure. If your CI servers or deployment targets are also on Azure then the latency should be pretty low (assuming you pick the correct region). Good to know.

I moved from AWS to Azure years ago, mainly because I run mostly .NET workloads and the support is better. I've recently done some .NET stuff on AWS again and am remembering why I switched.
AlexCoventry about 8 years ago
Thank you for this informative postmortem and mitigation outline.

Are any organizational changes planned in response to the development friction which led to the outage? It seems to have arisen from long-standing operational issues, and an analysis of how prior attempts to address those issues got bogged down would be very interesting.
oli5679 over 8 years ago
I found this entertaining, even if they did later admit that it was a hoax:

http://serverfault.com/questions/587102/monday-morning-mistake-sudo-rm-rf-no-preserve-root
tschellenbach over 8 years ago
Shouldn't the conclusion of this post mortem be a move to a managed database service like RDS? The database doesn't sound huge, and RDS is affordable enough; it sounds to me like you'd spend less money and have better uptime and sleep by moving away from this in-house solution.
khazhou about 8 years ago
Every internal Ops manual needs to begin with the simple phrase:

DON'T PANIC
grhmc over 8 years ago
Thank you for redacting who the engineer was. Great write-up. Thank you!
encoderer about 8 years ago
If you want to up your cron job monitoring game, there's a link in my profile.
dustinmoris about 8 years ago
Watching GitLab is somewhat painful. I feel like they make every possible mistake you could make as an IT startup, and because they are transparent about it people seem to love the fact that they screw up all the time. I don't know if I share the same mentality, because at the end of the day I don't trust GitLab even with the simplest task, let alone any valuable work of mine.

It's good to be humble, know that mistakes can happen to anyone, learn from them, etc. But when, in 2017, you still make the same stupid mistakes that people have made a million times since 1990 - mistakes that are well documented and that there are systems built to avoid - then I just think it cannot be described as anything other than absolute stupidity and incompetence.

I know they have many fans who look past every mistake, no matter how bad, only because they are open about it, but come on, this is just taking the piss now, no?
cookiecaper about 8 years ago
I really hate to pile on, but after reading through this whole thread and the whole post-mortem, there are a few basic things that are troubling besides the widely-acknowledged backup methodology. I don't see issues directly related to addressing these things.

1. Notifications go through regular email. Email should be only one channel used to dispatch notifications of infrastructure events. Tools like VictorOps or PagerDuty should be employed as notification brokers/coordinators, and notifications should go to email, team chat, and phone/SMS if severity warrants, with an attached escalation policy so that it doesn't all hinge on one guy's phone not being dead.

2. There was a single database, whose performance problems had impacted production multiple times before (the post lists 4 incidents). One such performance problem was contributing to breakage at this very moment. I understand that was the thing that was trying to be fixed here, but what process allowed this to cause 4 outages over the preceding year without moving to the top of the list of things to address? Wouldn't it be wise to tweak the PgSQL configuration and/or upgrade the server before trying to integrate the hot standby to serve some read-only queries? And since a hot standby can only service reads (and afaik this is not a well-supported option in PgSQL), wouldn't most of the performance issues, which appear write-related, remain? The process seriously needs to be reviewed here.

And am I reading this right, that the one and only production DB server was restarted to change a configuration value in order to try to make pg_basebackup work? What impact did that have on the people trying to use the site a) while the database was restarting, and b) while the kernel settings were tweaked to accommodate the too-high max_connections value? Is it normal for GitLab to cause intermittent, few-minute downtimes like that? Or did that occur while the site was already down?

3. Spam reports can cause mass hard deletion of user data? Has this happened to other users? The target in this instance was a GitLab employee. Who has been trolled this way such that performance wasn't impacted? What's the remedy for wrongly-targeted persons? It's clear that backups of this data are not available. And is the GitLab employee's data gone now too? How could something so insufficient have been released to the public, and how can you disclose this apparently-unresolved vulnerability? By so doing, you're challenging the public to come and try to empty your database. Good thing you're surely taking good backups now! (We're going to glance over the fact that GitLab just told everyone its logical DB backups are 3 days behind and that we shouldn't worry because LVM snapshots now occur hourly, and that it only takes 16 hours to transfer LVM snapshots between environments :) )

4. The PgSQL master deleted its WALs within 4 hours of the replica "beginning to lag" (<interrobang here>). That really needs to be fixed. Again, you probably need a serious upgrade to your PgSQL server, because it apparently doesn't have enough space to hold more than a couple of hours of WALs (unless this was just a naive misconfiguration of the [min|max]_wal_size parameter, like the max_connections parameter?). I understand that transaction logs can get very large, but the disk needs to accommodate them (usually a second disk array is used for WALs to ease write impact), and replication lag needs to be monitored and alarmed on.

There were a few other things (including someone else downthread who pointed out that your CEO re-revealed your DB's hostnames in this write-up, and that they're resolvable via public DNS and have sshd running on port 22), but these are the big standouts for me.

P.S. A bonus point, just speculative:

Not sure how fast your disks were, but 300 GB gone in "a few seconds" sounds like a stretch. Some data may have been recoverable with some disk forensics. Especially if your Postgres server was running at the time of the deletion, some data and file descriptors also likely could have been extracted from system memory. Linux doesn't actually delete files if another process is holding their handle open; you can go into the /proc virtual filesystem and grab the file descriptor again to re-dump the files to live disk locations. Since your database was 400 GB and too big to keep 100% in RAM, this probably wouldn't have been a full recovery, but it may have been able to provide a partial one.

The theoretically best thing to do in such a situation would probably be to unplug the machine ASAP after ^C (without going through formal shutdown processes that may try to "clean up" unfinished disk work), remove the disk, attach it to a machine with a write blocker, and take a full-disk image for forensics purposes. This would maximize the ability to extract any data that the system was unable to eat/destroy.

In theory, I believe pulling the plug while a process kept the file descriptor open should keep you in reasonably good shape, as far as that goes after you've accidentally deleted 3/4 of your production database. The process never closes, the disk stops, and the contents remain on disk, just pending unlink when the OS stops the process (this is one reason why it'd be important to block writes to the disk and be extremely careful while mounting; if the journal plays back, it may destroy these files on the next boot anyway). But someone more familiar with the FS internals would have to say definitively whether it works that way or not.

I recognize that such speculative/experimental recovery measures may have been intentionally forgone, since they're labor intensive, may have delayed the overall recovery, and very possibly wouldn't have returned useful data anyway. Mentioning it mainly as an option to remain aware of.
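For illustration, a minimal sketch of the /proc recovery trick mentioned in the comment above: while a process still holds a deleted file open, its contents can be copied back out of /proc/<pid>/fd. The PID and destination path are hypothetical, and this only works while the holding process is alive.

```python
# Sketch of the /proc recovery trick: if a process (e.g. a postgres backend)
# still holds a deleted file open, its data can be copied back out of
# /proc/<pid>/fd before the process exits. Illustrative only.
import os
import shutil

def find_deleted_open_files(pid: int):
    """Yield (fd_path, original_name) for deleted files still open in `pid`."""
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        fd_path = os.path.join(fd_dir, fd)
        try:
            target = os.readlink(fd_path)  # e.g. "/var/.../base/16384/12345 (deleted)"
        except OSError:
            continue
        if target.endswith(" (deleted)"):
            yield fd_path, target[: -len(" (deleted)")]

def salvage(pid: int, dest_dir: str) -> None:
    """Copy still-open deleted files to dest_dir (ideally on a different disk)."""
    os.makedirs(dest_dir, exist_ok=True)
    for fd_path, original in find_deleted_open_files(pid):
        out = os.path.join(dest_dir, os.path.basename(original))
        with open(fd_path, "rb") as src, open(out, "wb") as dst:
            shutil.copyfileobj(src, dst)
        print(f"recovered {original} -> {out}")
```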
NPegasus over 8 years ago
> Root Cause Analysis
> [...]
> [List of technical problems]

No, the root cause is that you have no senior engineers who have been through this before. A collection of distributed remote employees, none of whom has enough experience to know any of the items on the list of "Basic Knowledge Needed to Run a Website at Scale" that you present as the root causes. $30 million in funding and still running the company like a hobby project among college roommates.

Mark my words: the board members from the VC firms will be removed by the VC partners for letting the kids run the show. Then the VC firms will put an experienced CEO and CTO in place to clean up the mess and get the company on track. Unfortunately, they will probably have wasted a couple of years and be down to their last million dollars before they take action.
EnFinlay over 8 years ago
Most destructive troll ever.