TechEcho
The Consul outage that never happened

100 points by dankohn1 over 5 years ago

8 comments

zedpm over 5 years ago
> There is just no substitute for understanding how everything works.

Great line, and one I strongly agree with. I love the tooling we have available today, like Terraform, Ansible, etc., but the experience I gained by keeping bare metal servers alive and happy with nothing more than a shell has undoubtedly made me a much better admin.
awinder over 5 years ago
This was a fascinating read, but I feel like I’m left with some more questions about what’s going on under the hood in GitLab’s cloud:

1. I thought GitLab was hosted in Google Cloud, but there are a lot of references to, e.g., a hand-rolled consensus system and self-managed database clusters. I’m wondering if this event changes the math on build vs. buy at all for GitLab; it sounds like a lot of money has gone into this solution. How did that solution come about? Is it about some specific Postgres features that aren’t available in Google’s hosted DBs, or about pricing?

2. Again on the Google Cloud angle, why are servers being hand-managed and rebooted? Elasticity in the cloud would make me think that the safest option would be to stand up parallel infrastructure (like in a DR plan) and migrate traffic. Was this just about speed of solution rollout? Does GitLab have plans to harden its DR plans so that it can execute them in cases like this? Whenever someone says they’re “in the cloud” and yet unable to treat servers like cattle, I get a bit worried.
org3432 over 5 years ago
> After looking everywhere, and asking everyone on the team, we got the definitive answer that the CA key we created a year ago for this self-signed certificate had been lost.

The GitLab outages always make the company seem disorganized and sloppy, and unable to reflect on how to improve how they work. So they don't have a central place to store their CA, and even after an outage, did they improve anything about how they work?

It's ironic that the post seems geared towards recruiting, though I guess it's honest: you know what you're getting into with that team.
caleblloyd over 5 years ago
> It is maintained by the Infrastructure group, which currently consists of 20 to 24 engineers (depending on how you count)

20 if you count in Base-12 and 24 if you count in Base-10?
kitotik over 5 years ago
It blows my mind they didn’t have sane PKI with that many resources. It seems like even the “small” initial team of a couple of devs, a manager, *and a director* would’ve at least spun up a Vault instance to use as a CA.

Also, easy for me to say from the peanut gallery, but I don’t understand why they couldn’t have done rolling Consul restarts to update the configs. I’ve done this many times on Consul clusters.
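
For readers unfamiliar with the term, a rolling restart in general can be sketched as below. This is a minimal illustration, not GitLab's procedure and not a Consul API: `restart` and `is_healthy` are hypothetical callables the caller supplies, and whether this approach would have worked for the TLS change in the article is not established here.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart: Callable[[str], None],      # hypothetical: restarts one node
    is_healthy: Callable[[str], bool],   # hypothetical: True once the node has rejoined
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
) -> List[str]:
    """Restart nodes one at a time, waiting for each to recover before moving on."""
    done: List[str] = []
    for node in nodes:
        restart(node)
        deadline = time.monotonic() + timeout_s
        # Do not touch the next node until this one reports healthy again.
        while not is_healthy(node):
            if time.monotonic() >= deadline:
                raise RuntimeError(f"{node} did not recover; stopped after {done}")
            time.sleep(poll_s)
        done.append(node)
    return done
```

The point of the design is that at most one node is down at any moment, and the roll halts at the first node that fails to come back instead of continuing through the cluster.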
pronoiac over 5 years ago
Consul issues have bitten me at two companies, and I heard word of it being the culprit for some serious outages elsewhere. One possible takeaway here is to remove it.
a2tech over 5 years ago
If they were going through all this trouble and worry, why not create a new CA and drop the certs from it on the hosts? That’s the work of just a few minutes (plus some bash scripting to mass-generate your host certs). If they had already accepted that they were going to restart the services on all the hosts anyway, it would have saved them from having to restart all the services again in the future when they need to drop more certs.
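
The mass generation mentioned above could look roughly like the following sketch, written in Python instead of bash. It only builds `openssl` command lines and does not run them. The file names, the CA subject `/CN=example-internal-ca`, the key sizes, and the validity period are all assumptions for illustration, and the sketch omits subject alternative names, which many TLS clients require.

```python
from typing import List


def ca_command(ca_key: str = "ca.key", ca_cert: str = "ca.crt", days: int = 365) -> List[str]:
    """Command that creates a new self-signed CA key and certificate."""
    return [
        "openssl", "req", "-x509", "-newkey", "rsa:4096", "-nodes",
        "-keyout", ca_key, "-out", ca_cert, "-days", str(days),
        "-subj", "/CN=example-internal-ca",  # assumed subject, illustration only
    ]


def host_commands(host: str, ca_key: str = "ca.key", ca_cert: str = "ca.crt",
                  days: int = 365) -> List[List[str]]:
    """Commands that create a key and CSR for one host, then sign the CSR with the CA."""
    csr = [
        "openssl", "req", "-new", "-newkey", "rsa:2048", "-nodes",
        "-keyout", f"{host}.key", "-out", f"{host}.csr", "-subj", f"/CN={host}",
    ]
    sign = [
        "openssl", "x509", "-req", "-in", f"{host}.csr",
        "-CA", ca_cert, "-CAkey", ca_key, "-CAcreateserial",
        "-out", f"{host}.crt", "-days", str(days),
    ]
    return [csr, sign]


def all_commands(hosts: List[str]) -> List[List[str]]:
    """CA creation first, then the per-host commands in order."""
    commands = [ca_command()]
    for host in hosts:
        commands.extend(host_commands(host))
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with `openssl` installed. Distributing the resulting files to the hosts is a separate step not shown here.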
siscia over 5 years ago
Maybe I am saying something stupid, but infrastructure services should be able to use a dynamic set of keys.

If the first doesn't work, you try the second, and then the third, and so on.

Similarly for the clients: we should be able to add certificates dynamically.

Our own key expiring while all our services are about to drop their connections seems like something that should not happen.
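
The "try the first, then the second" idea can be sketched as follows. This is an illustration only: it uses HMAC signatures from the Python standard library as a stand-in for TLS certificates, and the function names `sign` and `verify_with_any` are hypothetical.

```python
import hashlib
import hmac
from typing import Optional, Sequence


def sign(message: bytes, key: bytes) -> bytes:
    """Sign a message with one key (HMAC-SHA256)."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_with_any(message: bytes, signature: bytes,
                    keys: Sequence[bytes]) -> Optional[int]:
    """Try each trusted key in order; return the index of the first match, else None."""
    for index, key in enumerate(keys):
        if hmac.compare_digest(sign(message, key), signature):
            return index
    return None


# During a rotation the verifier trusts both the new and the old key,
# so peers still signing with the old key keep working.
trusted = [b"new-key", b"old-key"]
old_signature = sign(b"hello", b"old-key")
matched = verify_with_any(b"hello", old_signature, trusted)
```

Once every peer has switched to the new key, the old one can be dropped from the trusted list without a simultaneous restart of everything.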