I've said it before and I'll say it again: expiring certs are the new DNS for outages.<p>I still marvel at just how good Tailscale is. I'm a minor user really, but I have two sites that I use Tailscale to access: a couple of on-prem servers and my AWS production setup.<p>I can literally work from anywhere - had an issue over the weekend where I was trying to deploy an ECS container, but the local wifi was so slow that the deploy kept timing out.<p>I simply SSH'd over to my on-prem development machine, pulled the latest code, and ran the deploy from there. All while remaining secure with no open ports at all on my on-prem system and none in AWS. I can even test against the production Aurora database without any open ports on it; I simply run a Tailscale agent in AWS on a nano-sized EC2 instance.<p>Got another developer you need to give access to your network? Tailscale makes that trivial (as it does revoking them).<p>Yeah, for that deployment I could just make a GitHub Action or something and avoid the perils of terrible internet, but for this I like to do it manually, and Tailscale lets me do just that.
Expiring certs strike again!<p>I'd recommend, as part of the post-mortem, moving their install script off their marketing site or putting in some other fallback, so that marketing-site activity is out of the critical path for customer operations. They're almost there in terms of maintaining that kind of typical isolation, which helps, because this kind of thing is common.<p>We track uptime of our various providers, and seeing bits like the GitHub or Zendesk sites go down is more common than we expected... and they're the good cases.
They made the same mistake we did at a former company — put a link to our webapp’s login page (app.foo.com) on the marketing site (www.foo.com) homepage.<p>It wasn’t until our first marketing site outage that we realised that our $40/mo hosting plan was not merely hosting a “marketing site” but rather critical infrastructure. That was a load-bearing $40 hosting plan. Our app wasn’t down, but the users thought it was.<p>I learned then that users follow the trails you make for them without realising there are others, and if you take one away, a segment of your user base will be completely lost.
I really like these guys; I just wish their pricing wasn't so ridiculous. Proper access control shouldn't cost 18 bucks a month for a VPN - it's basically unsellable to management at that price, and the lower tiers are unsellable without it.
I wonder what provider they use for their website. Sounds like a lot of hoops to jump through for IPv6 when just about any other provider has IPv6 support.
Wow, mad jelly that their CI/CD and monitoring processes are robust enough to trust a major rollout in December. That's a pretty badass eng culture.<p>That being said, there are still some unanswered questions:<p>- If the issue was IPv6 configuration breaking automated cert renewals for IPv4, wouldn't they have hit this, like, a long time ago? Did I miss something here?<p>- Why did this take 90 minutes to resolve? I know it's a blog post and not a real post-mortem, but some kind of timeline would have been nice to include.<p>- Why not move to a DNS provider that natively supports IPv6?<p>Also, I'm curious whether it's worth the overhead to have a dedicated domain for scripts/packages. Do other folks do this? (Excluding third parties like package repositories.)
Why does the proxy need to terminate TLS? If it were just a TCP proxy, then at least the monitoring wouldn't have been fooled into thinking the certificate wasn't about to expire.<p>Heck, a TCP proxy might even allow automatic renewal to work if the domain validation is being done using a TLS-ALPN challenge.
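For illustration, here's a minimal sketch of that idea in Python: a dumb TCP relay that never terminates TLS, so probers and clients see the origin's own certificate (and its real expiry). The hostname is a placeholder, not Tailscale's actual setup.

  # Minimal TCP pass-through relay: no TLS termination, so the origin's own
  # certificate (and its expiry date) is what clients and probers actually see.
  # BACKEND_HOST is a placeholder, not Tailscale's real origin.
  import asyncio

  BACKEND_HOST = "origin.internal.example"
  BACKEND_PORT = 443

  async def pipe(reader, writer):
      try:
          while data := await reader.read(65536):
              writer.write(data)
              await writer.drain()
      finally:
          writer.close()

  async def handle(client_reader, client_writer):
      backend_reader, backend_writer = await asyncio.open_connection(BACKEND_HOST, BACKEND_PORT)
      # Shovel bytes in both directions; the TLS handshake (including any
      # TLS-ALPN-01 validation) passes through untouched.
      await asyncio.gather(
          pipe(client_reader, backend_writer),
          pipe(backend_reader, client_writer),
      )

  async def main():
      # Binding "::" usually gives a dual-stack listener on Linux; listening
      # on port 443 needs elevated privileges.
      server = await asyncio.start_server(handle, "::", 443)
      async with server:
          await server.serve_forever()

  if __name__ == "__main__":
      asyncio.run(main())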
Anything even remotely security-adjacent that Tailscale as an institution fumbles, even once, is too dangerous for the merely mildly paranoid (like me, for example).<p>We need a better story on this.
They have monitoring for their infrastructure, right? Add 50 lines of code that connects to all public domains over IPv4 and IPv6 and alerts if a cert expires in under 19 days. Set automatic renewal to happen 20 days out. Done.
I wrote this code years ago, after missing a couple of SSL renewals in the early days of our small company. Haven’t had an SSL-related outage since.<p>Edit: this is the only necessary fix, no need for calendar invites:<p>> We also plan to update our prober infrastructure to check IPv4 and IPv6 endpoints separately.
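For what it's worth, a check like that really is only a few dozen lines. A rough sketch in Python, probing IPv4 and IPv6 separately (the domains and the 19-day threshold are just examples pulled from this thread, not anyone's real inventory or policy):

  # Check cert expiry over IPv4 and IPv6 separately and alert on anything
  # close to expiring. Domains below are examples, not a real inventory.
  import socket, ssl
  from datetime import datetime, timezone

  DOMAINS = ["pkgs.tailscale.com", "login.tailscale.com"]
  THRESHOLD_DAYS = 19

  def days_left(host, family):
      # Resolve only the requested address family, then do a normal TLS
      # handshake against that address and read the certificate's notAfter.
      addr = socket.getaddrinfo(host, 443, family, socket.SOCK_STREAM)[0][4]
      ctx = ssl.create_default_context()
      with socket.socket(family, socket.SOCK_STREAM) as raw:
          raw.settimeout(10)
          raw.connect(addr)
          with ctx.wrap_socket(raw, server_hostname=host) as tls:
              not_after = tls.getpeercert()["notAfter"]
      expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
      return (expires - datetime.now(timezone.utc)).days

  for host in DOMAINS:
      for family, label in [(socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")]:
          try:
              remaining = days_left(host, family)
          except OSError as exc:
              print(f"ALERT {host} over {label}: check failed ({exc})")
              continue
          if remaining < THRESHOLD_DAYS:
              print(f"ALERT {host} over {label}: cert expires in {remaining} days")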
> That arrangement is deemed a “misconfiguration” by that provider, and we’ve been receiving alerts about it since rolling it out<p>So 90 days of alerts about the certs, and then the certs fail anyway?
“That means the root issue with renewal is still a problem, and we plan to address it in the short term much like our ancestors did: multiple redundant calendar alerts and a designated window to manually renew the certificates ourselves”.
The conclusion is hilarious: "we plan to address it in the short term much like our ancestors did: multiple redundant calendar alerts and a designated window to manually renew the certificates ourselves"<p>DevOps is so 2023. Back to ops!
Two ideas for discussion.<p>Certificate Transparency is used to account for maliciously or mistakenly issued certificates. Perhaps it could also be used to assert the unavailability of correctly issued but obsolete certificates that are believed to be purged but actually aren't. (Services like KeyChest might already do this.)<p>Let's Encrypt is a miracle compared to the expensive pain of getting a cert 20 years ago. Rather than resting on laurels, would there be any benefit to renewing even more frequently, like daily? This might have confined the Tailscale incident to a quick "oops!" while the provider migration was still underway and being actively watched.
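On the first idea, here's a rough sketch of what a CT-driven audit could look like, using crt.sh's JSON output (the field names and timestamp format are my assumptions about that interface): list every logged certificate for a name that hasn't expired yet, so anything you believed was retired but is still valid stands out.

  # Sketch: pull CT log entries for a domain via crt.sh's JSON interface and
  # list certificates that have not yet expired. The field names and timestamp
  # format below are assumptions about that interface, not guaranteed.
  import json, urllib.request
  from datetime import datetime, timezone

  DOMAIN = "pkgs.tailscale.com"  # example target

  url = f"https://crt.sh/?q={DOMAIN}&output=json"
  with urllib.request.urlopen(url, timeout=30) as resp:
      entries = json.load(resp)

  now = datetime.now(timezone.utc)
  for entry in entries:
      not_after = datetime.strptime(entry["not_after"], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
      if not_after > now:
          print(f'still valid: {entry["common_name"]} '
                f'(crt.sh id {entry["id"]}, expires {entry["not_after"]})')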