I've said it before and I'll say it again: expiring certs are the new DNS for outages.<p>I still marvel at just how good Tailscale is. I'm a minor user really, but I have two sites that I use Tailscale to access: a couple of on-prem servers and my AWS production setup.<p>I can literally work from anywhere - had an issue over the weekend where I was trying to deploy an ECS container, but the local wifi was so slow that the deploy kept timing out.<p>I simply SSH'd over to my on-prem development machine, pulled the latest code, and ran the deploy from there. All while remaining secure with no open ports at all on my on-prem system and none in AWS. I can even test against the production Aurora database without any open ports on it; I simply run a Tailscale agent in AWS on a nano-sized EC2 instance.<p>Got another developer you need to give access to your network? Tailscale makes that trivial (as it does revoking them).<p>Yeah, for that deployment I could just make a GitHub Action or something and avoid the perils of terrible internet, but for this I like to do it manually, and Tailscale lets me do just that.
Expiring certs strike again!<p>I'd recommend, as part of the post-mortem, moving their install script off their marketing site or putting in some other fallback, so that marketing-site activity is out of the critical path for customer operations. They're almost there in terms of maintaining that kind of typical isolation, which helps, because this kind of thing is common.<p>We track uptime of our various providers, and seeing bits like the GitHub or Zendesk sites go down is more common than we expected... and they're the good cases.
They made the same mistake we did at a former company — put a link to our webapp’s login page (app.foo.com) on the marketing site (www.foo.com) homepage.<p>It wasn’t until our first marketing site outage that we realised that our $40/mo hosting plan was not merely hosting a “marketing site” but rather critical infrastructure. That was a load-bearing $40 hosting plan. Our app wasn’t down, but the users thought it was.<p>I learned then that users follow the trails you make for them without realising there are others, and if you take one away, a segment of your user base will be completely lost.
I really like these guys; I just wish their pricing wasn't so ridiculous. Proper access control shouldn't cost 18 bucks a month for a VPN - it's basically unsellable to management at that price, and the lower tiers are unsellable without it.
I wonder what provider they use for their website. Sounds like a lot of hoops to jump through for IPv6 when just about any other provider has IPv6 support.
Wow, mad jelly that their CI/CD and monitoring processes are robust enough to trust a major rollout in December. That's a pretty badass eng culture.<p>That being said, there are still some unanswered questions:<p>- If the issue was IPv6 configuration breaking automated cert renewals for IPv4, wouldn't they have hit this, like, a long time ago? Did I miss something here?<p>- Why did this take 90 minutes to resolve? I know it's a blog post and not a real post-mortem, but some kind of timeline would have been nice to include.<p>- Why not move to a DNS provider that natively supports IPv6?<p>Also, I'm curious whether it's worth the overhead to have a dedicated domain for scripts/packages. Do other folks do this? (Excluding third parties like package repositories.)
Why does the proxy need to terminate TLS? If it were just a TCP proxy, then at least the monitoring wouldn't have been fooled into thinking the certificate wasn't about to expire.<p>Heck, a TCP proxy might even allow automatic renewal to work if the domain validation is being done using a TLS-ALPN challenge.
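For illustration, here's a minimal sketch of that idea in Python: a dumb TCP relay that never terminates TLS, so probers and clients see the origin's own certificate (and its real expiry). The hostname is a placeholder, not Tailscale's actual setup.

  # Minimal TCP pass-through relay: no TLS termination, so the origin's own
  # certificate (and its expiry date) is what clients and probers actually see.
  # BACKEND_HOST is a placeholder, not Tailscale's real origin.
  import asyncio

  BACKEND_HOST = "origin.internal.example"
  BACKEND_PORT = 443

  async def pipe(reader, writer):
      try:
          while data := await reader.read(65536):
              writer.write(data)
              await writer.drain()
      finally:
          writer.close()

  async def handle(client_reader, client_writer):
      backend_reader, backend_writer = await asyncio.open_connection(BACKEND_HOST, BACKEND_PORT)
      # Shovel bytes in both directions; the TLS handshake (including any
      # TLS-ALPN-01 validation) passes through untouched.
      await asyncio.gather(
          pipe(client_reader, backend_writer),
          pipe(backend_reader, client_writer),
      )

  async def main():
      # Binding "::" usually gives a dual-stack listener on Linux; listening
      # on port 443 needs elevated privileges.
      server = await asyncio.start_server(handle, "::", 443)
      async with server:
          await server.serve_forever()

  if __name__ == "__main__":
      asyncio.run(main())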
Anything even remotely security-adjacent that Tailscale as an institution fumbles, even once, is too dangerous for the merely mildly paranoid (like me, for example).<p>We need a better story on this.
They have monitoring for their infrastructure, right? Add 50 lines of code that connects to all public domains over IPv4 and IPv6 and alerts if a cert expires in under 19 days. Set automatic renewal to happen 20 days out. Done.
I wrote this code years ago, after missing a couple of SSL renewals in the early days of our small company. Haven’t had an SSL-related outage since.<p>Edit: this is the only necessary fix, no need for calendar invites:<p>> We also plan to update our prober infrastructure to check IPv4 and IPv6 endpoints separately.
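For what it's worth, a check like that really is only a few dozen lines. A rough sketch in Python, probing IPv4 and IPv6 separately (the domains and the 19-day threshold are just examples pulled from this thread, not anyone's real inventory or policy):

  # Check cert expiry over IPv4 and IPv6 separately and alert on anything
  # close to expiring. Domains below are examples, not a real inventory.
  import socket, ssl
  from datetime import datetime, timezone

  DOMAINS = ["pkgs.tailscale.com", "login.tailscale.com"]
  THRESHOLD_DAYS = 19

  def days_left(host, family):
      # Resolve only the requested address family, then do a normal TLS
      # handshake against that address and read the certificate's notAfter.
      addr = socket.getaddrinfo(host, 443, family, socket.SOCK_STREAM)[0][4]
      ctx = ssl.create_default_context()
      with socket.socket(family, socket.SOCK_STREAM) as raw:
          raw.settimeout(10)
          raw.connect(addr)
          with ctx.wrap_socket(raw, server_hostname=host) as tls:
              not_after = tls.getpeercert()["notAfter"]
      expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
      return (expires - datetime.now(timezone.utc)).days

  for host in DOMAINS:
      for family, label in [(socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")]:
          try:
              remaining = days_left(host, family)
          except OSError as exc:
              print(f"ALERT {host} over {label}: check failed ({exc})")
              continue
          if remaining < THRESHOLD_DAYS:
              print(f"ALERT {host} over {label}: cert expires in {remaining} days")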
> That arrangement is deemed a “misconfiguration” by that provider, and we’ve been receiving alerts about it since rolling it out<p>So 90 days of alerts about the certs, and then the certs fail anyway?
“That means the root issue with renewal is still a problem, and we plan to address it in the short term much like our ancestors did: multiple redundant calendar alerts and a designated window to manually renew the certificates ourselves”.
The conclusion is hilarious: "we plan to address it in the short term much like our ancestors did: multiple redundant calendar alerts and a designated window to manually renew the certificates ourselves"<p>DevOps is so 2023. Back to ops!
Two ideas for discussion.<p>Certificate Transparency is used to account for maliciously or mistakenly issued certificates. Perhaps it could also be used to assert the unavailability of correctly issued but obsolete certificates that are believed to be purged but actually aren't. (Services like KeyChest might already do this.)<p>Let's Encrypt is a miracle compared to the expensive pain of getting a cert 20 years ago. Rather than resting on laurels, would there be any benefit to renewing even more frequently, like daily? This might have confined the Tailscale incident to a quick "oops!" while the provider migration was still underway and being actively watched.
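On the first idea, here's a rough sketch of what a CT-driven audit could look like, using crt.sh's JSON output (the field names and timestamp format are my assumptions about that interface): list every logged certificate for a name that hasn't expired yet, so anything you believed was retired but is still valid stands out.

  # Sketch: pull CT log entries for a domain via crt.sh's JSON interface and
  # list certificates that have not yet expired. The field names and timestamp
  # format below are assumptions about that interface, not guaranteed.
  import json, urllib.request
  from datetime import datetime, timezone

  DOMAIN = "pkgs.tailscale.com"  # example target

  url = f"https://crt.sh/?q={DOMAIN}&output=json"
  with urllib.request.urlopen(url, timeout=30) as resp:
      entries = json.load(resp)

  now = datetime.now(timezone.utc)
  for entry in entries:
      not_after = datetime.strptime(entry["not_after"], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
      if not_after > now:
          print(f'still valid: {entry["common_name"]} '
                f'(crt.sh id {entry["id"]}, expires {entry["not_after"]})')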