Fly.io Status – Consul cluster outage

126 points by purututu about 2 years ago

19 comments

mrkurt about 2 years ago

This has been a rough week, and I'm sorry we broke people's apps. We had a big Nomad outage on Monday, and then a suspiciously similar Consul outage today. Both tipped over faster than we could detect and mitigate, and we ended up having to do serious surgery to build entirely new Consul/Nomad clusters.

There's nothing to brag about here, I just wanted to let y'all know we're listening (even when things aren't on the HN front page).

suryao about 2 years ago

Fly is building everything in hard mode, since they are not layering on top of an existing cloud like pretty much everyone else (Heroku, Render, Railway, ...).

It's either very smart (if they pull it off) because they will have a ginormous cost advantage, or they fail.

I'm personally of the opinion that the UX on top of AWS/GCP/... is worse than a doo-doo in a shoe. However, they are as stable as can be (all complex systems go down once in a while). There are very few mature projects that do not rely on AWS/GCP/... managed services anyway. Might as well put in the little bit of effort to set yourself up for the future instead of painful migrations. This obviously doesn't hold for hobby projects.

In any case, I have a lot of respect for the engineering that Fly does. Kudos.

luhn about 2 years ago

Relevant: "Reliability: It's not great" from last week https://news.ycombinator.com/item?id=35044516

They even specifically call out Consul as a source of trouble.

> We propagate app instance and health information across all our regions. That’s how our proxies know where to route requests, and how our DNS servers know what names to give out.

> We started out using HashiCorp Consul for this. But we were shoehorning Consul, which has a centralized server model design for individual data center deployments, into a global service discovery role it wasn’t suited for. The result: continuously stale data, a proxy that would route to old expired interfaces, and private DNS that would routinely have stale entries.

markthethomas about 2 years ago

Been a fan of Fly and have had most, if not all, of my side and semi-side projects on there for some time now. But... the ratio of good/fun/snarky blog posts to reliable service has gotten a bit too large for me, and I'm starting to look for other providers at this point just in case they can't turn this trend around. Honestly been a good object lesson for me in the importance of backing up marketing/hype/"mind-share" stuff w/ absolute rock-solid performance/reliability or just forgoing the former for the latter.

As an aside, it's also taking down some decently-load-bearing web infra like unpkg => https://www.unpkg.com/

See also https://community.fly.io/t/app-went-dead/11397/60

pawelduda about 2 years ago

I really really wanted to like and recommend fly.io but I wouldn't risk deploying anything more than a side project to tinker with, given how many random issues I encountered in a relatively short development time. It was a simple Phoenix app which made me wonder "am I doing things totally wrong?" quite a few times, after exhausting all info sources. But when I tried the same process the next day, it would deploy just fine. Plus the outages that appear to be getting more frequent don't make me optimistic.

At least they're transparent about their issues, gotta give them that. I still kinda root for them, maybe they'll make a comeback.

drewbug01 about 2 years ago

I love this update:

“We are working to build a new Consul cluster with 10x the RAM. We aren't yet sure, but believe a routine DNS change might have created a thundering herd problem causing Consul servers to immediately increase RAM usage by 500%. This is not ideal.”

_This is not ideal._
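For anyone unfamiliar with the term: a thundering herd is what happens when many clients react to the same event (here, possibly a DNS change) by hitting the same servers at the same instant. One common general mitigation is exponential backoff with random jitter, so retries spread out over time instead of arriving in lockstep. Below is a minimal Python sketch of that idea. It is not Fly.io's or Consul's actual fix, and `backoff_delays`, `base`, and `cap` are hypothetical names chosen only for illustration.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Uses "full jitter": each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that fail at the
    same moment are unlikely to retry at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Upper bound doubles each attempt, but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Two hypothetical clients that failed at the same time will
    # usually end up with different retry schedules.
    print(backoff_delays(5))
    print(backoff_delays(5))
```

A client would sleep for each delay in turn before retrying. The cap keeps the worst-case wait bounded, and the jitter is what actually breaks up the herd.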

gzer0 about 2 years ago

Interestingly, Roblox went down for 73 hours due to a "unique" issue with Consul as well [1].

Great read on how the issue was approached, handled, and ultimately remediated.

[1] https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/

felixding about 2 years ago

Was affected by the outage. Didn't know about it so I thought it was just another crash on Fly.io.

Tried to restart our app from the command line, only to be told they had disabled the API. And there is no restart feature on their dashboard. So all I could do was watch the Fly.io logs telling me that our apps were down.

Sigh.

We moved from Heroku to Fly.io only this January, and are already considering moving away from it. The reliability is miserable at best. And so many basic features are missing. Yes, it's much cheaper than Heroku, but we ended up spending much more time/resources/money dealing with its glitches. That defeats the purpose of using a PaaS in the first place.

satvikpendem about 2 years ago

At this point I'm not sure why one wouldn't use something like Hetzner and slap Coolify or Dokku or something else on it.

kbumsik about 2 years ago

I have seen some issues around Consul these days.

As a person with no background in distributed systems, I am wondering why people choose Consul over alternatives. Are there features that etcd doesn't offer?

throwaway3838g about 2 years ago

I attempted to deploy a simple app on Fly a couple of weeks ago, but porting it from heroku became a nightmare, servers crashing, cryptic error messages, etc. Maybe I'm in the minority but in any case my experience with Fly definitely left me questioning the hype around it.

HL33tibCe7 about 2 years ago

Respect to anybody who is an SRE at fly.io. Couldn’t pay me enough to do that job

sergiomattei about 2 years ago

I’m rooting for Fly. I use them myself for a project, and love the service.

However, their transparency into outages and service rough edges is a double-edged sword: they’re building a reputation for unreliable software. It’s a shame to see this major outage happen right after last week’s post; it almost confirms the stereotype.

Still, even with these flaws, I think they’re building the best hosting out there. They’re taking bold risks and doing what others aren’t. I wish them the best.

pm90 about 2 years ago

> We are working to build a new Consul cluster with 10x the RAM.

Oh boy. I wouldn’t wanna be the people doing this. Working with infrastructure is hard. Doing it under tight SLAs? Ugh. I really hope the people working on this are being well supported.

js4ever about 2 years ago

That's the issue with centralized infra... I expect it to be less and less stable the more customers they have. I still wish them good luck.

On my side I took the opposite direction: each workload is shared-nothing.

Thaxll about 2 years ago

They seem to have a lot of issues with Consul, is it the design of Consul or the way they use it that is the problem?

simonw about 2 years ago

"This impacts queries to our API, including creating and modifying apps, as well as incoming network requests for recently deployed apps."

Would be really interested to understand why it affects recently deployed apps but not apps that are already established - something to do with how the Fly Router works?

pa7ch about 2 years ago

From my experience etcd would have been a better choice for maturity if they don't need the gossip stuff.

beoberha about 2 years ago

This shit is hard. Running a cloud service at one of the Big 3 is hard, I can’t imagine doing it with such a small team with your own infra.