This has been a rough week, and I'm sorry we broke people's apps. We had a big Nomad outage on Monday, and then a suspiciously similar Consul outage today. Both tipped over faster than we could detect and mitigate, and we ended up having to do serious surgery to build entirely new Consul/Nomad clusters.<p>There's nothing to brag about here; I just wanted to let y'all know we're listening (even when things aren't on the HN front page).
Fly is building everything in hard mode, since they are not layering on top of an existing cloud like pretty much everyone else (heroku, render, railway, ...).<p>Either it's very smart (if they pull it off), because they will have a ginormous cost advantage, or they fail.<p>I'm personally of the opinion that the ux on top of aws/gcp/... is worse than a doo-doo in a shoe. However, they are as stable as can be (all complex systems go down once in a while). There are very few mature projects that do not rely on aws/gcp/... managed services anyway.
Might as well put in the little bit of effort to set yourself up for the future instead of painful migrations. This obviously doesn't hold for hobby projects.<p>In any case, I have a <i>lot</i> of respect for the engineering that fly does. Kudos.
Relevant: "Reliability: It's not great" from last week <a href="https://news.ycombinator.com/item?id=35044516" rel="nofollow">https://news.ycombinator.com/item?id=35044516</a><p>They even specifically call out Consul as a source of trouble.<p>> We propagate app instance and health information across all our regions. That’s how our proxies know where to route requests, and how our DNS servers know what names to give out.<p>> We started out using HashiCorp Consul for this. But we were shoehorning Consul, which has a centralized server model design for individual data center deployments, into a global service discovery role it wasn’t suited for. The result: continuously stale data, a proxy that would route to old expired interfaces, and private DNS that would routinely have stale entries.
Been a fan of fly and have had most, if not all, of my side and semi-side projects on there for some time now. But...the ratio of good/fun/snarky blog posts to reliable service has gotten a bit too large for me, and I'm starting to look for other providers at this point just in case they can't turn this trend around. Honestly, it's been a good object lesson for me in the importance of backing up marketing/hype/"mind-share" stuff w/ absolute rock-solid performance/reliability, or just forgoing the former for the latter.<p>As an aside, it's also taking down some decently-load-bearing web infra like unpkg => <a href="https://www.unpkg.com/" rel="nofollow">https://www.unpkg.com/</a><p>see also <a href="https://community.fly.io/t/app-went-dead/11397/60">https://community.fly.io/t/app-went-dead/11397/60</a>
I really really wanted to like and recommend fly.io but I wouldn't risk deploying anything more than a side project to tinker with, given how many random issues I encountered in a relatively short development time. It was a simple Phoenix app which made me wonder "am I doing things totally wrong?" quite a few times, after exhausting all info sources.
But when I tried the same process the next day, it would deploy just fine. Plus the outages that appear to be getting more frequent don't make me optimistic.<p>At least they're transparent about their issues, gotta give them that. I still kinda root for them, maybe they'll make a comeback.
I love this update:<p>“We are working to build a new Consul cluster with 10x the RAM. We aren't yet sure, but believe a routine DNS change might have created a thundering herd problem causing Consul servers to immediately increase RAM usage by 500%. This is not ideal.”<p>_This is not ideal._
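For anyone unfamiliar with the "thundering herd" in that update: it is what happens when a large number of clients react to the same event at the same instant and hammer a shared server together. A common client-side mitigation is exponential backoff with random jitter. The sketch below is only an illustration of that general technique, not Fly's or Consul's actual fix; the function name `backoff_delays`, its parameters, and the 1000-client scenario are all made up for the example.

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=None):
    """Return "full jitter" retry delays in seconds.

    Delay n is drawn uniformly from [0, min(cap, base * 2**n)], so the
    retry window grows exponentially but each client picks a random
    point inside it. All parameter values here are illustrative.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]


# Hypothetical scenario: 1000 clients lose their connection at t=0.
# With a fixed retry interval they would all come back at the same
# instant; with jitter their first retries are spread over [0, base].
rng = random.Random(42)
first_retries = [backoff_delays(rng=rng)[0] for _ in range(1000)]
```

The design choice is that the randomness matters more than the exact schedule: spreading reconnects across a window turns one large spike into a steady trickle the server can absorb.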
Interestingly, Roblox went down for 73 hours due to a "unique" issue with Consul as well [1].<p>Great read on how the issue was approached, handled, and ultimately remediated.<p>[1] <a href="https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/" rel="nofollow">https://blog.roblox.com/2022/01/roblox-return-to-service-10-...</a>
Was affected by the outage. Didn't know about it so I thought it was just another crash on Fly.io.<p>Tried to restart our app from the command line, only to be told they had disabled the API. And there is no restart feature on their dashboard. So all I could do was watch the Fly.io logs telling me that our apps were down.<p>Sigh.<p>We moved from Heroku to Fly.io only this January, and are already considering moving away from it. The reliability is miserable at best. And so many basic features are missing. Yes, it's much cheaper than Heroku, but we ended up paying much more in time/resources/money dealing with its glitches. That defeats the purpose of using a PaaS in the first place.
I have seen some issues around Consul these days.<p>As a person with no background in distributed systems, I am wondering why people choose Consul over alternatives. Are there features that etcd doesn't offer?
I attempted to deploy a simple app on Fly a couple of weeks ago, but porting it from heroku became a nightmare: servers crashing, cryptic error messages, etc. Maybe I'm in the minority, but in any case my experience with Fly definitely left me questioning the hype around it.
I’m rooting for Fly. I use them myself for a project, and love the service.<p>However, their transparency into outages and service rough edges is a double-edged sword: they’re building a reputation for unreliable software. It’s a shame to see this major outage happen right after last week’s post; it almost confirms the stereotype.<p>Even with these flaws, though, I still think they’re building the best hosting out there. They’re taking bold risks and doing what others aren’t. I wish them the best.
> We are working to build a new Consul cluster with 10x the RAM.<p>Oh boy. I wouldn’t wanna be the people doing this. Working with infrastructure is <i>hard</i>. Doing it under tight SLAs? Ugh. I really hope the people working on this are being well supported.
That's the issue with centralized infra... I expect it to be less and less stable the more customers they have. I still wish them good luck.<p>On my side I took the opposite direction: each workload is shared-nothing.
"This impacts queries to our API, including creating and modifying apps, as well as incoming network requests for recently deployed apps."<p>Would be really interested to understand why it affects recently deployed apps but not apps that are already established - something to do with how the Fly Router works?