Fly.io outage – resolved

243 points by punkpeye 6 months ago

27 comments

benhoyt 6 months ago

My fly.io-hosted website went down for 5 minutes (6 hours ago), but then came right back up, and has been up ever since. I use a free monitoring service that checks it every 5 minutes, so it's possible it missed another short bit of downtime. But fly.io has been pretty reliable overall for me!

jart 6 months ago

fly.io publishes their post-mortems here: https://fly.io/infra-log/

The last post-mortem they wrote is very interesting and full of details. Basically, back in 2016 the heart or keystone component of fly.io's production infrastructure was called consul, which is a highly secure TLS server that tracks shared state, and it requires that both the server certificate and the client certificate be authenticated. Since it was centralized, it had scaling issues, so fly.io wrote a replacement for it in 2020 called corrosion, and quickly forgot about consul, but didn't have the heart to kill it. Then in October 2024 consul's root key signing key expired, which brought down all connectivity, and since it uses bidirectional authentication, they couldn't bring it back online until they deployed new SSL certificates to every machine in their fleet. Somehow they did this in half an hour, but the chain of dominoes had already been set in motion to reveal other weaknesses in their infrastructure that they could eliminate. There was this other internal service whose own independent set of TLS keys had also expired long ago, but they didn't notice until they tried rebooting it as part of the consul rekey, since doing so severed the TCP connections it had established way back when its certificate was valid. Plus, the whole time this was happening, their logging tools were DDoSing their network provider. It took some real heroes to save the company and all their customers too when that many things explode at once.
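
An expired certificate that nobody was watching is the kind of failure a periodic expiry check can flag early. The following is a minimal sketch in Python, not Fly.io's tooling: the function names and the 30-day warning threshold are hypothetical, and it assumes the expiry timestamp is in the string format Python's `ssl` module reports (for example, the `notAfter` field of `SSLSocket.getpeercert()`).

```python
import ssl
import time
from typing import Optional

WARN_DAYS = 30  # hypothetical threshold; pick whatever lead time your rotation needs


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the days left before a certificate's notAfter timestamp.

    not_after uses the format reported by Python's ssl module,
    e.g. "Jan  1 00:00:00 2025 GMT". A negative result means expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def expiry_status(not_after: str, now: Optional[float] = None,
                  warn_days: int = WARN_DAYS) -> str:
    """Classify a certificate as "expired", "expiring soon", or "ok"."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "expired"
    if remaining < warn_days:
        return "expiring soon"
    return "ok"
```

Passing `now` explicitly keeps the check easy to test. A check like this only sees a certificate when it is actually inspected, so long-lived connections set up while a certificate was still valid would need the certificate files or endpoints checked separately.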

cryptos 6 months ago

Fly.io seems to be a bit of a mixed bag:

https://news.ycombinator.com/item?id=41917436
https://news.ycombinator.com/item?id=35044516
https://news.ycombinator.com/item?id=34742946
https://news.ycombinator.com/item?id=34229751

If a cloud platform doesn't really provide reliability, I'd say it's probably not worth it. You'd be better off just renting a (virtual) server and saving the cloud tax.

punkpeye 6 months ago

Contrary to the title of the post, the Fly.io API remains inaccessible. This means users still cannot access deploys, databases, etc.

For accurate updates, follow https://community.fly.io/t/fly-io-site-is-currently-inaccessible/22791/40

neya 6 months ago

Personal experience between Fly.io and Railway.com: Railway wins for me hands down. I have used both, and Railway's support is stellar too, in comparison. Fly.io has never responded to my query about data deletion to date, despite my emailing their support address.

I have had my Railway app online to date without any major downtime, too. I recommend anyone looking for a decent replacement to try them.

shubhamjain 6 months ago

This is probably the 5th or 6th major outage from Fly.io that I have personally seen. Pretty sure there were many others and some just went unnoticed. I recommended the service to a friend, and within two days he faced two outages.

Fly.io seriously needs to get it together. Why it hasn't happened yet is a mystery to me. They have a good product, but stability needs to be the absolute top priority for a hosting service. Everything else is secondary.

HellsMaddy 6 months ago

Suspiciously, Turso started having issues around the same time. Their CEO confirmed on Discord it's due to the Fly outage:

> Ok. I caught up with our oncall and This seems related to the Fly.io incident that is reported in our status page. Our login does call things in the Fly.io API

> we are already in touch with Fly and will see if we can speed this up

marvin-hansen 6 months ago

No surprise. About a year ago, I looked at fly.io because of its low pricing, and I was wondering where they were cutting corners to still make some money. Ultimately, I found the answer in their tech docs, where it was spelled out clearly that a Fly instance is hardwired to one physical server and thus cannot fail over if that server dies. Not sure if that part is still in the official documentation.

In practice, that means if a server goes down, they have to load the last snapshot of that instance from the backup, push it to a new server, update the network path, and pray to god that no more servers fail than there is spare capacity available. Otherwise you have to wait for a restore until the datacenter has mounted a few more boxes in the rack.

That goes a long way toward explaining the randomness of those outage reports, e.g. "my app is down" vs. "the other is fine", and "mine came back in 5 minutes" vs. "the other took forever".

As a business on a budget, I think anything else, e.g. a small civo cluster, serves you better.

xyst 6 months ago

A recurring pattern I notice is that outages tend to occur the week of major holidays in the US.

- MS 365/Teams/Exchange had a blip in the morning
- Fly.io with a complete outage
- then a handful of sites and services impacted due to those outages

I usually advocate against "change freezes", but I think a change freeze around major holidays makes sense. Give all teams a recharge/pause/whatever.

Don't put too much pressure on the B-squads that were unfortunate enough to draw the short straw.

akshayshah 6 months ago

The series of outages early in 2023 also had some Corrosion-related pain: https://community.fly.io/t/reliability-its-not-great/11253

arusahni 6 months ago

Oof, hugops to the team.

stevefan1999 6 months ago

Yep... can confirm my self-hosted Bitwarden there is completely FUBAR connection-wise even though it is in EA, so it should be a worldwide outage... lemme guess: some internal tooling error, consensus split brain, or did someone leak BGP routes again?

redslazer 6 months ago

fly.io just has the weirdest outages. It has issues so regularly that we don't even need to run mock outages to make sure our system failovers work.

teaearlgraycold 6 months ago

I'm grateful to HN for keeping me well aware of Fly's issues. I'll never use them.

punkpeye 6 months ago

It is not reflected in their status page, but fly.io itself is not even loading.

MaxfordAndSons 6 months ago

Kinda funny that they've named their global state store "Corrosion"... not really a word I'd associate with stability and persistence.

mattbee 6 months ago

It feels like fly is trying to repeat a growth model that worked 20 years ago: throw interesting toys at engineers, then wait for engineers to recommend their services as they move on in their careers.

Part of that playbook is the old Move Fast & Break Things. That can still be the right call for young projects, but it has two big problems:

1) AWS successfully moved themselves into the position of "safe" hosting choice, so it's much rarer for engineers to have influence on something that's seen by money men as a humdrum, solved problem;

2) engineers are not the internal influencers they used to be, being laid off left and right the last few years, and without time for hobby projects.

(maybe also 3) it's much harder to build a useful free tier on a hosting service, which used to be a necessary marketing expense to reach those engineers).

So idk, I feel like the bar is just higher for hosting stability than it used to be, and novelty is a much harder sell, even here. Or rather: if you're going to brag about reinventing so many wheels, they need to not come off the cart as often.

xyst 6 months ago

I can't even log in to my old account. Password reset is timing out, yet I still receive the password reset e-mail. The password reset link is broken, with a 500 status code.

DataOverload 6 months ago

We switched from Fly to CF Workers a while ago and never looked back.

gigapotential 6 months ago

HUGOPS

Everything is going to be 200 OK!

mrcwinn 6 months ago

I tried Fly early. I was very excited about this service, but I've never had a worse hosting experience. So I left. Coincidentally I tried it again a few days ago. Surely things must be better. Nope. Auth issues in the CLI, frustrations deploying a Docker app to a Fly machine. I wouldn't recommend it to anyone.

pier25 6 months ago

My apps on Fly have not gone down this time.

EGreg 6 months ago

What exactly does flyio.net do?

Huppie 6 months ago

It's interesting to see this discussion about fly.io's reliability on a day that (after over three days of downtime) Microsoft Azure finally decided the update of Azure Static Web Apps they deployed last Friday is indeed broken for customers using specific authentication settings...

...with not a single status update from Microsoft in sight.

theideaofcoffee 6 months ago

Color me not surprised. My few interactions with people there just gave off the impression of them being in a bit over their heads. I don't know how well that translated to their actual ops, but it's difficult not to connect the two when they continue to have major outage after major outage for a product that 'should' be their customers' bedrock upon which they build everything else.

travisgriggs 6 months ago

Don’t a bunch of Elixir/Erlang guys work at fly.io? It’s weird to me that that hallmark of reliability is associated with something that the public sees as unreliable. What gives with that association?

veggieWHITES 6 months ago

I was considering these guys the other day until I saw their pricing page: https://fly.io/pricing/

(There's not a single price on there, why even create the page?)
