Quite beefy hardware for on-prem. Could someone explain to me why 30k users, even assuming they're all concurrent, would be an issue for hardware that size?<p>Is the app stack naturally resource-heavy, or is this setup particularly different from how an instance should be run?
Not sure why Ruby on Rails is taking a beating in the comments section here. The problem is clearly the 1Gbps network that is functioning at only 200Mbps and the worn-out/defective SSDs. Waiting around on IO all day will bring any stack to a crawl.
This is a superb write-up of an intense, exhausting situation. Great mixture of low-level detail and tactics, and high-level thinking about systems and people. Congratulations on managing that migration, and thank you for sharing this with us!
Confused as to why they didn't just replace the bad SSDs with good ones?<p>FWIW, this sounds to me like what happens when you use "retail" SSDs (drives marketed for use in consumer laptops) underneath a high-write-traffic application such as a relational database. Such drives will often either wear out, turn out to have pathological performance characteristics (they eventually do something akin to GC), or just have firmware bugs. Use enterprise-rated drives for an application like this.
Hetzner is great, but it may not be the best choice for a social network that hosts user content and may attract controversy.<p>As a mass-market hosting provider, Hetzner is subject to constant fraud, abuse and hacked customer servers, and in consequence, their abuse department is very trigger-happy and will usually shoot first and ask questions later. They can and will kick out customers that cause too much of a headache, regardless of their ToS.<p>Their outbound DDoS detection systems are very sensitive and prone to false positives, such as when you get attacked yourself and the TCP backscatter is considered a portscan. If the system is sufficiently confident that you are doing something bad, it automatically firewalls off your servers until you explain yourself.<p>Likewise, inbound abuse reports sometimes lead to automated or manual blocks before you can respond to them.<p>They have also rate-limited or blocked entire port ranges in the past to get rid of Chia miners and similar miscreants, with no regard for collateral damage to other services and without informing their other customers.<p>Their pricing is good and their service is otherwise excellent, and if you do get locked out, you can talk to actual humans to sort it out. But only after the damage is already done. If you use them, have a backup plan.
As someone who scaled Ruby on Rails in its prime era (2007-2009), I'll tell you the problems have not changed. It's very straightforward: horizontal scaling, then load balancing across multiple nodes. Load comes down to having enough cores, fast enough disks and enough egress bandwidth. Everything else is purely caching in front of a poorly performing Ruby web server and minimising disk or database reads.<p>The write-up is cool. Reminiscent of things we used to do back in that early Rails 2-3 era. Just funny we're back where we started.<p>TLDR: if you want to run Ruby on Rails on bare metal, be ready to run something with 8+ cores, 10k RPM disks minimum, and more bandwidth than you can support out of your basement.
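The "caching in front of a slow Ruby server to minimise database reads" pattern described above can be sketched in plain Ruby. This is a minimal read-through cache, with hypothetical class and key names standing in for whatever layer you'd actually use (Rails.cache, memcached, a reverse proxy):

```ruby
# Minimal read-through cache: answer repeat reads from memory instead of
# hitting the slow database/disk on every request. Hypothetical sketch,
# not any particular library's API.
class ReadThroughCache
  def initialize(ttl_seconds: 60)
    @ttl = ttl_seconds
    @store = {} # key => [value, expires_at]
  end

  # Returns the cached value if fresh; otherwise runs the block
  # (the slow path) and stores its result.
  def fetch(key)
    value, expires_at = @store[key]
    return value if value && Time.now < expires_at
    value = yield
    @store[key] = [value, Time.now + @ttl]
    value
  end
end

db_reads = 0
cache = ReadThroughCache.new(ttl_seconds: 300)
2.times { cache.fetch("timeline:home") { db_reads += 1; "rendered timeline" } }
# db_reads is now 1: the second request never touched the "database"
```

The same idea extends down the stack: page/fragment caches in front of the app server, and query caches in front of the database, all trading memory for avoided disk reads.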
Weak technology stack and a deeply flawed concept of federation that enables local centralization of control, discord-mods-meme style, with all the corresponding issues.<p>Mastodon should have been based on a DHT, with each "terminal" aka "profile" having much higher autonomy.<p>Otherwise, it just gives more tools to the people who left Twitter to continue doing the same societal damage.<p>p.s.: it is time to stop writing back-ends in Ruby when every other popular alternative (sans the Python-based ones) is more powerful and scalable.
These comments make me want to log off for a bit.<p>Post: we hit scaling issues caused by our failing disks and by running image hosting and databases over NFS<p>HN: It's obviously Ruby on Rails' fault
30,000 users seems like a ludicrously small number at which to hit scaling problems. It sounds like Mastodon has not been designed for scale from the ground up, which is surprising for a project that hopes to be a popular social network.