This is from 2021 and was discussed then at <a href="https://news.ycombinator.com/item?id=25907312" rel="nofollow">https://news.ycombinator.com/item?id=25907312</a><p>I'm curious what they're doing now.
To overcome the limitations on cluster size in Kubernetes, folks may want to look at the Armada Project ( <a href="https://armadaproject.io/" rel="nofollow">https://armadaproject.io/</a> ). Armada is a
multi-Kubernetes cluster batch job scheduler, and is designed to address the
following issues:<p>A single Kubernetes cluster cannot be scaled indefinitely, and managing very large Kubernetes clusters is challenging. Hence, Armada is a multi-cluster
scheduler built on top of several Kubernetes clusters.<p>Achieving very high throughput using the in-cluster storage backend, etcd, is
challenging. Hence, queueing and scheduling are performed partly out-of-cluster
using a specialized storage layer.<p>Armada is designed primarily for ML, AI, and data analytics workloads, and to:<p>- Manage compute clusters composed of tens of thousands of nodes in total.
- Schedule a thousand or more pods per second, on average.
- Enqueue tens of thousands of jobs over a few seconds.
- Divide resources fairly between users.
- Provide visibility for users and admins.
- Ensure near-constant uptime.<p>Armada is written in Go, using Apache Pulsar for eventing, plus PostgreSQL and Redis. A web-based front-end (named "Lookout") gives end users an easy view of the state of enqueued/running/failed jobs. A Kubernetes Operator for quick installation and deployment of Armada is in development.<p>Source code is available at <a href="https://github.com/armadaproject/armada">https://github.com/armadaproject/armada</a> - we welcome
contributors and user reports!
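As a rough illustration (the field names here are from memory of the project's docs and may differ in current releases, so treat this as a sketch rather than a reference), submitting work to Armada looks roughly like writing a job file that wraps a standard Kubernetes pod spec and pointing armadactl at it:<p><pre><code>
# Hypothetical Armada job submission file (jobs.yaml);
# field names approximate, check the docs for your release.
queue: team-a            # jobs are enqueued per queue for fair sharing
jobSetId: experiment-1   # groups related jobs, e.g. for viewing in Lookout
jobs:
  - priority: 0
    podSpec:             # a standard Kubernetes pod spec
      restartPolicy: Never
      containers:
        - name: trainer
          image: busybox:latest
          command: ["sleep", "30"]
          resources:
            requests:
              cpu: 1
              memory: 1Gi
            limits:
              cpu: 1
              memory: 1Gi
</code></pre><p>Then something like `armadactl submit jobs.yaml` enqueues it; Armada picks a member cluster and schedules the pod there.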
I'm not a huge fan of Kubernetes. However, I think there are some great use cases and undeniably some super intelligent people pushing it to amazing limits.<p>That said, after reading over this there are some serious red flags. I wonder if this team even understands what alternatives there are for scheduling at this scale, or the real trade-offs. It seems like an average choice at best, and if I were paying the light bill I'd definitely object to going this route.
>> Pods communicate directly with one another on their pod IP addresses with MPI via SSH<p>It would be nice if someone could solve this problem in a more Kubernetes-native way. I.e., here is a container; run it on N nodes using MPI, optimizing for the right NUMA node / GPU configurations.<p>Perhaps even MPI itself needs an overhaul. Is a daemon really necessary within Kubernetes, for example?
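For what it's worth, the Kubeflow MPI Operator is one existing attempt at this: you declare an MPIJob and the operator wires up the launcher and worker pods for you instead of you managing SSH between pod IPs by hand. A rough sketch (field names as I remember the v2beta1 API, so approximate):<p><pre><code>
# Hypothetical MPIJob manifest; image name and command are placeholders.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi-example
spec:
  slotsPerWorker: 1          # MPI slots per worker pod
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: mpi-app:latest          # hypothetical image
              command: ["mpirun", "-n", "4", "/app/pi"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: worker
              image: mpi-app:latest
</code></pre><p>It doesn't address the NUMA/GPU placement question directly, but it does show the "here is a container, run it on N nodes with MPI" shape as a first-class Kubernetes object.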