TechEcho
Scaling Kubernetes to 7,500 nodes (2021)

95 points by izwasm about 2 years ago

8 comments

sciurus about 2 years ago
This is from 2021 and was discussed then at https://news.ycombinator.com/item?id=25907312

I'm curious what they're doing now.
antonchekhov about 2 years ago
To overcome the limitations on cluster size in Kubernetes, folks may want to look at the Armada Project ( https://armadaproject.io/ ). Armada is a multi-Kubernetes cluster batch job scheduler, and is designed to address the following issues:

A single Kubernetes cluster cannot be scaled indefinitely, and managing very large Kubernetes clusters is challenging. Hence, Armada is a multi-cluster scheduler built on top of several Kubernetes clusters.

Achieving very high throughput using the in-cluster storage backend, etcd, is challenging. Hence, queueing and scheduling are performed partly out-of-cluster using a specialized storage layer.

Armada is designed primarily for ML, AI, and data analytics workloads, and to:

- Manage compute clusters composed of tens of thousands of nodes in total.
- Schedule a thousand or more pods per second, on average.
- Enqueue tens of thousands of jobs over a few seconds.
- Divide resources fairly between users.
- Provide visibility for users and admins.
- Ensure near-constant uptime.

Armada is written in Go, using Apache Pulsar for eventing, PostgreSQL, and Redis. A web-based front-end (named "Lookout") provides easy end-user access to see the state of enqueued/running/failed jobs. A Kubernetes Operator to provide quick installation and deployment of Armada is in development.

Source code is available at https://github.com/armadaproject/armada - we welcome contributors and user reports!
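
To make the "queue outside the clusters, divide resources fairly between users" idea above concrete, here is a minimal toy sketch in Python. It is not Armada's algorithm or API: the `Cluster` and `Scheduler` names, the CPU-only resource model, and the "serve the user with the least allocated CPU first" rule are all assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for one Kubernetes cluster's spare capacity."""
    name: str
    free_cpus: int


@dataclass
class Scheduler:
    """Toy out-of-cluster queue: one FIFO per user, shared across clusters."""
    clusters: list
    queues: dict = field(default_factory=dict)  # user -> deque of (job_id, cpus)
    usage: dict = field(default_factory=dict)   # user -> CPUs allocated so far

    def submit(self, user, job_id, cpus):
        self.queues.setdefault(user, deque()).append((job_id, cpus))
        self.usage.setdefault(user, 0)

    def schedule_round(self):
        """Place queued jobs, always serving the least-allocated user first."""
        placements = []
        blocked = set()  # users whose head-of-queue job fits nowhere right now
        while True:
            candidates = [u for u, q in self.queues.items()
                          if q and u not in blocked]
            if not candidates:
                return placements
            # Fairness rule (an assumption): lowest current usage wins,
            # with the user name as a deterministic tie-breaker.
            user = min(candidates, key=lambda u: (self.usage[u], u))
            job_id, cpus = self.queues[user][0]
            # Placement rule (an assumption): pick the emptiest cluster.
            target = max(self.clusters, key=lambda c: c.free_cpus)
            if target.free_cpus < cpus:
                blocked.add(user)
                continue
            self.queues[user].popleft()
            target.free_cpus -= cpus
            self.usage[user] += cpus
            placements.append((job_id, user, target.name))


sched = Scheduler([Cluster("a", 4), Cluster("b", 4)])
for job in ("a1", "a2", "a3"):
    sched.submit("alice", job, 2)
sched.submit("bob", "b1", 2)
placements = sched.schedule_round()
```

In this toy run, bob's single job is placed second even though alice submitted three jobs first, which is the kind of behavior "divide resources fairly between users" suggests. A real scheduler would also have to handle priorities, preemption, GPUs, and persistence.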
vvladymyrov about 2 years ago
Also they use Ray.io from Anyscale https://archive.ph/ZlMi5
mrits about 2 years ago
I'm not a huge fan of Kubernetes. However, I think there are some great use cases and undeniably some super intelligent people pushing it to amazing limits.

However, after reading over this there are some serious red flags. I wonder if this team even understands what alternatives there are for scheduling at this scale or the real trade-offs. It seems like an average choice at best and if I was paying the light bill I'd definitely object to going this route.
osigurdson about 2 years ago
>> Pods communicate directly with one another on their pod IP addresses with MPI via SSH

It would be nice if someone could solve this problem in a more Kubernetes-native way. I.e. here is a container, run it on N nodes using MPI, optimizing for the right NUMA node / GPU configurations.

Perhaps even MPI itself needs an overhaul. Is a daemon really necessary within Kubernetes for example?
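
For readers unfamiliar with the setup quoted above, here is a rough sketch of what "MPI over pod IPs" tends to involve: something has to turn the list of pod IPs into a hostfile and an `mpirun` invocation. The helper names `build_hostfile` and `mpirun_argv`, the example IPs, and the `./train` program are hypothetical. The `ip slots=N` hostfile line format and the `--hostfile` / `-np` flags are Open MPI's. How the article's authors actually generate this is not stated in the thread.

```python
def build_hostfile(pod_ips, slots_per_pod):
    """Render an Open MPI style hostfile: one 'ip slots=N' line per pod."""
    if slots_per_pod < 1:
        raise ValueError("slots_per_pod must be at least 1")
    return "".join(f"{ip} slots={slots_per_pod}\n" for ip in pod_ips)


def mpirun_argv(hostfile_path, pod_count, slots_per_pod, program):
    """Build the argument list for launching one rank per slot."""
    total_ranks = pod_count * slots_per_pod
    return ["mpirun", "--hostfile", hostfile_path,
            "-np", str(total_ranks), *program]


# Hypothetical example: two pods with 8 GPUs each, one rank per GPU.
pod_ips = ["10.0.0.1", "10.0.0.2"]
hostfile_text = build_hostfile(pod_ips, slots_per_pod=8)
argv = mpirun_argv("/tmp/hostfile", len(pod_ips), 8, ["./train"])
```

The sketch only builds strings. Nothing in it knows about NUMA or GPU topology, which is the gap the comment is pointing at.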
rmorey about 2 years ago
good read. should probably get [2021] tag
bbarnett about 2 years ago
Success! Meanwhile, all 7500 nodes are, computationally, replaced by a 96 core, $10k server, in a dude's basement.

With power to spare.
satvikpendem about 2 years ago
Is Kubernetes simply BEAM but not on Erlang?