This is a really fantastic set of general "how to tune kubernetes and the various components for large clusters". Thanks for writing this up!
I'm surprised that the scaling story of k8s (+etcd?) is still so far behind mesos/zk. There have been mesos clusters at over 10k nodes for several years now.

I have never personally needed more than a few hundred mesos agents, but these have been added without any noticeable impact on our extremely modestly provisioned (and multi-purpose) zk cluster or any other components.

Has anyone used both systems and can speak to any advantages of k8s for these types of workloads?

Also, is anyone using some kind of torrent approach as a more reasonable solution to avoid network bottlenecks when distributing big docker images to a large number of nodes?
What I find amazing about k8s is that it's one of the first solutions that is relatively simple for a small cluster (HA, while scheduling stuff on the masters), but can scale amazingly well even for a big cluster.
You can start with 3 nodes with like 8GB per machine (or less; I guess even 2GB is feasible if you only want to use like 1-1.5GB of memory per machine).
(non-HA can of course be smaller)
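For the "schedule stuff on the masters" part, here's a minimal sketch, assuming a kubeadm-style setup where the control-plane nodes carry the default NoSchedule taint:

    # Allow regular workloads to land on the master/control-plane nodes
    # by removing the node-role.kubernetes.io/master taint from every node.
    kubectl taint nodes --all node-role.kubernetes.io/master-

    # Check that no NoSchedule taints remain.
    kubectl describe nodes | grep -i taint

Once the cluster grows and master memory gets tight, you can re-add the taint and push workloads onto dedicated workers.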
350TB of memory and 50,000 cores, nice.

ARP caching seems to be a common issue in cloud environments. AWS recommends turning it off and does so itself in their Amazon Linux distro.
Ran into the ARP scaling issues when trying to put 1000 containers on a single system for scale testing over a year ago. strace helped figure out where the issue was and what settings to change. I guess I should have sent an email to the mailing list. At that time, searching for how to scale to 1000 docker containers came up empty; everything was "hey, here is how I scaled to 1000 containers over X number of nodes". No one was crazy enough to try to get 1000 on a single machine.
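In case it saves someone else the strace session: the settings in question are the kernel's neighbor (ARP) table garbage-collection thresholds. A rough sketch, where the exact values are illustrative and depend on how many addresses each node actually talks to:

    # /etc/sysctl.d/99-arp-cache.conf
    # Raise the neighbor (ARP) table limits so thousands of container IPs
    # per node don't trigger "neighbor table overflow" and dropped entries.
    net.ipv4.neigh.default.gc_thresh1 = 80000   # below this, entries are never GC'd
    net.ipv4.neigh.default.gc_thresh2 = 90000   # soft limit, GC becomes aggressive
    net.ipv4.neigh.default.gc_thresh3 = 100000  # hard limit on table size

    # Apply without a reboot:
    sysctl --system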
Isn't it a problem to have etcd store its state on a non-persistent volume?

How do they recover it after a restart? I suppose it's not a manual process.
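Not sure what they do specifically, but the usual pattern is to treat etcd as rebuildable: take periodic snapshots, ship them somewhere durable, and restore on loss. A hedged sketch with etcdctl (the endpoints, cert paths and directories here are illustrative):

    # Take a snapshot of the running cluster (run periodically, ship off-node).
    ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
      snapshot save /backup/etcd-snapshot.db

    # After losing the data dir, restore the snapshot into a fresh directory
    # and point etcd at it before starting the member again.
    ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
      --data-dir /var/lib/etcd-restored

And if only a single member loses its disk while a quorum survives, it can usually just be removed and re-added as a fresh member and sync from its peers, no snapshot needed.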