
GPUs as a service with Kubernetes Engine are now generally available

81 points | by rey12rey | almost 7 years ago

5 comments

minimaxir · almost 7 years ago
With the new discounts on preemptible GPUs (https://cloudplatform.googleblog.com/2018/06/Introducing-improved-pricing-for-Preemptible-GPUs.html), the economics of quickly spinning up a fleet of GPUs with Kubernetes for a quick parallelizable ML task become very *interesting* (assuming Google allows non-enterprise users enough GPU quota for a fleet, anyway).

What I want to use Kubernetes + an instant GPU fleet for is deep-learning hyperparameter grid search: spin up a lot of preemptible GPUs and, for each parameter configuration, train the model on a single GPU in parallel, so scan speed scales linearly.

Kubeflow (https://github.com/kubeflow/kubeflow) is *close* to this functionality, but not quite there yet in user-friendliness: you have to package everything in a huge Docker container and launch jobs from the CLI, whereas ideally I would spawn containers and start training directly from the JupyterHub notebook on the master node.
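A minimal sketch of what that fan-out could look like with the official `kubernetes` Python client, assuming a GKE cluster whose GPU node pool already has the NVIDIA driver installed; the project, image name, and `--lr` flag are hypothetical and not taken from the comment:

```python
# Sketch: one Kubernetes Job per hyperparameter configuration,
# each requesting a single GPU. Image and CLI flags are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
batch = client.BatchV1Api()

learning_rates = [1e-2, 1e-3, 1e-4]

for i, lr in enumerate(learning_rates):
    container = client.V1Container(
        name="train",
        image="gcr.io/my-project/train:latest",   # hypothetical training image
        args=["--lr", str(lr)],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}         # one GPU per trial
        ),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"grid-search-{i}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    containers=[container],
                    restart_policy="Never",        # preemptions surface as pod failures
                )
            ),
            backoff_limit=3,                       # retry trials killed by preemption
        ),
    )
    batch.create_namespaced_job(namespace="default", body=job)
```

Each trial is an independent Job, so preempted nodes only cost a retry of that one configuration rather than the whole sweep.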
kozikow · almost 7 years ago
We have been using GPUs with GKE for a while. At some point we used 20+ GPUs in a production workflow without any problems.

Everything generally works well, except perhaps the initial phase, when some containers won't port cleanly from nvidia-docker-compose due to problems with the CUDA libraries. Ideally, you need to match the CUDA version everywhere.

My dev setup for quick experimentation with GPU Docker containers on GKE: https://tensorflight.blog/2018/02/23/dev-environment-for-gke/
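A quick generic check for that kind of version mismatch is to print the driver-side and framework-side CUDA versions from inside the running container; this is only a sketch using PyTorch and `nvidia-smi`, not something from the linked post:

```python
# Sketch: surface the node's driver/CUDA banner next to the CUDA version
# the framework in the container was built against, so mismatches are obvious.
import subprocess

import torch

smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
print(smi.splitlines()[2])  # banner line: driver version + CUDA version on the node

print("torch built with CUDA:", torch.version.cuda)
print("GPU visible inside the container:", torch.cuda.is_available())
```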
rjain15 · almost 7 years ago
Are these GPUs on bare metal, or virtualized GPUs attached to VMs?
fpgaminer · almost 7 years ago
On a related note, last week I took a dive into Kubernetes on gcloud for a personal project and came out with some interesting knowledge.

First off, this was for a _small_ personal project, something I originally intended to run on an f1-micro. I decided to check out Kubernetes mostly to learn, but also to see if it could offer a more maintainable setup (typically I just write a mess of provisioning shell scripts and cloud-init scripts to bring up servers, which is a bit of a mess to maintain long-term). So basically I was using Kubernetes "wrong"; its target audience is workloads that intend to use a fleet of machines. But I trudged forward anyway.

This resulted in the first problem. You can't spin up a Kubernetes cluster with just one f1-micro; Google won't let you. I could either do a 3x f1-micro cluster, which would be ~$12/month, or 1x f1-small, which would be about the same price. Contrast with my original plan of a single f1-micro, which is ~$4/mo. Hmm...

Well, after playing around I discovered a "bug" in gcloud's tooling. You can spin up a 3x f1-micro cluster, then add a node pool with just one or two f1-micros in it, then kill the original node pool. This is all allowed, and results in a final cluster with only one or two nodes. Nice. "I know what I'm doing, Google is just being a dick!" I thought. I could still spin up Pods on the cluster, no problem.

Then came the second discovery. The Kubernetes console was reporting unschedulable system pods. Turns out, Google has a reason for those minimums.

All the system pods (the pods that help orchestrate the show, provide logging, metrics, etc.) take up a whopping 700 MB of RAM and a good chunk of CPU as well. I was a bit shocked.

I'm sure most developers are just shrugging right now; 700 MB is nothing these days. But remember, my original plan was a single f1-micro, which only has 700 MB. This is a personal project, so every bit counts to keep long-term costs down. And, in deep contrast to Kubernetes' gluttony, the app I intend to run on this system only uses ~32 MB under usual loads. That's right: 32 MB. It's a webapp running on a Rust web server.

So hopefully you can imagine my shock at Kube's RAM usage. As I dug in, I discovered that almost all of the services are built using Go. No wonder. I love Go, but it's a memory hog. My mind started imagining what the RAM usage would be like if all these services had been written in Rust...

Point being, 700 MB exceeds what one f1-micro can handle. And it exceeds what two f1-micros can handle, because a lot of those services are per-node services, combined with the base RAM usage of the (surprisingly) bloated Container-Optimized OS that Google runs on the cluster nodes (spinning up a Container-Optimized image on a GCE instance, I measured something like 500 MB or more of RAM usage on a bare install). Hence why Google won't let you spin up a cluster of fewer than three f1-micros. You can, however, use a single f1-small, since it has 1.7 GB of RAM in a single instance.

At this point I resigned myself to just having a cluster of three nodes. *shrug* The expense of learning, I suppose. And perhaps I could reuse the cluster to host other small projects.

It was at this point that I hit another roadblock. To expose services running on your cluster you, more or less, have to use the LoadBalancer feature of Kube. It's convenient; a single line of configuration and *BAM*, your service is live on the internet with a static IP. Except for one small detail that Google barely mentions: their load balancers cost, at minimum, $18/mo. That's more than my whole cluster! And my original budget was $4/mo...

There are workarounds, but they are ugly. NodePort doesn't work because you can't expose port 80 or 443 with it. You can use host networking or a host port, something like that: basically build your own load-balancer Pod, assign it to a specific node, and manually assign a static IP to that node (hand-waving and roughly recalling the awkward solution I conjured). But it requires manual intervention every time you want to perform maintenance on the cluster, which is the opposite of what I was trying to achieve.

To sum it all up: you need to be willing to spend _at least_ $30/mo on any given Kubernetes-based project.

So I gave up on that idea. For now I've fallen back on provisioning shell scripts, though I've shoved my application into containers and am using Docker Compose to at least make deployment a little nicer.

I also took a few hours to run through the Kubernetes The Hard Way tutorial, thirsty for a deeper understanding of how Kube works under the hood. It's a fascinating system. But after working through the tutorial it became _very_ clear that Kube isn't something you'd want to run yourself, not unless you have a dedicated sys/devops person to manage it.

Also interesting is that Kube falls over when you need to run a relational database. The impedance mismatch is too great. Kube is designed for services that can be spread across many disposable nodes, which is not something Postgres etc. are designed for. So the current recommendation, if you're using a relational database, is to just use traditional provisioning or a managed service like Cloud SQL.

P.S. For as long as I've used Google Cloud, I have been and continue to be eternally frustrated by the service. It's a complete mess. Last week, while doing this exploration, I ran into a problem where half my nodes were zombies, never starting and taking an hour to finally die. I had to switch regions to "fix" the problem. Gcloud provides _no_ support by default, even though I'm a paying customer. Rather, you have to pay _more_ for the _privilege_ of talking to someone about problems with the services you're already paying for. Incredibly frustrating, but that's Google's typical M.O.

Not to mention: 1) poor, out-dated documentation; 2) the gcloud CLI is abysmally slow to tab-complete even simple stuff; 3) the web console is made of molasses and eats an ungodly amount of CPU just sitting there doing nothing; 4) there is little to no way to restrict billing; the best you can do for most services is set up an alert and pray that you're awake if shit hits the fan; 5) I'm not sure I can recall a single gcloud command I've run lately that hasn't spewed at least one warning or deprecation notice at me.
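For what it's worth, the host-port workaround described above can be expressed with the `kubernetes` Python client roughly as follows; the node name, pod name, and nginx image are placeholders, and this is only a sketch of the idea, not the exact setup the comment describes:

```python
# Sketch: pin a single reverse-proxy pod to one known node and publish
# port 80 via hostPort, avoiding a paid cloud load balancer.
# Node name and image are hypothetical.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

proxy = client.V1Pod(
    metadata=client.V1ObjectMeta(name="edge-proxy"),
    spec=client.V1PodSpec(
        node_name="gke-cluster-pool-abc123-node-0",  # the node holding the static IP
        containers=[
            client.V1Container(
                name="nginx",
                image="nginx:1.25",
                ports=[client.V1ContainerPort(container_port=80, host_port=80)],
            )
        ],
    ),
)
core.create_namespaced_pod(namespace="default", body=proxy)
```

The remaining manual step is reserving a static external IP and attaching it to that node, which is exactly the maintenance burden the comment complains about.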
erikb · almost 7 years ago
How about making vanilla k8s usable on-premise first...