
GPUs as a service with Kubernetes Engine are now generally available

81 points | by rey12rey | almost 7 years ago

5 comments

minimaxir · almost 7 years ago
With the new discounts on preemptible GPUs (https://cloudplatform.googleblog.com/2018/06/Introducing-improved-pricing-for-Preemptible-GPUs.html), the economics of quickly spinning up a fleet of GPUs with Kubernetes for a quick parallelizable ML task become very *interesting* (assuming Google allows non-enterprise users enough GPU quota for a fleet, anyway).

What I want to use Kubernetes + an instant GPU fleet for is deep-learning hyperparameter grid search: spin up a lot of preemptible GPUs and, for each parameter configuration, train the model on a single GPU in parallel, so scan speed scales linearly.

Kubeflow (https://github.com/kubeflow/kubeflow) is *close* to this functionality, but not quite there yet in user-friendliness: you have to package everything in a huge Docker container and launch jobs from the CLI, whereas ideally I would spawn containers and start training directly from the JupyterHub notebook on the master node.
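A minimal sketch of what that fan-out could look like with the official `kubernetes` Python client, assuming a GKE cluster whose GPU node pool already has the NVIDIA driver installed; the project, image name, and `--lr` flag are hypothetical and not taken from the comment:

```python
# Sketch: one Kubernetes Job per hyperparameter configuration,
# each requesting a single GPU. Image and CLI flags are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
batch = client.BatchV1Api()

learning_rates = [1e-2, 1e-3, 1e-4]

for i, lr in enumerate(learning_rates):
    container = client.V1Container(
        name="train",
        image="gcr.io/my-project/train:latest",   # hypothetical training image
        args=["--lr", str(lr)],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}         # one GPU per trial
        ),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"grid-search-{i}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    containers=[container],
                    restart_policy="Never",        # preemptions surface as pod failures
                )
            ),
            backoff_limit=3,                       # retry trials killed by preemption
        ),
    )
    batch.create_namespaced_job(namespace="default", body=job)
```

Each trial is an independent Job, so preempted nodes only cost a retry of that one configuration rather than the whole sweep.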
kozikow · almost 7 years ago
We have been using GPUs with GKE for a while. At some point we used 20+ GPUs in a production workflow without any problems.

Everything generally works well, except perhaps the initial phase, when some containers won't port cleanly from nvidia-docker-compose due to problems with the CUDA libraries. Ideally, you need to match the CUDA version everywhere.

My dev setup for quick experimentation with GPU Docker containers on GKE: https://tensorflight.blog/2018/02/23/dev-environment-for-gke/
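A quick generic check for that kind of version mismatch is to print the driver-side and framework-side CUDA versions from inside the running container; this is only a sketch using PyTorch and `nvidia-smi`, not something from the linked post:

```python
# Sketch: surface the node's driver/CUDA banner next to the CUDA version
# the framework in the container was built against, so mismatches are obvious.
import subprocess

import torch

smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
print(smi.splitlines()[2])  # banner line: driver version + CUDA version on the node

print("torch built with CUDA:", torch.version.cuda)
print("GPU visible inside the container:", torch.cuda.is_available())
```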
rjain15 · almost 7 years ago
Are these GPUs on bare metal, or virtualized GPUs attached to VMs?
fpgaminer · almost 7 years ago
On a related note, last week I took a dive into Kubernetes on gcloud for a personal project and came out with some interesting knowledge.

First off, this was for a _small_ personal project, something I originally intended to run on an f1-micro. I decided to check out Kubernetes mostly to learn, but also to see if it could offer a more maintainable setup (typically I just write a mess of provisioning shell scripts and cloud-init scripts to bring up servers, which is a bit of a mess to maintain long-term). So basically I was using Kubernetes "wrong"; its target audience is workloads that intend to use a fleet of machines. But I trudged forward anyway.

This resulted in the first problem. You can't spin up a Kubernetes cluster with just one f1-micro; Google won't let you. I could either do a 3x f1-micro cluster, which would be ~$12/month, or 1x f1-small, which would be about the same price. Contrast with my original plan of a single f1-micro, which is ~$4/mo. Hmm...

Well, after playing around I discovered a "bug" in gcloud's tooling. You can spin up a 3x f1-micro cluster, then add a node pool with just one or two f1-micros in it, then kill the original node pool. This is all allowed, and results in a final cluster with only one or two nodes. Nice. "I know what I'm doing, Google is just being a dick!" I thought. I could still spin up Pods on the cluster, no problem.

Then came the second discovery. The Kubernetes console was reporting unschedulable system pods. Turns out, Google has a reason for those minimums.

All the system pods (the pods that help orchestrate the show, provide logging, metrics, etc.) take up a whopping 700 MB of RAM and a good chunk of CPU as well. I was a bit shocked.

I'm sure most developers are just shrugging right now; 700 MB is nothing these days. But remember, my original plan was a single f1-micro, which only has 700 MB. This is a personal project, so every bit counts to keep long-term costs down. And, in deep contrast to Kubernetes' gluttony, the app I intend to run on this system only uses ~32 MB under usual loads. That's right: 32 MB. It's a webapp running on a Rust web server.

So hopefully you can imagine my shock at Kube's RAM usage. As I dug in, I discovered that almost all of the services are built using Go. No wonder. I love Go, but it's a memory hog. My mind started imagining what the RAM usage would be like if all these services had been written in Rust...

Point being, 700 MB exceeds what one f1-micro can handle. And it exceeds what two f1-micros can handle, because a lot of those services are per-node services, combined with the base RAM usage of the (surprisingly) bloated Container-Optimized OS that Google runs on the cluster nodes (spinning up a Container-Optimized image on a GCE instance, I measured something like 500 MB or more of RAM usage on a bare install). Hence why Google won't let you spin up a cluster of fewer than three f1-micros. You can, however, use a single f1-small, since it has 1.7 GB of RAM in a single instance.

At this point I resigned myself to just having a cluster of three nodes. *shrug* The expense of learning, I suppose. And perhaps I could reuse the cluster to host other small projects.

It was at this point that I hit another roadblock. To expose services running on your cluster you, more or less, have to use the LoadBalancer feature of Kube. It's convenient; a single line of configuration and *BAM*, your service is live on the internet with a static IP. Except for one small detail that Google barely mentions: their load balancers cost, at minimum, $18/mo. That's more than my whole cluster! And my original budget was $4/mo...

There are workarounds, but they are ugly. NodePort doesn't work because you can't expose port 80 or 443 with it. You can use host networking or a host port, something like that: basically build your own load-balancer Pod, assign it to a specific node, and manually assign a static IP to that node (hand-waving and roughly recalling the awkward solution I conjured). But it requires manual intervention every time you want to perform maintenance on the cluster, which is the opposite of what I was trying to achieve.

To sum it all up: you need to be willing to spend _at least_ $30/mo on any given Kubernetes-based project.

So I gave up on that idea. For now I've fallen back on provisioning shell scripts, though I've shoved my application into containers and am using Docker Compose to at least make deployment a little nicer.

I also took a few hours to run through the Kubernetes The Hard Way tutorial, thirsty for a deeper understanding of how Kube works under the hood. It's a fascinating system. But after working through the tutorial it became _very_ clear that Kube isn't something you'd want to run yourself, not unless you have a dedicated sys/devops person to manage it.

Also interesting is that Kube falls over when you need to run a relational database. The impedance mismatch is too great. Kube is designed for services that can be spread across many disposable nodes, which is not something Postgres etc. are designed for. So the current recommendation, if you're using a relational database, is to just use traditional provisioning or a managed service like Cloud SQL.

P.S. For as long as I've used Google Cloud, I have been and continue to be eternally frustrated by the service. It's a complete mess. Last week, while doing this exploration, I ran into a problem where half my nodes were zombies, never starting and taking an hour to finally die. I had to switch regions to "fix" the problem. Gcloud provides _no_ support by default, even though I'm a paying customer. Rather, you have to pay _more_ for the _privilege_ of talking to someone about problems with the services you're already paying for. Incredibly frustrating, but that's Google's typical M.O.

Not to mention: 1) poor, out-dated documentation; 2) the gcloud CLI is abysmally slow to tab-complete even simple stuff; 3) the web console is made of molasses and eats an ungodly amount of CPU just sitting there doing nothing; 4) there is little to no way to restrict billing; the best you can do for most services is set up an alert and pray that you're awake if shit hits the fan; 5) I'm not sure I can recall a single gcloud command I've run lately that hasn't spewed at least one warning or deprecation notice at me.
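For what it's worth, the host-port workaround described above can be expressed with the `kubernetes` Python client roughly as follows; the node name, pod name, and nginx image are placeholders, and this is only a sketch of the idea, not the exact setup the comment describes:

```python
# Sketch: pin a single reverse-proxy pod to one known node and publish
# port 80 via hostPort, avoiding a paid cloud load balancer.
# Node name and image are hypothetical.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

proxy = client.V1Pod(
    metadata=client.V1ObjectMeta(name="edge-proxy"),
    spec=client.V1PodSpec(
        node_name="gke-cluster-pool-abc123-node-0",  # the node holding the static IP
        containers=[
            client.V1Container(
                name="nginx",
                image="nginx:1.25",
                ports=[client.V1ContainerPort(container_port=80, host_port=80)],
            )
        ],
    ),
)
core.create_namespaced_pod(namespace="default", body=proxy)
```

The remaining manual step is reserving a static external IP and attaching it to that node, which is exactly the maintenance burden the comment complains about.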
erikb · almost 7 years ago
How about making vanilla k8s usable on-premise first...