$0.90 per K80 GPU per hour, while expensive, opens up so many opportunities - especially when you can get a properly connected machine.<p>Just as an example of the change this entails for deep learning, the recent "Exploring the Limits of Language Modeling"[1] paper from Google used 32 K40 GPUs. While the K40 / K80 are not the most recent generation of GPU, they're still powerful beasts, and finding a large number of them set up well is a challenge for most.<p>In only 2 hours, their model beat the previous state-of-the-art results.
Their new best result was achieved after three weeks of compute.<p>With two assumptions - that a K80 is approximately 2× a K40 and that you could run the model with similar efficiency - that means you could beat the previous state of the art for ~$28.80 and replicate the paper's new state of the art for ~$7,257.60, all using a single P2 instance (rough arithmetic below).<p>While the latter number is extreme, the former isn't. It's expensive but still opens up so many opportunities. Everything from hobbyists competing in Kaggle competitions to that tiny division inside a big company that would never be able to provision GPU access otherwise - and of course the startups in between.<p>* I'm not even going to try to compare the old Amazon GPU instances to the new ones as they're not even in the same ballpark. They have far less memory and don't support many of the features required for efficient use of modern deep learning frameworks.<p>[1]: <a href="https://arxiv.org/abs/1602.02410" rel="nofollow">https://arxiv.org/abs/1602.02410</a>
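For anyone who wants to sanity-check those figures, here's a minimal back-of-the-envelope sketch. The assumptions are mine: a p2.16xlarge exposes 16 K80 GPUs at the $14.40/hour on-demand rate mentioned elsewhere in this thread, and 32 K40s ≈ 16 K80s.<p><pre><code> # Cost arithmetic for the claim above (all figures are assumptions, not measurements)
 p2_16xlarge_hourly = 14.40        # USD/hour on-demand, 16 K80 GPUs (~32 K40 equivalents)
 hours_to_beat_sota = 2            # "in only 2 hours"
 hours_to_replicate = 3 * 7 * 24   # three weeks of compute = 504 hours

 print(f"Beat previous SOTA: ~${p2_16xlarge_hourly * hours_to_beat_sota:,.2f}")   # ~$28.80
 print(f"Replicate new SOTA: ~${p2_16xlarge_hourly * hours_to_replicate:,.2f}")   # ~$7,257.60
</code></pre>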
MeteoSwiss uses a 12-node cluster with 16 GPUs per node to compute its weather forecasts (Piz Escha/Kesch). The machine has been in operation for more than a year, and we were able to shut down the old machine (Piz Albis/Lema) last week.<p>The 1.1 km forecast runs on 144 GPUs, and the 2.2 km probabilistic ensemble forecast is computed on 168 GPUs (8 GPUs, or half a node, per ensemble member). The 7 km EU forecast also runs on 8 GPUs.
For anyone who is interested in ML/DL in the cloud:<p>Google Cloud Platform just released Cloud ML in beta with a different pricing model; see <a href="https://cloud.google.com/ml/" rel="nofollow">https://cloud.google.com/ml/</a><p>Cloud ML costs $0.49/hour to $36.75/hour, compared to AWS at $0.90/hour to $14.40/hour.<p>The huge difference between $36.75/hour (Google) and $14.40/hour (AWS) makes me wonder what Cloud ML is using; they mention GPUs (TPUs?) but not the exact GPU model.
<p><pre><code> All of the instances are powered by an AWS-Specific version of Intel’s Broadwell processor, running at 2.7 GHz.
</code></pre>
Does anyone have any more information about this? Are the chips fabricated separately, or are the differences just in microcode?
Keep in mind the Nvidia K80 is a two-year-old Kepler GPU. Nvidia has launched two newer microarchitectures since then: Maxwell and Pascal. I would expect to see some P100 Pascal GPUs "soon" on AWS - maybe in 6 months? (Maxwell's severely handicapped double-precision performance reduces its utility for many workloads.)
One thing I discovered recently is that for GPU instances your initial limit on AWS is 0, meaning you have to ask support to raise it before you can launch one.<p>(This might be specific to my account, though - it has only had small bills so far.)
My first thought: "I wonder what the economics are like, re: cryptocurrency mining?"<p>My second thought: "I wonder if Amazon use their 'idle' capacity to mine cryptocurrency?"<p>With respect to my second thought, at their scale, and at the cost they'd be paying for electricity, it could quite possibly be a good hedge.
This is great - we'll try to get our Tensorflow and Caffe AMI repo updated soon: <a href="https://github.com/RealScout/deep-learning-images" rel="nofollow">https://github.com/RealScout/deep-learning-images</a>
According to [1], the K80 GPUs have the following specs:<p><pre><code> Chips: 2× GK210
 Thread processors: 4992 (total)
 Base clock: 560 MHz
 Max boost clock: 875 MHz
 Memory size: 2× 12288 MB (24 GB total)
 Memory clock: 5000 MHz (effective)
 Bus type: GDDR5
 Bus width: 2× 384-bit
 Bandwidth: 2× 240 GB/s
 Single precision: 5591–8736 GFLOPS (MAD or FMA)
 Double precision: 1864–2912 GFLOPS (FMA)
 CUDA compute capability: 3.7
</code></pre>
Is that a good deal for ~$1/hour? (I'm not sure if a p2.xlarge instance corresponds to a whole K80 or half of it.)<p>How much would it cost to "train" ImageNet using such instances? Or perhaps another standard DNN task for which the data is openly available?<p>______<p>[1] <a href="https://en.wikipedia.org/wiki/Nvidia_Tesla#cite_ref-19" rel="nofollow">https://en.wikipedia.org/wiki/Nvidia_Tesla#cite_ref-19</a>
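One way to eyeball the value, under my assumption that a p2.xlarge exposes one GK210 (i.e. half a K80) at the $0.90/hour on-demand rate:<p><pre><code> # GFLOPS-per-dollar estimate from the spec table above.
 # Assumption: a p2.xlarge = one GK210 (half a K80) at $0.90/hour on-demand.
 k80_sp_gflops = (5591, 8736)                      # full K80, base clock to max boost
 per_gpu_gflops = [x / 2 for x in k80_sp_gflops]   # one GK210
 hourly_usd = 0.90

 lo, hi = (g / hourly_usd for g in per_gpu_gflops)
 print(f"~{lo:,.0f}-{hi:,.0f} peak single-precision GFLOPS per dollar-hour")
 # -> roughly 3,100-4,900 GFLOPS per dollar-hour (peak, not sustained throughput)
</code></pre>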
Priced this config (or close enough) on <a href="http://www.thinkmate.com/system/gpx-xt24-2460v3-8gpu" rel="nofollow">http://www.thinkmate.com/system/gpx-xt24-2460v3-8gpu</a><p>Comes to just under $50,000 for the server, or roughly 4.5-5 months of continuous use @ $14.40/hour.
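A quick break-even sketch (my assumptions: ~$50,000 up front versus continuous on-demand use at $14.40/hour, ignoring power, hosting, and reserved/spot pricing):<p><pre><code> # Break-even for buying an 8x K80 server vs renting a p2.16xlarge on-demand.
 server_cost_usd = 50000           # "just under $50,000", rounded up
 p2_16xlarge_hourly = 14.40        # USD/hour on-demand

 hours = server_cost_usd / p2_16xlarge_hourly
 print(f"{hours:,.0f} hours ≈ {hours / 24:,.0f} days ≈ {hours / (24 * 30):.1f} months")
 # -> ~3,472 hours ≈ 145 days ≈ 4.8 months of continuous use
</code></pre>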
Sounds like a great way to build a custom render farm. My home computer has a dirt cheap GPU but it works well enough for basic modeling & animation. Terrible for rendering, though. I've been thinking of using ECS to build a cluster of renderers for Maya that I can spin up when needed and scale to the appropriate size. I don't know for certain if it's cheaper than going with a service, but it sounds like it is (render farm subscriptions cost hundreds), and I would get complete control over the software being used. I am glad to hear that Amazon is doing this. Granted, I'm more of a hobbyist in this arena, so maybe it wouldn't work for someone more serious about creating graphics.
It is interesting to compare this to Nvidia's DGX-1 system. That server is based on the new Tesla P100 and uses NVLink rather than PCIe (about 10x faster). It boasts about 170 TFLOPS (half precision) vs the p2.16xlarge's 64 TFLOPS (single precision). If you run the p2.16xlarge full time for a year it would cost about the same as buying a DGX-1. Presumably Amazon releases their GPU instances on older hardware for cost savings.<p><a href="http://www.nvidia.com/object/deep-learning-system.html" rel="nofollow">http://www.nvidia.com/object/deep-learning-system.html</a>
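A rough check on that break-even claim, assuming the DGX-1's announced list price of roughly $129,000 (my figure, not from the article) and the $14.40/hour p2.16xlarge on-demand rate:<p><pre><code> # One year of continuous p2.16xlarge on-demand vs a one-off DGX-1 purchase.
 # The ~$129,000 DGX-1 price is an assumption based on its launch announcement.
 p2_16xlarge_hourly = 14.40
 dgx1_price_usd = 129000

 yearly_rental = p2_16xlarge_hourly * 24 * 365
 print(f"One year of p2.16xlarge: ~${yearly_rental:,.0f} vs DGX-1 at ~${dgx1_price_usd:,}")
 # -> ~$126,144/year, i.e. roughly the price of a DGX-1
</code></pre>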
Stupid question: are GPUs safe to share between two tenants in a datacenter? I read previously that there are very few security mechanisms in GPUs - in particular, that the memory is full of interesting garbage left over from previous users. So I would assume there is no hardware-enforced separation between the VMs either.