$0.90 per K80 GPU per hour, while expensive, opens up so many opportunities - especially when you can get a properly connected machine.<p>Just as an example of the change this entails for deep learning, the recent "Exploring the Limits of Language Modeling"[1] paper from Google used 32 K40 GPUs. While the K40 / K80 are not the most recent generation of GPU, they're still powerful beasts, and finding a large number of them set up well is a challenge for most.<p>In only 2 hours, their model beat the previous state-of-the-art results.
Their new best result was achieved after three weeks of compute.<p>With two assumptions - that a K80 is approximately 2× a K40 and that you could run the model with similar efficiency - that means you could beat the previous state of the art for ~$28.80 and replicate the paper's new state of the art for ~$7,257.60, all using a single P2 instance (rough arithmetic below).<p>While the latter number is extreme, the former isn't. It's expensive but still opens up so many opportunities. Everything from hobbyists competing in Kaggle competitions to that tiny division inside a big company that would never be able to provision GPU access otherwise - and of course the startups in between.<p>* I'm not even going to try to compare the old Amazon GPU instances to the new ones as they're not even in the same ballpark. They have far less memory and don't support many of the features required for efficient use of modern deep learning frameworks.<p>[1]: <a href="https://arxiv.org/abs/1602.02410" rel="nofollow">https://arxiv.org/abs/1602.02410</a>
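For anyone who wants to sanity-check those figures, here's a minimal back-of-the-envelope sketch. The assumptions are mine: a p2.16xlarge exposes 16 K80 GPUs at the $14.40/hour on-demand rate mentioned elsewhere in this thread, and 32 K40s ≈ 16 K80s.<p><pre><code> # Cost arithmetic for the claim above (all figures are assumptions, not measurements)
 p2_16xlarge_hourly = 14.40        # USD/hour on-demand, 16 K80 GPUs (~32 K40 equivalents)
 hours_to_beat_sota = 2            # "in only 2 hours"
 hours_to_replicate = 3 * 7 * 24   # three weeks of compute = 504 hours

 print(f"Beat previous SOTA: ~${p2_16xlarge_hourly * hours_to_beat_sota:,.2f}")   # ~$28.80
 print(f"Replicate new SOTA: ~${p2_16xlarge_hourly * hours_to_replicate:,.2f}")   # ~$7,257.60
</code></pre>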
MeteoSwiss uses a 12-node cluster with 16 GPUs per node to compute its weather forecasts (Piz Escha/Kesch). The machine has been in operation for more than a year, and we were able to shut down the old machine (Piz Albis/Lema) last week.<p>The 1.1 km forecast runs on 144 GPUs, and the 2.2 km probabilistic ensemble forecast is computed on 168 GPUs (8 GPUs, or half a node, per ensemble member). The 7 km EU forecast also runs on 8 GPUs.
For anyone who is interested in ML/DL in the cloud:<p>Google Cloud Platform just released Cloud ML in beta with a different pricing model; see <a href="https://cloud.google.com/ml/" rel="nofollow">https://cloud.google.com/ml/</a><p>Cloud ML costs $0.49/hour to $36.75/hour, compared to AWS at $0.90/hour to $14.40/hour.<p>The huge difference between $36.75/hour (Google) and $14.40/hour (AWS) makes me wonder what Cloud ML is using; they mention GPUs (TPUs?) but not the exact GPU model.
<p><pre><code> All of the instances are powered by an AWS-Specific version of Intel’s Broadwell processor, running at 2.7 GHz.
</code></pre>
Does anyone have any more information about this? Are the chips fabricated separately, or are the differences just in microcode?
Keep in mind the Nvidia K80 is a two-year-old Kepler GPU. Nvidia has launched two newer microarchitectures since then: Maxwell and Pascal. I would expect to see some P100 Pascal GPUs "soon" on AWS - maybe in 6 months? (Maxwell's severely handicapped double-precision performance reduces its utility for many workloads.)
One thing I discovered recently is that for GPU instances your initial limit on AWS is 0, meaning you have to ask support to raise it before you can launch one.<p>(This might be specific to my account, though - it has only had small bills so far.)
My first thought: "I wonder what the economics are like, re: cryptocurrency mining?"<p>My second thought: "I wonder if Amazon use their 'idle' capacity to mine cryptocurrency?"<p>With respect to my second thought, at their scale, and at the cost they'd be paying for electricity, it could quite possibly be a good hedge.
This is great - we'll try to get our Tensorflow and Caffe AMI repo updated soon: <a href="https://github.com/RealScout/deep-learning-images" rel="nofollow">https://github.com/RealScout/deep-learning-images</a>
According to [1], the K80 GPUs have the following specs:<p><pre><code> Chips: 2× GK210
 Thread processors: 4992 (total)
 Base clock: 560 MHz
 Max boost clock: 875 MHz
 Memory size: 2× 12288 MB (24 GB total)
 Memory clock: 5000 MHz (effective)
 Bus type: GDDR5
 Bus width: 2× 384-bit
 Bandwidth: 2× 240 GB/s
 Single precision: 5591–8736 GFLOPS (MAD or FMA)
 Double precision: 1864–2912 GFLOPS (FMA)
 CUDA compute capability: 3.7
</code></pre>
Is that a good deal for ~$1/hour? (I'm not sure if a p2.xlarge instance corresponds to a whole K80 or half of it.)<p>How much would it cost to "train" ImageNet using such instances? Or perhaps another standard DNN task for which the data is openly available?<p>______<p>[1] <a href="https://en.wikipedia.org/wiki/Nvidia_Tesla#cite_ref-19" rel="nofollow">https://en.wikipedia.org/wiki/Nvidia_Tesla#cite_ref-19</a>
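One way to eyeball the value, under my assumption that a p2.xlarge exposes one GK210 (i.e. half a K80) at the $0.90/hour on-demand rate:<p><pre><code> # GFLOPS-per-dollar estimate from the spec table above.
 # Assumption: a p2.xlarge = one GK210 (half a K80) at $0.90/hour on-demand.
 k80_sp_gflops = (5591, 8736)                      # full K80, base clock to max boost
 per_gpu_gflops = [x / 2 for x in k80_sp_gflops]   # one GK210
 hourly_usd = 0.90

 lo, hi = (g / hourly_usd for g in per_gpu_gflops)
 print(f"~{lo:,.0f}-{hi:,.0f} peak single-precision GFLOPS per dollar-hour")
 # -> roughly 3,100-4,900 GFLOPS per dollar-hour (peak, not sustained throughput)
</code></pre>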
Priced this config (or close enough) on <a href="http://www.thinkmate.com/system/gpx-xt24-2460v3-8gpu" rel="nofollow">http://www.thinkmate.com/system/gpx-xt24-2460v3-8gpu</a><p>Comes to just under $50,000 for the server, or roughly 4.5-5 months of continuous use @ $14.40/hour.
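A quick break-even sketch (my assumptions: ~$50,000 up front versus continuous on-demand use at $14.40/hour, ignoring power, hosting, and reserved/spot pricing):<p><pre><code> # Break-even for buying an 8x K80 server vs renting a p2.16xlarge on-demand.
 server_cost_usd = 50000           # "just under $50,000", rounded up
 p2_16xlarge_hourly = 14.40        # USD/hour on-demand

 hours = server_cost_usd / p2_16xlarge_hourly
 print(f"{hours:,.0f} hours ≈ {hours / 24:,.0f} days ≈ {hours / (24 * 30):.1f} months")
 # -> ~3,472 hours ≈ 145 days ≈ 4.8 months of continuous use
</code></pre>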
Sounds like a great way to build a custom render farm. My home computer has a dirt cheap GPU but it works well enough for basic modeling & animation. Terrible for rendering, though. I've been thinking of using ECS to build a cluster of renderers for Maya that I can spin up when needed and scale to the appropriate size. I don't know for certain if it's cheaper than going with a service, but it sounds like it is (render farm subscriptions cost hundreds), and I would get complete control over the software being used. I am glad to hear that Amazon is doing this. Granted, I'm more of a hobbyist in this arena, so maybe it wouldn't work for someone more serious about creating graphics.
It is interesting to compare this to Nvidia's DGX-1 system. That server is based on the new Tesla P100 and uses NVLink rather than PCIe (about 10x faster). It boasts about 170 TFLOPS (half precision) vs the p2.16xlarge's 64 TFLOPS (single precision). If you run the p2.16xlarge full time for a year it would cost about the same as buying a DGX-1. Presumably Amazon releases their GPU instances on older hardware for cost savings.<p><a href="http://www.nvidia.com/object/deep-learning-system.html" rel="nofollow">http://www.nvidia.com/object/deep-learning-system.html</a>
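A rough check on that break-even claim, assuming the DGX-1's announced list price of roughly $129,000 (my figure, not from the article) and the $14.40/hour p2.16xlarge on-demand rate:<p><pre><code> # One year of continuous p2.16xlarge on-demand vs a one-off DGX-1 purchase.
 # The ~$129,000 DGX-1 price is an assumption based on its launch announcement.
 p2_16xlarge_hourly = 14.40
 dgx1_price_usd = 129000

 yearly_rental = p2_16xlarge_hourly * 24 * 365
 print(f"One year of p2.16xlarge: ~${yearly_rental:,.0f} vs DGX-1 at ~${dgx1_price_usd:,}")
 # -> ~$126,144/year, i.e. roughly the price of a DGX-1
</code></pre>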
Stupid question: are GPUs safe to share between two tenants in a datacenter? I read previously that there are very few security mechanisms in GPUs - in particular, that the memory is full of interesting garbage left over from previous users. So I would assume there is no hardware-enforced separation between the VMs either.