V100 Server On-Prem vs. AWS P3 Instance Cost Comparison

141 points by rbranson over 6 years ago

25 comments

mbesto over 6 years ago

> Our TCO includes energy, hiring a part-time system administrator, and co-location costs.

Which is a myopic view of TCO. It ignores many things about purchasing on-prem hardware (for good or bad):

Admin:

- The cost of finding an admin who understands how this thing works.

Speed / convenience:

- Try before you buy.
- The time it takes for the box to be built, shipped, and sent to the data center.
- The time (and cost) it takes to install software, drivers, etc.

Maintenance / capitalization / finances:

- 3 years? What is the useful life of this hardware? When does it become obsolete?
- AWS will continually upgrade its hardware while you keep paying the same.
- Hardware can be capitalized, which means you can push it to the balance sheet (for tax or valuation purposes).
- Spending $90k instead of $184k in year 1, with the option to turn it off when you no longer need it. This could be very valuable for a startup that wants elastic spending.

Hidden costs:

- Returns, breakage, and warranty handling in case of a hardware failure.

I understand why there is a market for this product, but it's not an apples-to-apples comparison. Generally speaking, if you know what your workload is going to be (and I'd be hard pressed to name many orgs that really do), then on-prem hardware is not a terrible choice, but it has to be analyzed appropriately.
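To make the year-1-versus-3-year trade concrete, here is a back-of-the-envelope sketch in Python. Every line item is an assumed placeholder, chosen only so the totals land near the figures quoted in this thread (roughly $184k for a 3-year AWS RI versus ~$115k on-prem); none of these numbers comes from the article itself.

```python
# Back-of-the-envelope 3-year TCO comparison.
# All figures are illustrative placeholders, not quotes from the article.
HOURS = 3 * 365 * 24  # hours in the 3-year horizon

on_prem = {
    "server_hardware": 80_000,     # one-time purchase of an 8x V100 box
    "colocation": 3 * 4_000,       # per-year rack space and bandwidth
    "energy": 3 * 2_500,           # a ~3 kW box at rough utility rates
    "part_time_admin": 3 * 5_000,  # fractional sysadmin time
}

aws_ri_hourly = 7.00               # assumed effective 3-year RI rate, $/hr
aws_total = aws_ri_hourly * HOURS
on_prem_total = sum(on_prem.values())

print(f"AWS 3-yr RI:  ${aws_total:,.0f}")
print(f"on-prem 3-yr: ${on_prem_total:,.0f}")
saving = aws_total - on_prem_total
print(f"saving:       ${saving:,.0f} ({saving / aws_total:.0%})")
```

With these assumptions the saving comes out near the $69k / 38% cited downthread; swap in your own quotes and the conclusion can easily flip.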
chx over 6 years ago

This false dichotomy between colo and AWS is just making me exhausted, 'cos I have been repeating this for so long: just rent a dedicated server. There surely are some cases where colo is the best choice, but as the years -- now a decade -- have passed since I started trying to spread this, it makes less and less sense every year, and it never made much in the first place. Maybe if you have several racks' worth of equipment? I am not familiar with that scale.
sudhirj over 6 years ago

Now if they'd only throw in S3 for pseudo-infinite data storage, reliable SQS for work-queue management, 10/25/100-gigabit networking between the instances, redundant power supplies and cooling, and racks in carefully selected stable locations, all for free, I'd buy a dozen!
bithavoc over 6 years ago

Buy servers if you have stable workloads; otherwise, rent virtual machines in the cloud.
Scaevolus over 6 years ago

If you're not deploying in a datacenter, you can save even more money by building a workstation with a few 2080 Ti cards, which cost $1,200 each and give 90% of the speed of the $3,000 Titan V: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/
elchief over 6 years ago

In my mind, the reasons to use something like AWS are to (a) get your servers in minutes instead of weeks and (b) easily right-size your service.

Once your service is somewhat stable in size and you can afford longer lead times, you should return to on-prem to save money.
ti_ranger over 6 years ago

1) How long does it take to get a Lambda Hyperplane operational from the point I place an order? A p3dn.24xlarge takes a few minutes; my experience deploying new hardware-based solutions is that it typically takes between 6 and 12 weeks.

2) How does the TCO compare when I only need to train for 2 hours a day?

3) If I was previously training on AWS p2.16xl, I could upgrade to p3.16xl at basically zero incremental cost (for the same workload). Does Lambda Labs offer free (zero-capex) upgrades? If so, how long would an upgrade take?
Thorrez over 6 years ago

"Save $69k" sounds a lot more significant than "save 38%".

If you're going to be using the server less than 73% of the time, AWS sounds better.
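That 73% threshold falls out of a simple break-even formula: owning wins once utilization exceeds the on-prem TCO divided by what the same hours would cost pay-as-you-go. A minimal sketch, where the $115k TCO and the ~$6/hr effective cloud rate are assumptions picked to reproduce the figure above:

```python
# Break-even utilization: above this fraction of wall-clock time,
# owning the box beats renting by the hour. Rates are assumptions.
def break_even_utilization(on_prem_tco: float,
                           cloud_hourly: float,
                           horizon_hours: int) -> float:
    return on_prem_tco / (cloud_hourly * horizon_hours)

three_years = 3 * 365 * 24
print(f"{break_even_utilization(115_000, 6.0, three_years):.0%}")  # ~73%
```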
manigandham over 6 years ago

Cloud computing was never about price; it was about the ability to provision and operate infrastructure instantly through an API.

If you can take advantage of that flexibility to build reactive capacity, then you can save money, but that wasn't the initial driving point.
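"Provision infrastructure instantly through an API" is quite literal. A minimal boto3 sketch; the AMI ID and key-pair name are placeholders you'd substitute with your own:

```python
# Launch a GPU instance on demand via the EC2 API.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: e.g. a Deep Learning AMI
    InstanceType="p3.16xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",             # placeholder key pair
)
print(response["Instances"][0]["InstanceId"])
```

The same call, wrapped in an autoscaler, is the "reactive capacity" the parent describes; no purchase order involved.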
dc_gregory over 6 years ago

I'm inexperienced on the hardware front: would that machine likely not break down under a heavy load for 3 years straight? Nothing is set aside for hardware failure, etc.
zten over 6 years ago

Who's running model training 24/7 to justify reserving this instance or co-locating your own hardware? (Apologies in advance for not being very imaginative.)

Their ImageNet timing fits within the bounds of a Spot Duration workload, so in the most optimistic scenario you can subtract 70% from the price, assuming spot availability for this instance type. (Of course, there are many model-training exercises that don't even remotely fit inside 6 hours.)
boulos over 6 years ago

Disclosure: I work on Google Cloud.

First, thanks for writing this up. Too many people just take a "buy the box, divide by the number of hours in 3 years" approach. Your comparison of a 3-year RI at AWS versus the hardware is thus fairer than most. You're still missing a lot of the opportunity cost (both capital and human), scaling limits (each of these is probably 3 kW, and most electrical systems couldn't handle, say, 20 of them), and so on.

That said, I don't agree that 3 years is a reasonable depreciation period for GPUs for deep learning (the focus of this analysis). If you had purchased a box full of P100s before the V100 came out, you'd have regretted it -- not just in straight price/performance, but also in operator time: a 2x speedup on training also yields faster time-to-market and/or more productive deep learning engineers (expensive!).

People still use K80s and P100s for their relative price/performance on FP64 and FP32 generic math (V100s come at a high premium for ML, and NVIDIA knows it), but for most deep learning you'd be making a big mistake. Even for FP32 work, parts with more memory or higher memory bandwidth keep arriving, which means you'd rather not be locked into a 36-month replacement plan.

If you really do want to do that, I'd recommend buying them the day they come out (AWS launched V100s in October 2017, so we're already 16 months in) to minimize the refresh regret.

tl;dr: unless the V100 is the perfect sweet spot in ML land for the next three years or so, a 3-year RI or a physical box will decline in utility.
freediver over 6 years ago

The quoted cost of running a cloud instance is inflated. The cheapest way to run them is using spot/interruptible instances, which will suffice for most deep learning jobs. There will be some upfront cost to set things up so that interruptions, storage, etc. are managed automatically. Also, by not limiting yourself to AWS you get many other options.

With this setup you can get 2x4x V100 on Azure for a total of $42k/year (assuming running 24/7).

Even if one spent $40k writing code for spot instance management, this is by far the cheapest solution for GPU compute, both short term and long term.

Source for the calculation: https://cloudoptimizer.io
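The "manage interruptions automatically" part mostly means reacting to the two-minute interruption notice EC2 publishes in instance metadata. A minimal sketch, assuming IMDSv1 is enabled (IMDSv2 would first require fetching a session token) and with the checkpoint routine left as a hypothetical stub:

```python
# Poll for the EC2 spot interruption notice and checkpoint before
# termination. The metadata path returns 404 until a notice exists.
import time
import requests

NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        return requests.get(NOTICE_URL, timeout=1).status_code == 200
    except requests.RequestException:
        return False

def save_checkpoint() -> None:
    # Hypothetical stub: persist model/optimizer state to durable storage.
    pass

while True:
    if interruption_pending():
        save_checkpoint()  # roughly two minutes of lead time remain
        break
    time.sleep(5)
```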
bubblethink over 6 years ago

p3dn.24xlarge's pricing makes no sense at all. It feels like AWS did it to pull off some PR/marketing stunt without any real users in mind. I've tried getting spot instances for it, but AWS just errors out, so they don't even have enough of them to allow spot instances. And it's a GPU machine, so the usual arguments about scaling up on demand or adapting to load don't really apply. You either have this use case or you don't, and if you do, just buy the hardware.
rb808 over 6 years ago

I bought a second-hand Xeon E3-1246 v3 (8 vCPUs) with 16GB of memory for $250 on eBay. That's less than it costs to rent an a1.xlarge for 6 months. Hardware is so cheap now, especially with SSDs and memory getting cheaper. Don't automatically rent!
raincom over 6 years ago

This makes sense only if your prospective clients want to "lift and shift" into the cloud. But lots of people are using AWS for its services, like S3, RDS, CloudFront, Route 53, etc.
purplezooey over 6 years ago

Damn: "includes hiring a part-time system administrator".
deepnotderp over 6 years ago

It may be useful to note that most deep learning training workloads are pretty latency-insensitive and pretty flat throughout the day.
mbell over 6 years ago

I'd be curious what the TCO is when factoring in storage, i.e., what replaces S3 for data storage in the colo setup?
canadev over 6 years ago

Interesting note about Lambda Labs: all of the press links on https://lambdalabs.com/?ref=blog are about a roughly "privacy-violating Google Glass app" that recognizes faces and geotags photos of them.

I don't see why they choose to promote that now.
iheartpotatoes over 6 years ago
I thought ads on HN were discouraged?
m0zg over 6 years ago

If it's really on-prem (i.e., not in a "datacenter" per the NVIDIA EULA), you could spend a lot less than $100k+ for a lot more throughput by purchasing consumer-grade cards and HEDT gaming hardware.

Sure, you'll have 4 GPUs per box instead of 8, and sure, each GPU will have 11GB instead of 32GB, but the whole machine (with the GPUs) will cost just a tad more than a single V100. So if you don't really need 32GB of VRAM per GPU (and you most likely don't), it'd be insane to pay literally 5x as much as you have to.
ringaroll over 6 years ago
Nice.
ilaksh over 6 years ago

This is just the most extreme example; AWS is just really expensive.

If you want a VPS, take a look at Digital Ocean or Linode.
hughesjo over 6 years ago

It's not a fair comparison unless we're comparing all-in costs that include ops.