AWS doesn't make sense for scientific computing

295 points by lebovic over 2 years ago

66 comments

walnutclosefarm over 2 years ago
Having had the responsibility of providing HPC for literal buildings full of scientists, I can say it may be true that you can get computation cheaper with owned hardware than in a cloud. Certainly pay-as-you-go, one-project-at-a-time processing will look that way to the scientist. But I can also say with confidence that the contest is far closer than they think. Scientists who make this argument almost invariably leave major costs out of their calculation - assuming they can put their servers in a closet, maintain them themselves, do all the security infrastructure, provide redundancy, and still get to shared compute when they have an overflow need. When the closet starts to smoke because they stuffed it with too many cheaply sourced, hot-running cores and GPUs, or gets hacked by one of their postdocs resulting in an institutional HIPAA violation, well, that's not their fault.

Put like for like in a well managed data center against negotiated and planned cloud services, and the former may still win, but it won't be dramatically cheaper, and figured over depreciable lifetime and including opportunity cost, may cost more. It takes work to figure out which is true.

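For illustration, a back-of-envelope 5-year TCO comparison of the kind this comment says scientists skip. Every figure below is a made-up placeholder, not a measured cost; substitute your own quotes and salaries:

```python
# Rough 5-year total-cost-of-ownership sketch; all numbers are assumptions.
hardware = 120_000             # servers + GPUs, purchased up front (assumed)
power_cooling = 0.12 * 30_000  # $/kWh * kWh/yr, per year (assumed)
admin_fraction = 0.25          # fraction of one sysadmin FTE (assumed)
admin_salary = 110_000         # fully loaded, per year (assumed)
datacenter_space = 6_000       # per year (assumed)
years = 5

on_prem_total = (hardware
                 + years * (power_cooling + admin_fraction * admin_salary
                            + datacenter_space))

cloud_monthly = 4_500          # negotiated, reserved pricing (assumed)
cloud_total = years * 12 * cloud_monthly

print(f"on-prem 5yr TCO: ${on_prem_total:,.0f}")
print(f"cloud 5yr TCO:   ${cloud_total:,.0f}")
```

The point is not which side wins with these invented numbers, but that the admin, power, and space lines often dominate the hardware line once they are written down.
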
pclmulqdq over 2 years ago
Even as a big cloud detractor, I have to disagree with this.

A lot of scientific computing doesn't need a persistent data center, since you are running a ton of simulations that only take a week or so, and scientific computing centers at big universities are a big expense that isn't always well utilized. Also, when they are full, jobs can wait weeks to run.

These computing centers have fairly high overhead, too, although some of that is absorbed by the university/nonprofit that runs them. It is entirely possible that this dynamic, where universities pay some of the cost out of your grant overhead, makes these computing centers synthetically cheaper for researchers when they are actually more expensive.

One other issue here is that scientific computing really benefits from ultra-low-latency InfiniBand networks, while the cloud providers offer something more similar to a virtualized RoCE system, which is a lot slower. That means accounting for cloud servers potentially being slower core-for-core.

KaiserPro over 2 years ago
It's much more complex than described.

The author is making a brilliant argument for getting a secondhand workstation and shoving it under their desk.

If you are doing multi-machine batch-style processing, then you won't be using on-demand; you'd use spot pricing. The missing argument in that part is storage costs. Managing a high-speed, highly available synchronous file system that can do a sustained 50 GB/sec is bloody hard work (no, S3 isn't a good fit - too much management overhead).

Don't get me wrong, AWS _is_ expensive if you are using a machine for more than a month or two.

However, if you are doing highly parallel stuff, Batch and Lustre on demand are pretty ace.

If you are doing a multi-year project, then real steel is where it's at - assuming you have factored in hosting, storage, and admin costs.

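A minimal sketch of the Batch-plus-spot pattern described above, using boto3. The queue and job-definition names are hypothetical, and it assumes a spot-backed compute environment and job definition already exist:

```python
import boto3

batch = boto3.client("batch")

# Submit a 1,000-task array job to a queue backed by a spot compute
# environment ("sim-spot-queue" and "sim-job:3" are made-up names).
resp = batch.submit_job(
    jobName="param-sweep",
    jobQueue="sim-spot-queue",
    jobDefinition="sim-job:3",
    arrayProperties={"size": 1000},
    containerOverrides={
        # each child task reads AWS_BATCH_JOB_ARRAY_INDEX to pick its work item
        "command": ["python", "run_sim.py"]
    },
)
print("submitted:", resp["jobId"])
```
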
danking00 over 2 years ago
I think this post identifies scientific computing with simulation studies and legacy workflows, to a fault. Scientific computing includes those things, but it _also_ includes interactive analysis of very large datasets as well as workflows designed around cloud computing.

Interactive analysis of large datasets (e.g. genome & exome sequencing studies with 100s of 1000s of samples) is well suited to low-latency, serverless, & horizontally scalable systems (like Dremel/BigQuery, or Hail [1], which we build and which is inspired by Dremel, among other systems). The load profile is unpredictable because after a scientist runs an analysis they need an unpredictable amount of time to think about their next step.

As for productionized workflows, if we redesign the tools used within these workflows to directly read and write data to cloud storage as well as to tolerate VM preemption, then we can exploit the ~1/5 cost of preemptible/spot instances.

One last point: for the subset of scientific computing I highlighted above, speed is key. I want the scientist to stay in a flow state, receiving feedback from their experiments as fast as possible, ideally within 300 ms. The only way to achieve that on huge datasets is through rapid and substantial scale-out followed by equally rapid and substantial scale-in (to control cost).

[1] https://hail.is

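A minimal sketch of the preemption-tolerance idea: checkpoint progress to object storage so a reclaimed spot VM resumes instead of restarting. The bucket, key, and per-sample function are hypothetical:

```python
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, KEY = "my-analysis-bucket", "checkpoints/sweep.json"  # hypothetical

def process_sample(i):
    pass  # placeholder for the real per-sample work

def load_checkpoint():
    """Resume point from a previous (possibly preempted) run, else start at 0."""
    try:
        body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        return json.loads(body)["next_index"]
    except ClientError:
        return 0

def save_checkpoint(next_index):
    s3.put_object(Bucket=BUCKET, Key=KEY,
                  Body=json.dumps({"next_index": next_index}))

start = load_checkpoint()
for i in range(start, 10_000):
    process_sample(i)
    if i % 100 == 0:  # checkpoint often enough to bound lost work
        save_checkpoint(i + 1)
```
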
zmmmmm over 2 years ago
A missing element for me is that with a lot of exploratory scientific work we (often half intentionally) have no idea what we are doing. We can easily run a giant job that uses 100x more compute than expected. Yes, you can limit cloud compute resources if you are smart, but it's much better if the default is that you run out of compute and your job takes longer, rather than that you get a 100x bill from your cloud provider. And if you limited your cloud resources to a fixed amount, didn't you just eliminate half the benefit of using cloud in the first place?

Then the problems of data management, transfer, and egress are huge. Again the "no idea what we are doing" factor comes into play. If you have a really good idea up front of what is going to happen, you can plan out a strategy that minimises costs. But if you have no idea at all - genuinely, because this is science and we are doing new things - then you could end up blowing huge amounts of money on unnecessary egress and storage costs. And at the small end we can be talking about experiments run on a shoestring where a few thousand dollars is a big deal.

The way I see it, we need everything - powerful individual workstations / laptops for direct analysis, then a tier of fixed HPC-style compute for this kind of work that is poorly matched to cloud, and then for specific projects where it makes sense (massive scaleout, exotic hardware needs - GPUs, FPGAs, etc.) you embrace cloud resources for that.

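Cloud budgets don't hard-stop spend by default, but alerts are cheap to set up. A sketch using boto3 and the AWS Budgets API; the budget name, cap, and email address are placeholders:

```python
import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
budgets = boto3.client("budgets")

# Email an alert when actual monthly spend crosses 80% of a $1,000 cap.
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "exploratory-compute",  # placeholder name
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL",
                         "Address": "pi@example.edu"}],  # placeholder
    }],
)
```

Note this only notifies; it does not stop the runaway job, which is exactly the commenter's complaint.
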
ordiel over 2 years ago
Having worked for 2 of the largest cloud providers (one of them being the largest), I have to say "The Cloud" just doesn't make sense yet for most use cases (maybe with the exception of cloud storage). This includes startups and small and mid-size companies: it's just way too expensive for the benefits it provides. It moves your hardware acquisition/maintenance cost to development costs; you just think it's better/cheaper because that cost comes in small monthly chunks rather than as a single bill. Plus you add all the security risks, either those introduced by the vendor or those introduced by the massive complexity and poor training of the developers - which, if you want to avoid them, you will have to pay for by hiring a developer competent in security for that particular cloud provider.

aschleck over 2 years ago
This is sort of a confusing article because it assumes the premise of "you have a fixed hardware profile" and then argues within that context ("Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers"). Of course if you're getting 100% utilization then you'll find better raw pricing (and this article conveniently leaves out staffing costs), but this model misses one of the most powerful parts of cloud providers: autoscaling. Why would you want to waste scientist time by making them wait in a queue when you can instead just autoscale as high as needed? Giving scientists a tight iteration loop will likely be the biggest cost reduction and also the biggest benefit. And if you're doing that on prem then you need to provision for the peak load, which drives your utilization down and makes on prem far less cost effective.

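A toy calculation of the peak-provisioning effect this comment describes; the load numbers are invented for illustration:

```python
# On prem you buy for the peak; the cloud bills only for what runs.
peak_cores = 2_000   # burst demand during deadlines (assumed)
mean_cores = 300     # average demand across the year (assumed)

on_prem_utilization = mean_cores / peak_cores
print(f"on-prem utilization: {on_prem_utilization:.0%}")  # 15%

# If a core-hour costs 2 cents on prem at full use, idle capacity
# inflates the effective price by 1/utilization:
base_cost = 0.02
print(f"effective on-prem $/core-hr: {base_cost / on_prem_utilization:.3f}")
# An autoscaled cloud core-hour can cost several times the on-prem base
# rate and still break even against 15% utilization.
```
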
jupp0r over 2 years ago
Having worked in the high performance computing field and in cloud-hosted commercial applications, I can agree with the article, but for entirely different reasons. The reason why some scientific computing shouldn't be done on AWS has to do with networking and latency between compute nodes. Supercomputers often use specialized networking hardware to get single-digit-microsecond latencies for data transfer between compute nodes, and much higher network bandwidth than what you would normally find between EC2 nodes. This allows simulations to efficiently operate on really large data sets that span hundreds or thousands of nodes. The network topology between these nodes is often denser than a tree (think a 2D or 3D grid topology) and offers shorter paths between nodes.

All of this allows you to run code that you can't run in AWS unless it fits on one computer only. It's also way more expensive than clusters of commodity hardware.

For problems that are trivially parallelizable without much communication between nodes - I don't think most universities can actually operate those cheaper than renting them from cloud computing services. A lot of these calculations don't take into account the staff to operate data centers, the cost of the building itself, or the opportunity cost of using lots of space for this purpose vs something else. Economies of scale also kick in here. It's way cheaper per computer for AWS to admin a data center because they do this for orders-of-magnitude bigger data centers than your typical university.

aBioGuy over 2 years ago
Furthermore, scientific computing often (usually?) involves trainees. It can be difficult to train people when small mistakes can lead to five-figure bills.

idiot900 over 2 years ago
This rings true for me. I have a federal grant that prohibits me from using its funds for capital acquisitions, i.e. servers. But I can spend it on AWS at massive cost for minimal added utility for my use case. Even though it would be a far better use of taxpayer funds to buy the servers, I have to rent them instead.

tejtm over 2 years ago
Cloud has never made sense for scientific computing. Renting someone else's big computer makes good sense in a business setting, where you are not paying for your peak capacity when you are not using it, and you are not losing revenue by underestimating whatever peak capacity the market happens to dictate.

For business, outsourcing the compute cost center eliminates both cost and risk for a big win each quarter.

Scientists never say, "Gee, it isn't the holiday season, guess we better scale things back."

Instead they will always tend to push whatever compute limit there is; it is kinda in the job description.

As for the grant argument, that is letting the tool shape the hand.

Business-science is not science; we will pay now or pay later.

bluedino over 2 years ago
We have a 500-node cluster at a chemical company, and we've been experimenting with "hybrid cloud". This allows jobs to use servers with resources we just don't have, or couldn't add fast enough.

Storage is a huge issue for us. We have a petabyte of local storage from a big-name vendor that's bursting at the seams and expensive to upgrade. A lot of our users leave big files lying around for a long time. Every few months we have to hound everyone to delete old stuff.

The other thing you get with the cloud is that there's way more accountability for who's using how many resources. Right now we just let people have access and roam free. Cloud HPC is 5-10x more in cost, and the beancounters would shut shit down real quick if the actual costs were divvied up.

We also still have a legacy datacenter, so in a similar vein, it's hard to say how much not having to deal with physical hardware/networking/power/bandwidth would be worth. Our work is maybe 1% of what that team does.

rpep over 2 years ago
I think there are some things this misses about the scientific ecosystem in universities etc. that can make the cloud more attractive than it first appears:

* If you want to run really big jobs, e.g. with multiple multi-GPU nodes, this might not even be possible depending on your institution or your access. Most research-intensive universities have a cluster, but they're not normally big machines. For regional and national machines, you usually have to bid for access for specific projects, and you might not be successful.

* You have control of exactly what hardware and OS you want on your nodes. Often you're using an out-of-date RHEL version, and despite spack and easybuild gaining ground, all too often you're given a compiler and some old versions of libraries and that's it.

* For many computationally intensive studies, your data transfer actually isn't that large. For example, you can often do the post-processing on-node and then only get aggregate statistics about simulation runs out.

captainmuon over 2 years ago
A former colleague did his PhD in particle physics with a novel technique (the matrix element method). I can't really explain it, but it is extremely CPU intensive. That working group did it on CERN's resources, and they had to borrow quotas from a bunch of other people. For fun they calculated how much it would have cost on AWS and came up with something ridiculous like 3 million euros.

Helmut10001 over 2 years ago
I almost got a tenure-track position at a data science faculty in Virginia, and I think their not having an HPC was the single issue that blew this move (from both sides). During interviews, I asked the dean how they set up their HPC - turned out, they hadn't. I then asked a professor in the next review round how they teach their students without an HPC:

> "I buy all resources on AWS - it's painful because I have to contact AWS almost monthly for accidental over-billing, but we don't have a solution."

All of this made me really sceptical, since coming from a big university in Germany, we have unlimited HPC resources for free. I have 16 VMs, the biggest one with 125 GB of memory, and I can set them up or move them around however I want. No space limitations - need 10 TB of space for 3 months? Open a service ticket; 3 hours later it's available. Ports need to be opened worldwide to the web? No problem. Need a JupyterHub cluster on Kubernetes? Here you go. This has really improved my work (quality, performance, and convenience) so much.

I was once coordinator of a research project where we had 30k EUR left and didn't know what to do with it. I contacted our HPC and asked if they wanted the money - answer: "30k really isn't worth the effort, we don't know what to do with it atm."

julienchastang over 2 years ago
I've also been skeptical of the commercial cloud for scientific computing workflows. I don't think this cost-benefit analysis mentions it, but the commercial cloud makes even less sense when you take into account brick-and-mortar considerations. In other words, if your company/institution has already paid for the machine rooms, sysadmins, networks, and the physical buildings, the commercial cloud is even less appealing. This is especially true with "persistent services", for example data servers that are always on because they handle real-time data.

Another aspect of scientific computing on the commercial cloud that's a pain if you work in academia is procurement, or paying for the cloud. Academic groups are much more comfortable with the grant model. They often operate on shoestring budgets and are simply not comfortable entering a credit card number. You can also get commercial cloud grants, but they often lack long-term, multi-year continuity.

mbreese over 2 years ago
I completely agree for most cases. In many scientific computing applications, compute time isn't the factor you prioritize in the good/fast/cheap triad. Instead, you often need to do things as cheaply as possible. And your data access isn't always predictable, so you need to keep results around for an extended period of time. This makes storage costs a major factor. For us, this alone was enough to move workloads away from cloud and onto local resources.

prpl over 2 years ago
Actually, compute is fine for most use cases (spot instances, preemptible VMs on GCP) and has been used in lots of situations, even at CERN. Where it also excels is if you need any kind of infrastructure, because no HPC center has figured out a reasonable approach to that (some are trying with k8s). Also, obviously, you get a huge selection of hardware.

Where cloud/AWS doesn't make sense is storage, especially if you need egress, and if you actually need IB (InfiniBand).

fwip over 2 years ago
The killer we've seen is data egress costs. Crunching the numbers for some of our pipelines, we'd actually be paying more to get the data out of AWS than to compute it.

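For a sense of scale, a quick back-of-envelope on egress. The dataset size is borrowed from the article's 90 TB sequencing example; the per-GB rate is the commonly cited ~$0.09 internet-egress tier, so treat both as approximations:

```python
# Approximate cost to download one sequencing project's output from AWS.
dataset_tb = 90
egress_per_gb = 0.09  # ~first internet-egress pricing tier (approximate)
cost = dataset_tb * 1_000 * egress_per_gb
print(f"egress for {dataset_tb} TB: ~${cost:,.0f}")  # ~$8,100
```
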
bsenftner over 2 years ago
This is the case for a large class of big-data + high-compute applications. Animation and simulation in engineering, planning, and forecasting, not to mention entertainment, require pipelines the typical cloud is simply too expensive to use.

zatarc over 2 years ago
Why does no one consider colocation services anymore?

And why do people only know Hetzner, OVH, and Linode as alternatives to the big cloud providers?

There are so many good and inexpensive server hosting providers, some with decades of experience.

slaymaker1907 over 2 years ago
Cloud worked really well for me when I was in school. A lot of the time, I would only need a beefy computer for a few hours at a time (often due to high memory usage), and you can/could rent spot instances for very cheap. There are about 730 hours per month, so the cost calculus is very different for a student/researcher who needs fast turnaround times (high performance), but only for a short period of time.

However, I know not all HPC/scientific computing works that way, and some workloads are much more continuous.

secabeen over 2 years ago
The general rule of thumb in the HPC world is that if you can keep a system computing for more than 40% of the time, it will be cheaper to buy.

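The rule of thumb falls out of simple arithmetic. A sketch with invented prices (swap in real quotes to get your own break-even point):

```python
# Break-even utilization: buying wins when expected use exceeds it.
server_5yr_tco = 60_000  # purchase + power + admin over 5 years (assumed)
hours_5yr = 5 * 365 * 24
owned_per_hour = server_5yr_tco / hours_5yr  # cost of an owned hour at 100% use

cloud_per_hour = 3.50    # comparable on-demand instance (assumed)

break_even = owned_per_hour / cloud_per_hour
print(f"owned $/hr at 100% use: {owned_per_hour:.2f}")
print(f"break-even utilization: {break_even:.0%}")
# With these invented numbers the owned box wins above ~39% utilization,
# which lands right around the 40% rule of thumb.
```
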
dastbe over 2 years ago
Using on-demand for latency-insensitive work, especially when you're also very cost sensitive, isn't the right choice. Spot instances will get you somewhere in the realm of the Hetzner/on-prem numbers.

Moissanite over 2 years ago
This has been my exact field of work for a few years now; in general I have found that:

When people claim it is 10x more expensive to use public cloud, they have no earthly idea what it actually costs to run an HPC service, a data centre, or do any of the associated maintenance.

When the claim is 3x more expensive in the cloud, they do know those things but are making a bad-faith comparison because their job involves running an on-premises cluster and they are scared of losing their toys.

When the claim is 0-50% more to run in the cloud, someone is doing the math properly and aiming for a fair comparison.

When the claim is that cloud is cheaper than on-prem, you are probably talking to a cloud vendor account manager whose colleagues are wincing at the fact that they just torched their credibility.

thayne over 2 years ago
> Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers

That sounds very much like an argument _for_ a cloud. Instead of waiting months to do your processing, you spin up what you need, then tear it down when you are done.

adamsb6 over 2 years ago
I've never worked in this space, but I'm curious about the need for massive egress. What's driving the need to bring all that data back to the institution?

Could whatever actions have to be performed on the data also be performed in AWS?

Also, while briefly looking into this I found that AWS has an egress waiver for researchers and educational institutions: https://aws.amazon.com/blogs/publicsector/data-egress-waiver-available-for-eligible-researchers-and-institutions/

Fomite over 2 years ago
One of the aspects not touched on here is PII/confidential data/HIPAA data, etc.

For that, whether it makes sense or not, a lot of universities are moving to AWS, and the infrastructure cost of AWS for what would be a pretty modest server is still considerably less than the cost of complying with the policies and regulations involved.

Recently at my institution I asked about housing it on premise, and the answer was that IT supports AWS, and if I wanted to do something else, supporting that - as well as the responsibility for a breach - would rest entirely on my shoulders. Not doing that.

citizenpaul over 2 years ago
No one seems to consider colo data centers as even an option anymore?

somesortofthing over 2 years ago
The author makes a convincing argument against doing this workload on on-demand instances, but what about spot instances? AWS explicitly calls out scientific computing as a major use case for spot in its training/promotional materials. Given the advertised ~70-90% markdown on spot instance time, it seems like a great option compared to paying almost the same amount as the workstation, while not having to pay to buy, maintain, or replace the hardware.

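The actual discount is easy to check empirically. A boto3 sketch that pulls recent spot prices for one instance type; the region and instance type are arbitrary choices, not recommendations:

```python
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # arbitrary region

# Recent spot prices for a compute-heavy instance type (arbitrary choice).
history = ec2.describe_spot_price_history(
    InstanceTypes=["c5.24xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
)
for h in history["SpotPriceHistory"][:5]:
    print(h["AvailabilityZone"], h["SpotPrice"])
# Compare against the on-demand rate for the same type to see the
# effective markdown in your region right now.
```
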
twawaaay over 2 years ago
On the other hand, it makes sense if you just need to borrow their infrastructure for a while to calculate something.

A lot of scientific computing isn't happening continuously; a lot of it is a one-time experiment, or maybe a couple of runs, after which you would have to tear down and reassign.

Another fun fact people forget: our ability to predict the future is still pretty poor. Not only that, we are biased towards thinking we can predict it when in fact that is complete bullshit.

You have to buy and set up infrastructure before you can use it, and then you have to be ready to use it. What if you are not ready? What if you will not need as many resources? What if you stop needing it earlier than you thought? When you borrow it from AWS you have the flexibility to start using it when you are ready and drop it immediately when you no longer need it. Which has value on its own.

At the company I work for, we basically banned signing long-term contracts for discounts. We found that, on average, we paid many times more for unused services than whatever we gained through discounts. Also, when you pay for the resources there is an incentive to improve efficiency. When you have basically prepaid for everything, that incentive is very small and is basically limited to making sure you stay within limits.

renewiltord over 2 years ago
> Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers - that's the API equivalent of storing your inbound API requests, and then responding to them months later

Makes sense if the jobs are all low urgency.

We have a similar problem in trading, so we have a composite solution with non-cloud simulation hardware and additional AWS hardware. That's because we have the high-utilization problem combined with high urgency.

kortex over 2 years ago
What does the landscape look like now for "terraform for bare metal"? Is ansible/chef still the main name in town? I just wanna netboot some lightweight image, set up some basic network discovery on a control plane, and turn every connected box into a flexible worker bee I can deploy whatever cluster control layer (k8s/nomad) on top of and start slinging containers.

bee_rider over 2 years ago
Is genomic code typically distributed-memory parallel? I'm under the impression that it is more like batch processing: not a ton of node-to-node communication, but you want lots of bandwidth and storage.

If you are doing a big distributed-memory numerical simulation, on the other hand, you probably want InfiniBand, I guess.

AWS seems like an OK fit for the former, maybe not great for the latter...

COGlory over 2 years ago
> a month-long DNA sequencing project can generate 90 TB of data

Our EM facility generates 10 TB of raw data per day, and once you start computing on it, that increases by 30-50% depending on what you do with it. Plus, moving between network storage and local scratch for computational steps basically never ends and keeps multiple 10 GbE links saturated 100% of the time.

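To see why those links stay saturated, a quick calculation using wire rates only (ignoring protocol overhead):

```python
# Time to move one day's raw EM output over a single 10 GbE link.
daily_tb = 10
link_gbps = 10                 # 10 GbE
link_gb_per_s = link_gbps / 8  # 1.25 GB/s at line rate

seconds = daily_tb * 1_000 / link_gb_per_s
print(f"{seconds / 3600:.1f} hours per 10 TB per link")  # ~2.2 hours
# Each intermediate copy (storage <-> scratch) repeats the transfer,
# so a handful of round trips per dataset keeps links busy all day.
```
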
wenbin over 2 years ago
A trend I've seen on HN over the past few years is that people love showing off how they are able to save money by spending more of their own time, especially on infra/cloud things. If you calculate your own hourly rate correctly, it's often more costly to DIY than to outsource to experts (e.g., managed cloud).

Areading314 over 2 years ago
AWS is one tool, but it's a lot like the proprietary computing ecosystems that have existed for a long time (remember the Micro$oft days?). It offers convenience in return for lock-in and very high margins. There's no clear answer, but it's definitely not a clear-cut decision where AWS is guaranteed to save money.

There are two major costs that are overlooked by the in-house crowd: operational maintenance (an increasingly rare and expensive skillset), and the cost of downtime - how much does it cost you when your team of data scientists is blocked because of a failed OS update, etc.? That being said, hiring competent people to maintain AWS properly isn't cheap either - and it is quite easy to start running up very wasteful AWS bills on things you don't need.

As always there's a tradeoff - the key is to choose a path and to execute it well.

jrm4 over 2 years ago
I imagine what makes this especially hard is that you have (at least) three parties in play here:

- the people doing the research

- the institution's IT services group

- the administrator who writes the checks

And in my experience, "actual knowledge of what must be done and what it will or could cost" can vary greatly across these three groups, frequently in very unintuitive ways.

CreRecombinase over 2 years ago
These MPI-based scientific computing applications make up the bulk of the compute hours on HPC clusters, but there is a crazy long tail of scientists who have workloads that can't (or shouldn't) run on their personal computers. The other option is HPC. This sucks for a ton of reasons, but I think the biggest one is that it's more or less impossible to set up a persistent service of any kind. So no databases; if you want Spark, be ready to spin it up from nothing every day (also no HDFS unless you spin that up in your SLURM job too). This makes getting work done harder, but it also makes integrating existing work so much harder, because everyone's workflow involves reinventing everything, and everyone does it in subtly incompatible ways; there are no natural (common) abstraction layers because there are no services.

0xbadcafebee over 2 years ago
AWS is _fantastic_ for scientific computing. With it you can:

- Deploy a thousand servers with GPUs in 10 minutes, churn over a giant dataset, then turn them all off again. Nobody ever has to wait for access to the supercomputer.

- Automatically back up everything into cold storage over time with a lifecycle policy.

- Avoid the massive overhead of maintaining HPC clusters, labs, data centers, additional staff and training, capex, load estimation, and the months/years of advance planning needed to be ready to start computing.

- Automate via APIs, enabling very quick adaptation with little coding.

- Use an entire universe of services which ramp up your capabilities to analyze data and apply ML without needing to build anything yourself.

- Use a marketplace of B2B and B2C solutions to quickly deploy new tools within your account.

- Share data with other organizations easily.

AWS costs are also "retail costs". There are massive savings to be had quite easily.

wenc over 2 years ago
Calculating costs based on sticker price is sometimes misleading because there's another variable: negotiated pricing, which can be much, much lower than sticker prices, depending on your negotiating leverage. Different companies pay different prices for the same product.

If you've ever worked at a big company or university (any place where you spend at scale), you'll know you rarely pay sticker price. Software licensing is particularly elastic because it's almost zero marginal cost. Raw cloud costs are largely a function of energy usage and amortized hardware costs - there's a certain minimum you can't go under, but there remains a huge margin that is open to negotiation.

Startups/individuals rarely even think about this because they rarely qualify. But big orgs with large spends do. You can get negotiated cloud pricing.

didip over 2 years ago
No way. I vehemently disagree.

When a company reaches a certain mass, hardware cost is a factor that is considered, but not a big one.

The bigger problems are lost opportunity costs and unnecessary churn.

Businesses lose a lot when a product launch is delayed by a year simply because the hardware arrived late or had too many defects (ask your hardware fulfillment people how many defective RAM sticks and SSDs they get per new shipment).

Churn can cost the business a lot as well. For example, imagine the model that everyone has been using is trained on a Mac Pro under XYZ's desk. And then when XYZ quits, they never properly backed up the code and the model.

Bare metal allows for sloppiness that the cloud cannot afford to allow. Accountability and ownership are a lot more apparent in the cloud.

osigurdson over 2 years ago
There is a lot of discussion about supercomputers in this article. I don't think public cloud providers can easily compete with traditional supercomputers, because the latter are built for optimal processing of extremely large-scale MPI workloads. Such workloads are not common, so I expect public cloud providers wouldn't bother optimizing for this niche use case (though I know they all have offerings). Also, when you are only optimizing for a single variable (i.e. speed), you can make design choices that would be impossible in a more general situation.

Of course, not all scientific computing workloads require a traditional supercomputer. In fact, I suspect most do not.

Mave83 over 2 years ago
I agree with the article. We at croit.io support customers around the globe in building their clusters and saving huge amounts. For example, compared to AWS S3, Ceph S3 in any data center of your choice is around 1/10 of the price.

lowbloodsugar over 2 years ago
> Even 2.5x over building your own infrastructure is significant for a $50M/yr supercomputer.

Can't imagine you are paying public prices on any cloud provider if you have a $50M/yr budget.

In addition, if, as the article states, the scientists are OK with waiting some considerable time for results, then one can run most, if not all, of the work on spot instances, and that can save 10x right there.

If you don't have $50M/yr, there are companies that will move your workload around different AWS regions to get the best price - and will factor in the cost of transferring the data too.

I was an architect at a large scientific company using AWS.

wistlo over 2 years ago
Database analyst for a large communication company here.

I have similar doubts about AWS for certain kinds of intensive business analysis. Not API-based transactions, but back-office analysis where complex multi-join queries are run in sequence against tables with tens of millions of records.

We do some of this with SQL servers running right on the desktop (and one analyst still uses Excel with VLOOKUP). We have a pilot project to try these tasks in a new Azure instance. I look forward to seeing how it performs, and at what cost.

nharada over 2 years ago
I'd love to buy my own servers for small-scale (i.e. startup-size or research-lab-size) projects, but it's very hard to utilize them 24x7. Does anyone know of open-source software or tools that allow multiple people to timeshare one of these? A big server full of A100s would be awesome, with the ability to reserve the server on specific days.

stuntkite over 2 years ago
If you pay $500 to form an LLC with Stripe Atlas, you get $10,000 worth of AWS credits that can be used any way you want. It's a pretty solid way to do scientific computing cost-effectively, even if you need like five companies.

If a policy change is made because of this comment, I'm sorry. For sure let me know though; I'll put it on my resume.

avereveard over 2 years ago
> Hardware is amortized over five years

Hardware running at 100% won't last five years.

If the hardware doesn't need to run at full steam for five years, you can turn down instances in the cloud and you don't pay anything.

In two years you'll be stuck with the same hardware, while in the cloud you follow CPU evolution as it arrives at the provider.

All in all, the comparison is too high-level to be useful.

bgro over 2 years ago
When I was looking at AWS for personal use, I first thought it was oddly expensive even when factoring in not having to buy the hardware. When I looked at just what the electricity to run it myself would cost, I think that addition alone meant AWS was actually cheaper. This is without factoring in cooling / dedicated space / maintenance.

fulafel over 2 years ago
Most of the comments get fixated on the most and least expensive options here (AWS, where you pay through the nose, vs your own DC, where you'll get bad service from your institution and have to fight with hardware procurement). What about the middle options presented, the more reasonably priced cloud/rental server providers?

snorkel over 2 years ago
Buying your own fleet of dedicated servers seems like a smart move in the short term, but five years from now you'll get someone on the team insisting that they need the latest, greatest GPU to run their jobs. Cloud providers give you the option of using newer chipsets without having to re-purchase your entire server fleet every five years.

gammarator over 2 years ago
Astronomy is moving more and more to cloud computing:

https://www.nature.com/articles/d41586-020-02284-7

https://arxiv.org/abs/1907.06320

pacerier over 2 years ago
Very odd article (to be written by a scientist) - shouldn't it be comparing with GCE? It doesn't make sense to compare on cost against AWS instead of GCP, except... for wow numbers and moar clicks.

"LinkedIn doesn't make sense for connecting with friends."

epberry over 2 years ago
I can tell you that NASA is in the midst of a multi-year effort to move its computing to AWS, and that yes, downloading 324 terabytes of data is very expensive, but very soon all this data will just remain in the cloud and be accessed virtually.

thesausageking over 2 years ago
I'm suspicious of the author's actual experience.

The fact that scientific computing has a different pattern than the typical web app is actually a good thing. If you can architect large batch jobs to use spot instances, it's 50-80% cheaper.

Also, this bit - "you can keep your servers at 100% utilization by maintaining a queue of requested jobs" - isn't true in practice. The pattern of research is that the work normally comes in waves. You'll want to train a new model or run a number of large simulations. Then there will be periods of tweaking and work on other parts. And then more need for a lot of training. Yes, you can always find work to put on a cluster to keep it at >90% utilization, but if compute can be elastic (and has budget attached to it), demand will rise and fall.

myuzio over 2 years ago
I'm surprised no one has mentioned Amazon Lightsail by now. Anyway, yes, AWS can be super expensive, especially if you don't know what you're doing, regardless of the type of processing.

jerjerjer over 2 years ago
Sure? I mean, if you have:

1) A large enough queue of tasks

2) Users/downstream consumers willing to wait

then using your own infrastructure always wins (assuming free labor), since you can load your own infrastructure to ~95% pretty much 24/7, which is unbeatable.

milesward over 2 years ago
The article lost me at "they'll just wait longer". Get out of line. Run what you need, when you need it, and get back to the science, the good part!

betolink over 2 years ago
I see both sides of the argument; there is a reason why CERN is not processing their data using EC2 and Lambdas.

latchkey over 2 years ago
I read this as a thinly veiled advertisement for the author's service, Toolchest.

timeu over 2 years ago
As an HPC sysadmin for 3 research institutes (mostly life sciences & biology), I can't see how a cloud HPC system could be any cheaper than an on-prem HPC system, especially if I look at the resource efficiency (how many resources were requested vs how many were actually used) of our users' SLURM jobs. Often the users request 100s of GB but only use a fraction of it. In our on-prem HPC system this might decrease utilization (which is not great), but in the cloud it would result in increased computing costs (because of a bigger VM flavor), which would probably be worse (CapEx vs OpEx). Of course you could argue that the users should know better and properly size/measure their resource requirements; however, most of our users have a lab background and are new to computational biology, so estimating or even knowing what all the knobs of the job specification (cores, mem per core, total memory, etc.) mean is hard for them. We try to educate by providing trainings and job efficiency reporting; however, the researchers/users have little incentive to optimize their job requests and are more interested in quick results and turnover, which is also understandable (the on-prem HPC system is already paid for). Maybe the cost transparency of the cloud would force them, or rather their group leaders/institute heads, to put a focus on this, but until you move to the cloud you won't know.

Additionally, the typical workloads that run on our HPC system are often some badly maintained bioinformatics software or R/perl/python throwaway scripts, and often enough a typo in the script causes the entire pipeline to fail after days of running, and it needs to be restarted (maybe even multiple times). Again, on the on-prem system you have wasted electricity (bad enough), but in the cloud you have to pay the computing costs of the failed runs. Again, cost transparency might force people to fix this, but the users are not software engineers.

One thing that the cloud is really good at is elasticity and access to new hardware. We have seen, for example, a shift of workloads from pure CPUs to GPUs. A new cryo-EM microscope was installed where the downstream analysis relies heavily on GPUs, more and more research groups run AlphaFold predictions, and NGS analysis is now using GPUs too. We have around 100 GPUs, average utilization has increased to 80-90%, and the users are complaining about long waiting/queueing times for their GPU jobs. For this, bursting to the cloud would be nice; however, GPUs are prohibitively expensive in the cloud, unfortunately, and the above-mentioned caveats regarding job resource efficiency still apply.

One thing that will hurt on-prem HPC systems, though, is increased electricity prices. We are now taking measures to actively save energy (i.e. by powering down idle nodes and powering them up again when jobs are scheduled). As far as I can tell, the big cloud providers (AWS, etc.) haven't increased their prices yet, either because they cover the electricity cost increase with their profit margins, or because they are not affected as much, having better deals with electricity providers.

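A small illustration of why over-requested memory costs real money in the cloud but not on-prem. The per-GB-hour price is invented; the request/usage ratio mirrors the pattern described above:

```python
# A job asks for 200 GB of memory but really uses 20 GB.
requested_gb, used_gb = 200, 20
hours = 24

# On-prem: the over-request wastes capacity but adds no direct bill.
# Cloud: the memory you request determines the VM flavor you pay for.
price_per_gb_hour = 0.005  # invented linear memory pricing (assumption)
billed = requested_gb * hours * price_per_gb_hour
needed = used_gb * hours * price_per_gb_hour
print(f"billed: ${billed:.2f}, actually needed: ${needed:.2f} "
      f"({billed / needed:.0f}x overpayment)")
```
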
alpineidyll3 over 2 years ago
Well put. My company is currently suffering through this...

jiggawatts over 2 years ago
I've just spent the last week investigating cloud compute options for a lab that needs to run bioinformatics / genomics algorithms.

First off, the pricing in the article is so disingenuous as to be outright deception.

Here is the spot price for Azure HB120rs_v2, a popular HPC size with 120 AMD EPYC cores and 456 GB of RAM: https://azureprice.net/vm/Standard_HB120rs_v2?tier=spot&currency=USD

This is less than $300/month for 2.5x the compute capacity he's referencing! The author's estimate is $200/month for an on-prem server with just 48 cores. Scaled down to that level, the equivalent in cloud spot pricing would be $120.

That's assuming on-prem is 100% utilised and the cloud compute is not auto-scaled. If those assumptions are lifted, the cloud is _much_ cheaper.

The cloud makes sense in several other ways also:

- Once the data is in cloud storage like S3 or Azure Storage Accounts, sharing it with government departments, universities, or other research institutes is trivial. Just send them a SAS URL and they can probably download it at 1 GB/s without killing the Internet link at the source.

- Many of these processes have 10 GB inputs that produce about 1 TB of output due to all the intermediate and temporary files. These are often kept for later analysis, but they're of low value and go cold very quickly. Tiered storage in the cloud is very easy to set up and dirt cheap compared to on-prem network-attached storage. These blobs can be moved to "Cold" storage within a few days, and then to "Archive" within a month or two at most.

- The algorithms improve over time, at which point it would be oh-so-nice to be able to re-run them over the old multi-petabyte data sets. But on-prem this is an extravagance and needs a lot of justification. In the cloud, you can just spin up a large pool of spot instances with a low price cap and let it chunk through the old data when it can. Unlike on-prem, this can read the old data in _much_ faster, easily up to 30-100 Gbps in my tests. Good luck building a disk array that can stream 100 Gbps _and also_ have good performance for high-priority workloads!

- The hardware is evolving much more rapidly than typical enterprise purchase cycles. We have a customer that is about to buy one (1) NVIDIA A100 GPU to use for bioinformatics. In a matter of months it'll be superseded by the NVIDIA "Hopper" H100 series, which is 7x faster for the same genomics codes. In the cloud, both AWS and Azure will soon have instances with four H100 cards in them. That'll be 28 times faster than one A100 card, making the on-prem purchase obsolete years before the warranty runs out. A couple of years later, when the successor to the H100 is available in the cloud, these guys will _still_ be using the A100!

- The cloud provides lots of peripheral services that are a PitA to set up, secure, and manage locally. For example, EKS and AKS are managed Kubernetes clusters that can be used to efficiently bin-pack HPC compute jobs and restart jobs on spot instances if they're deallocated. Similarly, Azure CycleCloud provides managed Slurm clusters with auto-scaling and spot pricing. For Docker workloads there are managed container registries, and both single-instance and scalable "container apps" that work quite well for one-off batch jobs, Jupyter notebooks, and the like.

- In the cloud, it's easy to temporarily spin up a _true_ HPC cluster with 200 Gbps InfiniBand and a matching high-performance storage cache. It's like a tiny supercomputer, rented by the hour. On-prem, just buying a single InfiniBand switch will set you back more than $30K, and that's just the chassis. No cables, SFPs, or host adapters. A full setup is north of $100K. Good luck buying "cheap" storage that can keep up with that network!

Etc., etc...

hellodanylo over 2 years ago
[retracted]

xani_ over 2 years ago
It always was, for load that doesn't allow autoscaling to save you money; the savings were always in the convenience of not having to do ops and pay for ops.

Then again, part of the ops cost you save is paid again in the salaries of devs who have to deal with AWS stuff, instead of just throwing a blob of binaries over the wall and letting ops worry about the rest.