
'I paid for the whole GPU, I am going to use the whole GPU'

154 points by mooreds 6 days ago

14 comments

mooreds 6 days ago
The subtitle (which is important but was too long for the HN submission) is "A high-level guide to GPU utilization".
J_Shelby_J 6 days ago
I'm at the integration testing and benchmarking phase of a Rust crate for allocating LLMs across GPUs and system RAM. The impetus is that single models are limited in what they can achieve, and more complex workflows require LLM swarms: think a lot of smaller models doing reasoning steps or tasks like search, and then a big model to summarize it all.

It allocates a range of quants for each model across N devices, using DFS to find the most ideal allocation for the given set of models. Ideal here means the most tokens per second and the least time to initialize the allocation. I keep track of memory capacity, PCIe bandwidth, and link bandwidth (including NVLink).

I intend to serve this behind an API using llama.cpp, so you can send a request to the API and it will fetch the model to fulfill the request, or create a new allocation to accommodate it. Sort of like llama-swap, but explicitly with the goal of enabling as many LLMs as you need to run on your hardware.

Anyway, I just bring this up because I'm curious if anyone else has done something like this, or if it's a problem worth solving. My dream is to take it out of my bedroom server and run it on something like Modal.
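A minimal sketch of the kind of DFS search this comment describes, assuming the objective is aggregate tokens per second and modeling only the memory-capacity constraint (the crate as described also scores PCIe/NVLink bandwidth and time-to-initialize); it is not the commenter's code, and all names and numbers are invented:

```python
# Sketch of a DFS allocator for (model, quant) -> device placement.
# NOT the actual crate: only VRAM capacity is modeled, and the figures
# below are made up for illustration.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    free_gb: float                      # remaining VRAM

@dataclass
class Quant:
    label: str                          # e.g. "Q4_K_M"
    size_gb: float
    est_tps: float                      # estimated tokens/sec at this quant

def dfs(models, devices, plan=(), best=None):
    if not models:                      # every model placed: score this plan
        score = sum(q.est_tps for _, q, _ in plan)
        return (score, plan) if best is None or score > best[0] else best
    name, quants = models[0]
    for q in quants:
        for d in devices:
            if d.free_gb >= q.size_gb:  # prune placements that don't fit
                d.free_gb -= q.size_gb
                best = dfs(models[1:], devices, plan + ((name, q, d.name),), best)
                d.free_gb += q.size_gb  # backtrack
    return best

devices = [Device("gpu0", 24.0), Device("gpu1", 12.0)]
models = [
    ("planner-70b", [Quant("Q4_K_M", 40.0, 12.0), Quant("Q2_K", 22.0, 9.0)]),
    ("worker-8b",   [Quant("Q8_0", 9.0, 55.0), Quant("Q4_K_M", 5.0, 70.0)]),
]
print(dfs(models, devices))             # best (score, placements) found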
charles_irl 6 days ago
Oh, I wrote this! Thanks for sharing it.
Comment #43921469 not loaded
Mockapapella 6 days ago
This is a good article on the "fog of war" for GPU inference. Modal has been doing a great job of aggregating and disseminating info on how to think about high quality AI inference. Learned some fun stuff -- thanks for posting it.

> the majority of organizations achieve less than 70% GPU Allocation Utilization when running at peak demand — to say nothing of aggregate utilization. This is true even of sophisticated players, like the former Banana serverless GPU platform, which operated at an aggregate utilization of around 20%.

Saw this sort of thing at my last job. Was very frustrating pointing this out to people only for them to respond with ¯\_(ツ)_/¯. I posted a much less tactful article (read: rant) than the one by Modal, but I think it still touches on a lot of the little things you need to consider when deploying AI models: https://thelisowe.substack.com/p/you-suck-at-deploying-ai-models
Comment #43921345 not loaded
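To make the quoted distinction concrete, a back-of-envelope sketch; the fleet size and hours below are invented, only the definitions follow the article's usage:

```python
# Allocation utilization = busy GPU-seconds / allocated GPU-seconds at peak.
# Aggregate utilization  = busy GPU-seconds / paid-for GPU-seconds, all hours.
paid_gpu_hours = 24 * 8                  # 8 GPUs reserved around the clock
peak_hours = 6                           # demand only peaks 6 hours a day
allocated_at_peak = 8 * peak_hours
busy_at_peak = 0.65 * allocated_at_peak  # <70% allocation utilization
busy_off_peak = 4                        # a trickle of background work

allocation_util = busy_at_peak / allocated_at_peak
aggregate_util = (busy_at_peak + busy_off_peak) / paid_gpu_hours
print(f"allocation utilization at peak: {allocation_util:.0%}")  # 65%
print(f"aggregate utilization:          {aggregate_util:.0%}")   # ~18%
```

Even a fleet that looks respectably busy at peak can land near the ~20% aggregate figure quoted for Banana once idle hours are counted.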
alexjplant 6 days ago
OT: I'm not really sure what the author meant by

> Graphics Processing Units, or GPUs, are the hottest mathematical co-processor since the FM synthesis chips that shaped the sounds of the 1990s

since FM was more of an 80s thing. Even their linked comment says

> Throughout the 90s FM was old-hat. Nobody wanted to hear those woody clangy sounds of the 80s anymore.

FM synthesis has kept being a thing ever since in specific applications, but the zeitgeist of the 90s (and its modern postmodern retreads like vaporwave) is arguably digital sampling.
Comment #43927992 not loaded
semessier 6 days ago
Well, related: fractional GPUs that multiplex workloads for aggregate utilization have been a topic for some time, with no definitive (NVIDIA) solution yet: https://vhpc.org
Comment #43922026 not loaded
kristianpaul 6 days ago
I'm still trying to use all my CPUs...
drob518 6 days ago
And we’re back to time-sharing.
Comment #43921980 not loaded
Comment #43922271 not loaded
r3tr0 6 days ago
We spend a lot of time on getting these measurements with eBPF.

You can check us out at https://yeet.cx

Here's an overview of our GPU-specific solution:

https://yeet.cx/solutions/maximize-ai-infra-roi
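For a quick look at the same signals without kernel-side eBPF instrumentation, a minimal userspace sampler using the standard pynvml (nvidia-ml-py) bindings; this is a rough stand-in, not the commenter's tooling:

```python
# Poll coarse per-GPU utilization and memory from NVML once a second.
# These are the driver's sampled counters, far less granular than eBPF traces.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % over last sample period
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"SM busy: {util.gpu:3d}%  VRAM: {mem.used / mem.total:.0%}")
    time.sleep(1)
pynvml.nvmlShutdown()
```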
freeqaz 6 days ago
How fast are modern GPU boxes able to spin up these days? Loading a massive blob of weights into VRAM feels like it's gotta be the bottleneck even if server provisioning is fast.

Or am I naive and my knowledge is outdated? I am genuinely curious what people see and what providers are capable of in 2025.
Comment #43921752 not loaded
Comment #43921531 not loaded
Comment #43921822 not loaded
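For rough intuition on the weight-loading question above, a back-of-envelope calculation; the bandwidths are rule-of-thumb practical figures, not measurements of any particular provider:

```python
# Time to stream a 70B-parameter FP16 model (~140 GB) over various links.
weights_gb = 70e9 * 2 / 1e9              # 2 bytes per FP16 parameter

links = {
    "NVMe SSD   (~7 GB/s)":  7,
    "PCIe 4 x16 (~25 GB/s)": 25,
    "PCIe 5 x16 (~50 GB/s)": 50,
}
for name, gb_per_s in links.items():
    print(f"{name}: {weights_gb / gb_per_s:5.1f} s for {weights_gb:.0f} GB")
```

So even with provisioning out of the way, a cold load is seconds to tens of seconds per GPU, which is why weight loading tends to dominate cold-start time.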
pavelstoev 6 days ago
GPU sharing is a concern for sensitive data. It is more appropriate to increase the utilization rate of GPU chip internals via a variety of low-level (CUDA and below) optimizations.
Comment #43925523 not loaded
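One common lever of the kind this comment alludes to is overlapping transfers with compute on separate CUDA streams, so the copy engines and SMs are busy at the same time. A sketch in PyTorch, illustrative only and not from the article:

```python
# Stage host->device copies on a side stream so the copy of batch i+1
# can overlap the matmul of batch i on the default stream.
import torch

assert torch.cuda.is_available()
copy_stream = torch.cuda.Stream()
weights = torch.randn(4096, 4096, device="cuda")
batches = [torch.randn(4096, 4096, pin_memory=True) for _ in range(4)]

outputs = []
for cpu_batch in batches:
    with torch.cuda.stream(copy_stream):
        gpu_batch = cpu_batch.to("cuda", non_blocking=True)  # async DMA copy
    # Order this batch's matmul after its own copy; the next iteration's
    # copy still overlaps this matmul, which is where the win comes from.
    torch.cuda.current_stream().wait_stream(copy_stream)
    outputs.append(weights @ gpu_batch)
torch.cuda.synchronize()
```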
cubefox 6 days ago
For anyone thinking this is about video games:

> We'll specifically focus on neural network inference workloads
Comment #43920876 not loaded
awesome_dude 6 days ago
I'm old enough to remember when people would be concerned if their CPU usage went to 100%.
Comment #43921231 not loaded
Comment #43921195 not loaded
mwilcox 6 days ago
Understandable