I’m at the integration testing and benchmarking phase of a Rust crate for allocating LLMs to GPUs and system RAM. The impetus is that single models are limited in what they can achieve, and more complex workflows require LLM swarms: think of a lot of smaller models doing reasoning steps or tasks like search, and then a big model to summarize it all.

It allocates a range of quants for each model across N devices, using DFS to find the most ideal allocation for the given set of models. Ideal here means the most tokens per second and the least time to initialize the allocation. I keep track of memory capacity, PCIe bandwidth, and link bandwidth (including NVLink). There’s a rough sketch of what I mean at the end of this post.

I intend to serve this behind an API using llama.cpp, so you can send a request and it will fetch the model needed to fulfill it, or create a new allocation to accommodate it. Sort of like llama-swap, but explicitly with the goal of letting you run as many LLMs as you need on your hardware.

Anyway, I just bring this up because I’m curious if anyone else has done something like this, or if it’s a problem worth solving. My dream is to take it out of my bedroom server and run it on something like Modal.
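
For anyone curious what the search looks like, here’s a heavily simplified sketch in Rust. Everything in it is illustrative rather than the crate’s actual API: each quant lands on a single device, the score is a toy trade-off between estimated tokens/sec and load time over PCIe, and the numbers are made up. The real allocator also splits a model across devices and accounts for link bandwidth like NVLink.

```rust
// Sketch only: one (quant, device) choice per model, toy scoring, made-up numbers.

#[derive(Clone, Debug)]
struct Device {
    name: &'static str,
    mem_bytes: u64,     // free VRAM or system RAM
    pcie_gb_per_s: f64, // host -> device bandwidth, drives the load-time estimate
}

#[derive(Clone, Debug)]
struct Quant {
    label: &'static str,
    size_bytes: u64,
    est_tps: f64, // rough tokens/sec estimate for this quant
}

// One (quant index, device index) choice per model.
type Allocation = Vec<(usize, usize)>;

/// Toy objective: total estimated tokens/sec minus total seconds to load.
fn score(alloc: &Allocation, models: &[Vec<Quant>], devices: &[Device]) -> f64 {
    alloc
        .iter()
        .enumerate()
        .map(|(m, &(q, d))| {
            let quant = &models[m][q];
            let load_secs = quant.size_bytes as f64 / (devices[d].pcie_gb_per_s * 1e9);
            quant.est_tps - load_secs
        })
        .sum()
}

/// DFS over (quant, device) choices for each model, skipping anything that
/// doesn't fit in a device's remaining memory, keeping the best allocation.
fn dfs(
    m: usize,
    models: &[Vec<Quant>],
    devices: &[Device],
    free: &mut [u64],
    current: &mut Allocation,
    best: &mut Option<(f64, Allocation)>,
) {
    if m == models.len() {
        let s = score(current, models, devices);
        if best.as_ref().map_or(true, |(b, _)| s > *b) {
            *best = Some((s, current.clone()));
        }
        return;
    }
    for (qi, quant) in models[m].iter().enumerate() {
        for di in 0..devices.len() {
            if free[di] >= quant.size_bytes {
                free[di] -= quant.size_bytes;
                current.push((qi, di));
                dfs(m + 1, models, devices, free, current, best);
                current.pop();
                free[di] += quant.size_bytes;
            }
        }
    }
}

fn main() {
    let gib = 1u64 << 30;
    let devices = [
        Device { name: "gpu0", mem_bytes: 24 * gib, pcie_gb_per_s: 25.0 },
        Device { name: "ram",  mem_bytes: 64 * gib, pcie_gb_per_s: 10.0 },
    ];
    // Two models, each with two candidate quants.
    let models = vec![
        vec![
            Quant { label: "7B-Q4_K_M",  size_bytes: 4 * gib,  est_tps: 80.0 },
            Quant { label: "7B-Q8_0",    size_bytes: 8 * gib,  est_tps: 60.0 },
        ],
        vec![
            Quant { label: "70B-Q2_K",   size_bytes: 26 * gib, est_tps: 20.0 },
            Quant { label: "70B-Q4_K_M", size_bytes: 40 * gib, est_tps: 15.0 },
        ],
    ];
    let mut free: Vec<u64> = devices.iter().map(|d| d.mem_bytes).collect();
    let mut best = None;
    dfs(0, &models, &devices, &mut free, &mut Vec::new(), &mut best);
    if let Some((s, alloc)) = best {
        println!("best score: {s:.2}");
        for (m, &(q, d)) in alloc.iter().enumerate() {
            println!("  model {m}: {} -> {}", models[m][q].label, devices[d].name);
        }
    }
}
```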