I’m at the integration testing and benchmarking phase of a Rust crate for allocating LLMs to GPUs and system RAM. The impetus is that single models are limited in what they can achieve, and more complex workflows require LLM swarms: think of a lot of smaller models doing reasoning steps or tasks like search, and then a big model to summarize it all.

It allocates a range of quants for each model across N devices, using DFS to find the most ideal allocation for the given set of models. Ideal here means the most tokens per second and the least time to initialize the allocation. I keep track of memory capacity, PCIe bandwidth, and link bandwidth (including NVLink). There’s a rough sketch of what I mean at the end of this post.

I intend to serve this behind an API using llama.cpp, so you can send a request and it will fetch the model needed to fulfill it, or create a new allocation to accommodate it. Sort of like llama-swap, but explicitly with the goal of letting you run as many LLMs as you need on your hardware.

Anyway, I just bring this up because I’m curious if anyone else has done something like this, or if it’s a problem worth solving. My dream is to take it out of my bedroom server and run it on something like Modal.
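
For anyone curious what the search looks like, here’s a heavily simplified sketch in Rust. Everything in it is illustrative rather than the crate’s actual API: each quant lands on a single device, the score is a toy trade-off between estimated tokens/sec and load time over PCIe, and the numbers are made up. The real allocator also splits a model across devices and accounts for link bandwidth like NVLink.

```rust
// Sketch only: one (quant, device) choice per model, toy scoring, made-up numbers.

#[derive(Clone, Debug)]
struct Device {
    name: &'static str,
    mem_bytes: u64,     // free VRAM or system RAM
    pcie_gb_per_s: f64, // host -> device bandwidth, drives the load-time estimate
}

#[derive(Clone, Debug)]
struct Quant {
    label: &'static str,
    size_bytes: u64,
    est_tps: f64, // rough tokens/sec estimate for this quant
}

// One (quant index, device index) choice per model.
type Allocation = Vec<(usize, usize)>;

/// Toy objective: total estimated tokens/sec minus total seconds to load.
fn score(alloc: &Allocation, models: &[Vec<Quant>], devices: &[Device]) -> f64 {
    alloc
        .iter()
        .enumerate()
        .map(|(m, &(q, d))| {
            let quant = &models[m][q];
            let load_secs = quant.size_bytes as f64 / (devices[d].pcie_gb_per_s * 1e9);
            quant.est_tps - load_secs
        })
        .sum()
}

/// DFS over (quant, device) choices for each model, skipping anything that
/// doesn't fit in a device's remaining memory, keeping the best allocation.
fn dfs(
    m: usize,
    models: &[Vec<Quant>],
    devices: &[Device],
    free: &mut [u64],
    current: &mut Allocation,
    best: &mut Option<(f64, Allocation)>,
) {
    if m == models.len() {
        let s = score(current, models, devices);
        if best.as_ref().map_or(true, |(b, _)| s > *b) {
            *best = Some((s, current.clone()));
        }
        return;
    }
    for (qi, quant) in models[m].iter().enumerate() {
        for di in 0..devices.len() {
            if free[di] >= quant.size_bytes {
                free[di] -= quant.size_bytes;
                current.push((qi, di));
                dfs(m + 1, models, devices, free, current, best);
                current.pop();
                free[di] += quant.size_bytes;
            }
        }
    }
}

fn main() {
    let gib = 1u64 << 30;
    let devices = [
        Device { name: "gpu0", mem_bytes: 24 * gib, pcie_gb_per_s: 25.0 },
        Device { name: "ram",  mem_bytes: 64 * gib, pcie_gb_per_s: 10.0 },
    ];
    // Two models, each with two candidate quants.
    let models = vec![
        vec![
            Quant { label: "7B-Q4_K_M",  size_bytes: 4 * gib,  est_tps: 80.0 },
            Quant { label: "7B-Q8_0",    size_bytes: 8 * gib,  est_tps: 60.0 },
        ],
        vec![
            Quant { label: "70B-Q2_K",   size_bytes: 26 * gib, est_tps: 20.0 },
            Quant { label: "70B-Q4_K_M", size_bytes: 40 * gib, est_tps: 15.0 },
        ],
    ];
    let mut free: Vec<u64> = devices.iter().map(|d| d.mem_bytes).collect();
    let mut best = None;
    dfs(0, &models, &devices, &mut free, &mut Vec::new(), &mut best);
    if let Some((s, alloc)) = best {
        println!("best score: {s:.2}");
        for (m, &(q, d)) in alloc.iter().enumerate() {
            println!("  model {m}: {} -> {}", models[m][q].label, devices[d].name);
        }
    }
}
```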