This is not a memory-reduction technique that's somehow magical; it manages memory with some clever scheduling. The core idea is that you can schedule inference across edge nodes in a memory- and bandwidth-optimized way that's a bit different from just splitting layers.

They argue that computation and latency currently dominate the cost of multi-node inference, and they pick a network topology (a star) that is suited to that.

That said, it's 26-29 seconds per token for llama2-70b on their 8 edge devices, each using 4 GB of RAM. It's amazing that they can run it at all, but this isn't going to be viable at the edge on current hardware.

I think the paper does make the case that you could recruit, say, 30 graphics workstations to do much faster inference without saturating your LAN, though.

Upshot: an interesting paper with smart ideas. Large frontier models still need very exotic hardware and high-bandwidth interconnects; this may point a way forward on the interconnect part of the story.
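For a sense of scale, here's a back-of-envelope sketch of the activation traffic a star layout needs per generated token. The even layer split, fp16 activations, and hub-and-spoke hop pattern are my own assumptions for illustration, not the paper's actual schedule:

    # Rough estimate: per-token activation traffic when llama2-70b's 80 layers
    # are spread over 8 edge devices coordinated through a central hub (star).
    # All numbers below are assumptions for illustration, not from the paper.

    HIDDEN_SIZE = 8192    # llama2-70b hidden dimension
    NUM_LAYERS = 80       # llama2-70b decoder layers
    NUM_DEVICES = 8
    BYTES_PER_ACT = 2     # fp16 activations

    def per_token_traffic_star(layers_per_hop: int) -> int:
        """Bytes through the hub per token, if each worker returns its partial
        result to the hub after processing `layers_per_hop` contiguous layers."""
        hops = NUM_LAYERS // layers_per_hop
        # one hidden-state vector out to a worker and one back, per hop
        return hops * 2 * HIDDEN_SIZE * BYTES_PER_ACT

    traffic = per_token_traffic_star(NUM_LAYERS // NUM_DEVICES)
    print(f"~{traffic / 1024:.0f} KiB of activations per token")  # ~256 KiB

A quarter of a MiB per token is nothing for a LAN, which is consistent with compute and per-device memory shuffling, rather than activation bandwidth, being the bottleneck behind the 26-29 s/token figure.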
While I do think there's going to be a huge market for cloud-based LLM serving, the fact that consumer hardware can run close-to-SOTA models fairly easily (e.g. a high-RAM MBP config) suggests to me that the provider market won't be as big as investors are betting on.

Most of the rewards will be reaped by consumers rather than providers.

We're also in an age where the RAM levels in consumer devices were settled on before LLMs existed. I find it highly likely that vendors will prioritize higher RAM capacity over other features in future hardware.

How long until a 256 GB RAM laptop (shared with the GPU) is reasonably cheap and available? I give it a few years at most.

It's possible that models grow orders of magnitude larger, but I find it more likely that model sizes will grow along the curve of falling training costs and improving hardware. There will be a sweet spot where it's economical to train larger models, and private companies won't push much beyond that.
It would be nice for the inference time to be paired with a measure of output quality. I'm not well versed in how the architecture works, but I have a hard time believing a 90% reduction in peak memory footprint comes cost-free.
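One cheap check would be perplexity on a shared text: run the same weights through the stock runtime and through the memory-optimized one and compare. A minimal sketch, assuming a Hugging Face transformers setup (which the paper may not use); the model id and sample text are placeholders:

    # Minimal perplexity check: if the memory-optimized path is exact
    # (pure scheduling, no quantization), the two numbers should match.
    # Model id and sample text are placeholders, not from the paper.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "gpt2"  # small stand-in so the sketch runs anywhere

    def perplexity(model, tokenizer, text: str) -> float:
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        return torch.exp(out.loss).item()

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    baseline = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()
    sample = "The quick brown fox jumps over the lazy dog."
    print("baseline ppl:", perplexity(baseline, tokenizer, sample))
    # ...then load the same checkpoint through the memory-optimized
    # runtime and compare its perplexity on the same text...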
Is this different from (or related to) the work being done by the exo project?

https://github.com/exo-explore/exo
While training seems to be out of reach for the average tech user unless they have a data-center homelab or a very large income, SOTA models can be run easily on edge devices, either on a phone or on a dedicated computer/server.

LocalLLAMA, together with open weights and open datasets, has really helped show that this can be done if you have enough resources and motivation.