Serving 70B-scale LLMs efficiently on low-resource edge devices [pdf]

248 points by simonpure 8 months ago

7 comments

vessenes 8 months ago
This is not a memory-reduction technique that's somehow magical; it manages memory with some clever scheduling. The core idea is that you can schedule inference across edge nodes in a memory- and bandwidth-optimized way that's a bit different from just splitting layers.

They propose that right now computation and latency dominate the costs of multi-node inference, and they pick a network topology (a star) that is savvy to that.

That said, it's 26-29 seconds per token for llama2-70b with their 8 edge devices, each using 4 GB of RAM. It's amazing that they can run it at all, but this isn't going to be viable at the edge with current hardware.

I think the paper does make the case that you could recruit, say, your 30 graphics workstations to do much faster inference without just nailing your LAN bandwidth, though.

Upshot: interesting paper with smart ideas. Large frontier models still need very exotic hardware and interconnect bandwidth; this may point a way forward on the interconnect-bandwidth part of the story.
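A back-of-envelope sketch of that cost model (Python; every number and the two-hops-per-stage star assumption are illustrative guesses of mine, not figures from the paper):

    # Rough per-token latency for single-stream generation with model layers
    # split across edge devices connected in a star topology.

    def per_token_latency(n_params, device_flops, n_devices,
                          hidden_size, bytes_per_act,
                          link_bandwidth, link_latency):
        # ~2 FLOPs per parameter per generated token. For a single token the
        # pipeline stages run one after another, so splitting layers across
        # devices distributes memory, not compute time.
        compute_s = 2.0 * n_params / device_flops

        # Each stage hands ~hidden_size activations to the next stage; in a
        # star topology that is two hops (device -> hub -> device).
        act_bytes = hidden_size * bytes_per_act
        hops = 2 * (n_devices - 1)
        comm_s = hops * (act_bytes / link_bandwidth + link_latency)

        return compute_s + comm_s

    # 8 devices, a 70B model, modest edge-class compute, ~100 Mbit/s links.
    print(per_token_latency(
        n_params=70e9,
        device_flops=10e9,       # assumed sustained FLOP/s per device
        n_devices=8,
        hidden_size=8192,
        bytes_per_act=2,         # fp16 activations
        link_bandwidth=12.5e6,   # ~100 Mbit/s, in bytes/s
        link_latency=1e-3,       # 1 ms per hop
    ))
    # With these numbers the ~14 s of compute dwarfs the ~0.03 s of
    # communication, i.e. computation rather than raw bandwidth dominates.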
adam_arthur 8 months ago
While I do think there's going to be a huge market for cloud-based LLM serving, the fact that consumer hardware can run close-to-SOTA models fairly easily (e.g. a high-RAM MBP config) suggests to me that the provider market won't be as big as investors are betting on.

Most of the rewards will be reaped by consumers rather than providers.

We're also in an age where the amount of RAM in consumer devices was settled on before LLMs existed. I find it highly likely vendors will prioritize higher RAM capacity over other things in future hardware.

How long until a 256GB RAM laptop (shared with the GPU) is reasonably cheap and available? I give it a few years at most.

It's possible that models grow orders of magnitude larger, but I find it more likely that model sizes will grow along the curve of falling training costs and improving hardware. There will be a sweet spot where it's economical to train larger models, and private companies won't push much beyond that.
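For context on the RAM question, a rough back-of-envelope on weight memory alone (ignores KV cache and activations, so real requirements are higher; the 200B row is a made-up size for illustration):

    # Memory needed just to hold model weights at different precisions.

    def weight_memory_gb(n_params, bits_per_param):
        return n_params * bits_per_param / 8 / 1e9

    for n_params, label in [(70e9, "70B"), (200e9, "200B (hypothetical)")]:
        for bits in (16, 8, 4):
            print(f"{label:>20} @ {bits:>2}-bit: "
                  f"{weight_memory_gb(n_params, bits):6.0f} GB")

    # 70B needs ~140 GB at fp16 but only ~35 GB at 4-bit; a hypothetical
    # 200B model fits in a 256 GB shared-memory laptop only at 8-bit or below.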
loufe 8 months ago
It would be nice for the inference time to be paired with a measure of output quality. I'm not well versed in how the architecture works, but I have a hard time believing a 90% reduction in peak memory footprint comes cost-free.
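A minimal sketch of pairing the two measurements; generate_logprobs is a hypothetical stand-in for whatever inference stack is being benchmarked, and perplexity is just one convenient quality proxy:

    import math
    import time

    def perplexity(token_logprobs):
        # exp of the average negative log-likelihood per token
        return math.exp(-sum(token_logprobs) / len(token_logprobs))

    def benchmark(generate_logprobs, eval_tokens):
        # generate_logprobs(tokens) -> one log-probability per token
        start = time.perf_counter()
        logprobs = generate_logprobs(eval_tokens)
        elapsed = time.perf_counter() - start
        return {
            "seconds_per_token": elapsed / len(eval_tokens),
            "perplexity": perplexity(logprobs),
        }

    # Run the same benchmark on the baseline and on the memory-reduced setup
    # and compare both numbers, not just the latency.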
Zetaphor 8 months ago
Is this different from (or related to) the work being done by the exo project?

https://github.com/exo-explore/exo
tonetegeatinst 8 months ago
While training seems to be out of reach for the average tech user unless they have a data center for a homelab or a very large income, SOTA models can easily be run on edge devices, either on a phone or a dedicated computer/server.

LocalLLaMA, open weights, and open datasets have really helped show that this can be done if you have enough resources and motivation.
dvh 8 months ago
So when will I be able to "sudo apt-get install llm"?
tgtweak 8 months ago
Is there a CUDA implementation of this... asking for a friend