> FlexGen lowers the resource requirements of running
175B-scale models down to a single 16GB GPU and reaches a generation throughput of 1 token/s with an effective batch size
of 144.

I can't imagine what the LLM space will look like this time next year. Maybe LLMs natively integrated into games and browsers.