26× Faster Inference with Layer-Condensed KV Cache for Large Language Models

127 points by georgehill about 1 year ago

7 comments

vessenes about 1 year ago
Upshot of the paper -- right now KV caches are implemented for multiple deep layers in LLMs. Why not just the top layer? It would save memory.

Initial result -- those KV caches in lower layers matter, and output suffered.

Updated plan -- cull half the KV layers! This works 'nearly' as well as keeping all of them, with memory and compute savings.

Downside -- triple the training, worse out of band / long context performance.

This feels to me like a technique you'd use on a particular architecture deployed at the edge where compute matters and you have a little extra room on performance. Phi-3 on raspberry pi, basically.

Interesting! As always, I wish models showed prompt output in their papers, not just perplexity numbers. But, here we are.
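
A rough back-of-the-envelope sketch (in Python) of where the saving comes from, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dim 128, fp16, 4k context, batch 1); the numbers are illustrative, not taken from the paper:

# KV-cache size estimate; all dimensions here are assumed, not from the paper.
def kv_cache_bytes(cached_layers, kv_heads=32, head_dim=128,
                   seq_len=4096, batch=1, bytes_per_elem=2):
    # Factor of 2 covers the separate key and value tensors per cached layer.
    return 2 * cached_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

full = kv_cache_bytes(cached_layers=32)   # cache KVs for every layer
half = kv_cache_bytes(cached_layers=16)   # cull half the KV layers
print(f"full: {full / 2**30:.2f} GiB, halved: {half / 2**30:.2f} GiB "
      f"({full / half:.1f}x smaller)")    # full: 2.00 GiB, halved: 1.00 GiB (2.0x smaller)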
WhitneyLand about 1 year ago
Not sure if @dang is the right way to say the title is incorrect here, but shouldn't it match the paper?

1. The correct title is Layer-Condensed KV Cache for Efficient Inference of Large Language Models.

2. The paper does make a 26x claim later in the introduction, but it's an outlier.

26x is for only one benchmark, and that benchmark is CPU based, not GPU based like 99% of transformer loads actually run on.

If you look at GPU-only workloads, the improvements range from 1.4x to 4.7x.
joaquincabezas about 1 year ago
LLM inference optimization has been key for the OpenAI GPT-4o presentation (2x faster, 50% cheaper) and it's driving lots of industry research because it's direct cost savings, but it's refreshing to see so many techniques published as papers (e.g. from Stanford, Berkeley…)
jasonjmcghee about 1 year ago
> please use the original title, unless it is misleading or linkbait; don't editorialize.

"Layer-Condensed KV Cache for Efficient Inference of Large Language Models"
jsemrau about 1 year ago
"Our implementation is based on HuggingFace transformers where we register a new model opt-llama that supports the Layer-Condensed KV Cache."

Not sure what this means? Would this work for a Mistral model as well?
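
The quoted sentence most likely refers to the custom-model registration hooks in HuggingFace transformers; a minimal Python sketch of that mechanism is below (the class names are hypothetical stand-ins, not the paper's actual code):

from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaForCausalLM

# Hypothetical classes; the paper's repo defines its own "opt-llama" variants.
class OptLlamaConfig(LlamaConfig):
    model_type = "opt-llama"

class OptLlamaForCausalLM(LlamaForCausalLM):
    config_class = OptLlamaConfig
    # The real implementation would override the attention/cache logic here
    # so that only a subset of layers keeps a key-value cache.

# Registration lets AutoModelForCausalLM.from_pretrained resolve the new model type.
AutoConfig.register("opt-llama", OptLlamaConfig)
AutoModelForCausalLM.register(OptLlamaConfig, OptLlamaForCausalLM)

If the integration is tied to the Llama modeling code like this, Mistral support would presumably need an analogous subclass of the Mistral classes, but that is a guess from the quoted sentence, not something the paper states.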
tripplyons about 1 year ago
This can be combined with Grouped Query Attention or Multi-Query Attention for an even further reduction in the size of the KV Cache!
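
A quick arithmetic sketch (in Python) of why the two reductions multiply; the dimensions are illustrative Llama-2-7B-style numbers, and the 8-KV-head GQA setting is an assumption, not a figure from the paper:

# Per-token KV-cache bytes: 2 (K and V) * cached_layers * kv_heads * head_dim * bytes/elem
def kv_bytes_per_token(cached_layers, kv_heads, head_dim=128, bytes_per_elem=2):
    return 2 * cached_layers * kv_heads * head_dim * bytes_per_elem

mha           = kv_bytes_per_token(cached_layers=32, kv_heads=32)  # full MHA baseline
gqa           = kv_bytes_per_token(cached_layers=32, kv_heads=8)   # GQA alone
gqa_condensed = kv_bytes_per_token(cached_layers=16, kv_heads=8)   # GQA + half the KV layers
print(mha // gqa, mha // gqa_condensed)  # 4 8 -> the savings compose multiplicatively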
vlovich123 about 1 year ago
Is the KV cache something that runs on the GPU or on the CPU? Or traditionally on the CPU & this enables it to run on the GPU?