26× Faster Inference with Layer-Condensed KV Cache for Large Language Models

127 points by georgehill about 1 year ago

7 comments

vessenes about 1 year ago
Upshot of the paper -- right now KV caches are implemented for multiple deep layers in LLMs. Why not just the top layer? It would save memory.

Initial result -- those KV caches in lower layers matter, and output suffered.

Updated plan -- cull half the KV layers! This works 'nearly' as well as keeping all of them, with memory and compute savings.

Downside -- triple the training, worse out of band / long context performance.

This feels to me like a technique you'd use on a particular architecture deployed at the edge where compute matters and you have a little extra room on performance. Phi-3 on raspberry pi, basically.

Interesting! As always, I wish models showed prompt output in their papers, not just perplexity numbers. But, here we are.
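
A rough back-of-the-envelope sketch (in Python) of where the saving comes from, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dim 128, fp16, 4k context, batch 1); the numbers are illustrative, not taken from the paper:

# KV-cache size estimate; all dimensions here are assumed, not from the paper.
def kv_cache_bytes(cached_layers, kv_heads=32, head_dim=128,
                   seq_len=4096, batch=1, bytes_per_elem=2):
    # Factor of 2 covers the separate key and value tensors per cached layer.
    return 2 * cached_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

full = kv_cache_bytes(cached_layers=32)   # cache KVs for every layer
half = kv_cache_bytes(cached_layers=16)   # cull half the KV layers
print(f"full: {full / 2**30:.2f} GiB, halved: {half / 2**30:.2f} GiB "
      f"({full / half:.1f}x smaller)")    # full: 2.00 GiB, halved: 1.00 GiB (2.0x smaller)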
WhitneyLand about 1 year ago
Not sure if @dang is the right way to say the title is incorrect here, but shouldn't it match the paper?

1. The correct title is Layer-Condensed KV Cache for Efficient Inference of Large Language Models.

2. The paper does make a 26x claim later in the introduction, but it's an outlier.

26x is for only one benchmark, and that benchmark is CPU based, not GPU based like 99% of transformer loads actually run on.

If you look at GPU-only workloads, the improvements range from 1.4x to 4.7x.
joaquincabezas about 1 year ago
LLM inference optimization has been key for the OpenAI GPT-4o presentation (2x faster, 50% cheaper) and it's driving lots of industry research because it's direct cost savings, but it's refreshing to see so many techniques published as papers (e.g. from Stanford, Berkeley…)
jasonjmcghee about 1 year ago
> please use the original title, unless it is misleading or linkbait; don't editorialize.

"Layer-Condensed KV Cache for Efficient Inference of Large Language Models"
jsemrau about 1 year ago
"Our implementation is based on HuggingFace transformers where we register a new model opt-llama that supports the Layer-Condensed KV Cache."

Not sure what this means? Would this work for a Mistral model as well?
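
The quoted sentence most likely refers to the custom-model registration hooks in HuggingFace transformers; a minimal Python sketch of that mechanism is below (the class names are hypothetical stand-ins, not the paper's actual code):

from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaForCausalLM

# Hypothetical classes; the paper's repo defines its own "opt-llama" variants.
class OptLlamaConfig(LlamaConfig):
    model_type = "opt-llama"

class OptLlamaForCausalLM(LlamaForCausalLM):
    config_class = OptLlamaConfig
    # The real implementation would override the attention/cache logic here
    # so that only a subset of layers keeps a key-value cache.

# Registration lets AutoModelForCausalLM.from_pretrained resolve the new model type.
AutoConfig.register("opt-llama", OptLlamaConfig)
AutoModelForCausalLM.register(OptLlamaConfig, OptLlamaForCausalLM)

If the integration is tied to the Llama modeling code like this, Mistral support would presumably need an analogous subclass of the Mistral classes, but that is a guess from the quoted sentence, not something the paper states.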
tripplyons about 1 year ago
This can be combined with Grouped Query Attention or Multi-Query Attention for an even further reduction in the size of the KV Cache!
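
A quick arithmetic sketch (in Python) of why the two reductions multiply; the dimensions are illustrative Llama-2-7B-style numbers, and the 8-KV-head GQA setting is an assumption, not a figure from the paper:

# Per-token KV-cache bytes: 2 (K and V) * cached_layers * kv_heads * head_dim * bytes/elem
def kv_bytes_per_token(cached_layers, kv_heads, head_dim=128, bytes_per_elem=2):
    return 2 * cached_layers * kv_heads * head_dim * bytes_per_elem

mha           = kv_bytes_per_token(cached_layers=32, kv_heads=32)  # full MHA baseline
gqa           = kv_bytes_per_token(cached_layers=32, kv_heads=8)   # GQA alone
gqa_condensed = kv_bytes_per_token(cached_layers=16, kv_heads=8)   # GQA + half the KV layers
print(mha // gqa, mha // gqa_condensed)  # 4 8 -> the savings compose multiplicatively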
vlovich123 about 1 year ago
Is the KV cache something that runs on the GPU or on the CPU? Or traditionally on the CPU & this enables it to run on the GPU?