Punica: Serving multiple LoRA finetuned LLM as one

135 points by abcdabcd987, over 1 year ago

11 comments

huac, over 1 year ago
I think this is one of the most important possible works for open source LLMs, really glad y'all pushed this forward!

That's not hyperbole. Why is OpenAI able to charge so little for their APIs? I have heard rival mega-LLM-company CEOs complain that OpenAI's prices would be a loss for their rivals. But I think it's still positive margin, and that they can charge low prices for the API because they've invested more into managing the infra, sure, but most importantly because they have the best utilization of their existing hardware.

If it costs everyone $X/gpu/hr to serve models, the company that has the most throughput wins on price. In a world without finetunes, the most capable model, the one that can zero- or few-shot the most tasks, will have the most usage. Finetuned open models can reach parity with GPT on narrow tasks, but until now, having public providers serve the models was expensive. Your private finetune is only going to be queried by you, not everyone, so it's super expensive to serve on a per-token level. With hot-swappable LoRA adapters, that calculus changes, and the cost per token can go way down. Super, super exciting!
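A quick back-of-the-envelope version of that utilization argument, with made-up numbers (a sketch of the reasoning, not figures from the Punica work):

```python
# Hypothetical numbers: at a fixed $/GPU/hr, cost per token is driven
# entirely by sustained throughput, i.e. how well the hardware is utilized.
gpu_cost_per_hour = 2.00        # assumed $/GPU/hr
tokens_per_second = 1500        # assumed aggregate throughput at high utilization

cost_per_million_tokens = gpu_cost_per_hour / (tokens_per_second * 3600) * 1e6
print(f"${cost_per_million_tokens:.2f} per 1M tokens")  # ~$0.37

# A private finetune served on its own GPU might see a small fraction of
# that traffic, multiplying its per-token cost accordingly -- unless many
# adapters can share one batch, which is the point of hot-swappable LoRA.
```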
kcorbitt, over 1 year ago
Awesome work! Here's a recent paper released yesterday, also focused on efficiently serving many LoRAs simultaneously: https://arxiv.org/abs/2311.03285

Really looking forward to these innovations becoming more widespread -- I expect we're very close to a world where training a LoRA on a one-off task like "review every HN post from the last 3 years and flag any of them that contain informed speculation about the architecture of GPT-4" will be easy, cheap and routine.
Palmik, over 1 year ago
This is amazing, and will unlock many possibilities. I just recently read the S-LoRA paper, which is related, but it's even better to have a working (and extremely efficient!) implementation.

How hard would it be to adapt your kernels to work with the new-gen quants like AWQ or EXL2?
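For readers wondering why that combination is plausible at all: the base matmul and the LoRA matmul are separate terms, so the adapter math can stay in fp16 regardless of how the base weights are quantized. A minimal sketch of that decomposition (illustrative only; `quantized_base_matmul` is a stand-in, not the actual AWQ/EXL2 or Punica interface):

```python
import torch

def quantized_base_matmul(x, W):
    # Stand-in for an AWQ/EXL2-style quantized kernel; here just a plain matmul.
    return x @ W.T

d, r = 1024, 16
W = torch.randn(d, d) * 0.02   # base weight (would be stored quantized)
A = torch.randn(r, d) * 0.02   # LoRA down projection, kept in fp16
B = torch.randn(d, r) * 0.02   # LoRA up projection, kept in fp16
x = torch.randn(1, d)

# Output = quantized base term + full-precision low-rank adapter term.
y = quantized_base_matmul(x, W) + (x @ A.T) @ B.T
```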
vlovich123, over 1 year ago
Am I correct in understanding that LoRA is basically a way to cheaply create "delta" LLMs that apply onto the main large one to create a specialization? In other words, this would obviate all the vector DB stuff that people are doing, right?
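That is roughly the idea: a LoRA finetune stores only a low-rank delta on top of the shared base weights. A minimal sketch with illustrative shapes (not Punica's code):

```python
import torch

d, r = 4096, 16                       # hidden size and (much smaller) adapter rank
W = torch.randn(d, d) * 0.02          # frozen base weight, shared by every finetune
A = torch.randn(r, d) * 0.02          # adapter "down" projection
B = torch.zeros(d, r)                 # adapter "up" projection (initialized to zero)

x = torch.randn(1, d)                 # one token's activation
y_base    = x @ W.T                   # base model
y_adapted = y_base + (x @ A.T) @ B.T  # base output plus the adapter's low-rank delta

# Storage per adapted matrix: 2*d*r values instead of d*d.
print(W.numel(), A.numel() + B.numel())   # 16,777,216 vs 131,072
```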
yyding, over 1 year ago
Good job! I noticed that you implemented many CUDA kernels yourselves. Just wondering about your considerations or trade-offs between implementing the kernels in pure CUDA code vs. building on a compiler like TVM/Triton.
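For context on what those kernels have to do, here is a naive PyTorch reference of the batched "each request uses its own adapter" step (illustrative only); the per-request loop of tiny matmuls is what a fused kernel, whether hand-written in CUDA or generated via TVM/Triton, would replace:

```python
import torch

n_adapters, d, r, batch = 4, 1024, 16, 8
A = torch.randn(n_adapters, r, d) * 0.02      # per-adapter down projections
B = torch.randn(n_adapters, d, r) * 0.02      # per-adapter up projections
x = torch.randn(batch, d)                     # one token per request
idx = torch.randint(0, n_adapters, (batch,))  # which adapter each request uses

delta = torch.empty(batch, d)
for i in range(batch):                        # the loop a fused kernel removes
    a, b = A[idx[i]], B[idx[i]]
    delta[i] = (x[i] @ a.T) @ b.T             # this request's low-rank correction
```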
j0057, over 1 year ago
That name is easy to confuse with the unrelated LoRa and LoRaWAN.
lmeyerov, over 1 year ago
Super cool!

I'm curious if there is a quality argument to be made: imagine needing to finetune k different classifiers...

Before this work, we could train a single multi-label classifier by pooling the training sets, and deploy it as 1 LoRA.

Now, we can have k distinct classifiers, and not risk them interfering with one another.

Any sense of, in realistic scenarios, when the quality of k distinct LoRAs would be better?
kkielhofner, over 1 year ago
Nice!

Any thoughts as to how this would come together with serving frameworks like vLLM, lmdeploy, Triton Inference Server, etc.?
junrushao1994, over 1 year ago
This is great! Have you guys considered integrating with one of the existing systems?
ruihangl, over 1 year ago
Great work! I am curious how much effort it would take to support LoRAs with different ranks?
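One simple way to batch adapters of different ranks, purely as an illustration (not necessarily what Punica does), is to zero-pad every adapter to the largest rank so they all share one tensor shape; the padded rows and columns contribute nothing to the product:

```python
import torch

d, max_r = 1024, 32
ranks = [8, 16, 32]                          # heterogeneous adapter ranks

A = torch.zeros(len(ranks), max_r, d)        # padded down projections
B = torch.zeros(len(ranks), d, max_r)        # padded up projections
for i, r in enumerate(ranks):
    A[i, :r, :] = torch.randn(r, d) * 0.02
    B[i, :, :r] = torch.randn(d, r) * 0.02

# Every adapter now looks rank-32 to a batched kernel, at the cost of
# some wasted FLOPs for the lower-rank ones.
```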
busssard, over 1 year ago
There was word going around about GPT-4 just being 8 different GPT-3s in a trenchcoat, finetuned on different topics. If we can now do this with 8 finetuned Vicuna 13Bs for the price of running Vicuna once, this is huge!