I think this is one of the most important possible works for open-source LLMs, really glad y'all pushed this forward!

That's not hyperbole. Why is OpenAI able to charge so little for their APIs? I have heard CEOs of rival mega LLM companies complain that OpenAI's prices would be a loss for them. But I think it's still positive margin: they can charge low prices partly because they've invested more in managing the infra, sure, but most importantly because they get the best utilization out of their existing hardware.

If it costs everyone $X/GPU/hr to serve models, the company with the most throughput wins on price. In a world without finetunes, the most capable model, the one that can zero- or few-shot the most tasks, gets the most usage. Finetuned open models can reach parity with GPT on narrow tasks, but until now, having public providers serve them was expensive: your private finetune is only going to be queried by you, not by everyone, so it's very expensive to serve on a per-token level. With hot-swappable LoRA adapters, that calculus changes and the cost per token can go way down. Super, super exciting!
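A rough back-of-the-envelope sketch of that utilization argument; every number here is made up purely for illustration, not taken from the post:

```python
# Back-of-the-envelope cost-per-token model (illustrative numbers only).
GPU_COST_PER_HR = 2.00                 # assumed $/GPU/hr
TOKENS_PER_SEC_AT_FULL_LOAD = 2_000    # assumed throughput of a saturated GPU

def cost_per_million_tokens(utilization: float) -> float:
    """Cost per 1M generated tokens at a given GPU utilization (0..1)."""
    tokens_per_hr = TOKENS_PER_SEC_AT_FULL_LOAD * 3600 * utilization
    return GPU_COST_PER_HR / tokens_per_hr * 1_000_000

# A GPU dedicated to one private finetune sits mostly idle, while a shared
# base model multiplexing many LoRA adapters stays busy.
print(f"5% utilization (dedicated finetune): ${cost_per_million_tokens(0.05):.2f}/M tokens")
print(f"80% utilization (multiplexed LoRAs): ${cost_per_million_tokens(0.80):.2f}/M tokens")
```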
Awesome work! Here's a recent paper released yesterday, also focused on efficiently serving many LoRAs simultaneously: https://arxiv.org/abs/2311.03285

Really looking forward to these innovations becoming more widespread -- I expect we're very close to a world where training a LoRA on a one-off task like "review every HN post from the last 3 years and flag any of them that contain informed speculation about the architecture of GPT-4" will be easy, cheap and routine.
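For a sense of how lightweight the training side already is, here is a minimal sketch of configuring a LoRA finetune with Hugging Face PEFT; the model name, rank, and target modules are arbitrary examples, not anything from the linked paper:

```python
# Minimal LoRA finetuning setup with Hugging Face PEFT.
# Model name, rank, and target modules are arbitrary example choices.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension of the adapter
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a tiny fraction of weights are trained
```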
This is amazing, and will unlock many possibilities. I just recently read the S-LoRA paper, which is related, but it's even better to have a working (and extremely efficient!) implementation.

How hard would it be to adapt your kernels to work with the new-gen quants like AWQ or EXL2?
Am I correct in understanding that LoRA is basically a way to cheaply create "delta" weights that get applied on top of the main large model to produce a specialization? In other words, would this obviate all the vector DB stuff that people are doing?
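Roughly, yes on the "delta" part. A minimal sketch of what that delta is, using the standard LoRA formulation (shapes and scaling here are the generic recipe, not specific to this project):

```python
# The "delta" in LoRA is a low-rank update to a frozen weight matrix:
#   W_effective = W + (alpha / r) * B @ A
# where A is (r x k) and B is (d x r), with r much smaller than d and k.
import numpy as np

d, k, r, alpha = 4096, 4096, 8, 16

W = np.random.randn(d, k)          # frozen base weight (shared by everyone)
A = np.random.randn(r, k) * 0.01   # trained adapter factor
B = np.zeros((d, r))               # trained adapter factor (initialized to zero)

delta = (alpha / r) * (B @ A)      # the cheap, swappable "delta"
W_effective = W + delta            # what the specialized model effectively computes
```

Because only A and B are stored per finetune, an adapter is tiny compared to the base model, which is what makes hot-swapping many of them over one set of base weights practical.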
Good job!
I noticed that you implemented many CUDA kernels yourselves. Just wondering about your considerations or trade-offs between implementing the kernels in pure CUDA vs. building on a compiler like TVM/Triton.
Super cool!

I'm curious if there's a quality argument to be made: imagine needing to finetune k different classifiers.

Before this work, we could train a single multi-label classifier by pooling the training sets and deploy it as one LoRA.

Now we can have k distinct classifiers and not risk them interfering with one another.

Any sense of when, in realistic scenarios, the quality of k distinct LoRAs would be better?
There was word going around that GPT-4 is just 8 different GPT-3s in a trenchcoat, each finetuned on different topics.
If we can now serve 8 finetuned Vicuna 13B variants for the price of running Vicuna once, this is huge!
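A rough sketch of what serving several finetunes over a single shared base looks like with Hugging Face PEFT; the adapter names and paths are placeholders, and the project in the post uses batched multi-LoRA kernels rather than this naive one-at-a-time swap:

```python
# Naive illustration of many adapters sharing one set of base weights.
# Adapter paths/names are placeholders; a real multi-LoRA server batches
# requests across adapters instead of switching the active one per request.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5")

# Load the first adapter, then attach more by name; the 13B base is loaded once.
model = PeftModel.from_pretrained(base, "adapters/finance", adapter_name="finance")
model.load_adapter("adapters/legal", adapter_name="legal")
model.load_adapter("adapters/support", adapter_name="support")

# Route each request to whichever specialization it needs.
model.set_adapter("legal")
# ... generate with the "legal" finetune, then swap to another adapter ...
```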