I think this is one of the most important possible works for open-source LLMs, really glad y'all pushed this forward!

That's not hyperbole. Why is OpenAI able to charge so little for their APIs? I have heard CEOs of rival mega LLM companies complain that OpenAI's prices would be a loss for them. But I think it's still positive margin: they can charge low prices partly because they've invested more in managing the infra, sure, but most importantly because they get the best utilization out of their existing hardware.

If it costs everyone $X/GPU/hr to serve models, the company with the most throughput wins on price. In a world without finetunes, the most capable model, the one that can zero- or few-shot the most tasks, gets the most usage. Finetuned open models can reach parity with GPT on narrow tasks, but until now, having public providers serve them was expensive: your private finetune is only going to be queried by you, not by everyone, so it's very expensive to serve on a per-token level. With hot-swappable LoRA adapters, that calculus changes and the cost per token can go way down. Super, super exciting!
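A rough back-of-the-envelope sketch of that utilization argument; every number here is made up purely for illustration, not taken from the post:

```python
# Back-of-the-envelope cost-per-token model (illustrative numbers only).
GPU_COST_PER_HR = 2.00                 # assumed $/GPU/hr
TOKENS_PER_SEC_AT_FULL_LOAD = 2_000    # assumed throughput of a saturated GPU

def cost_per_million_tokens(utilization: float) -> float:
    """Cost per 1M generated tokens at a given GPU utilization (0..1)."""
    tokens_per_hr = TOKENS_PER_SEC_AT_FULL_LOAD * 3600 * utilization
    return GPU_COST_PER_HR / tokens_per_hr * 1_000_000

# A GPU dedicated to one private finetune sits mostly idle, while a shared
# base model multiplexing many LoRA adapters stays busy.
print(f"5% utilization (dedicated finetune): ${cost_per_million_tokens(0.05):.2f}/M tokens")
print(f"80% utilization (multiplexed LoRAs): ${cost_per_million_tokens(0.80):.2f}/M tokens")
```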
Awesome work! Here's a recent paper released yesterday, also focused on efficiently serving many LoRAs simultaneously: https://arxiv.org/abs/2311.03285

Really looking forward to these innovations becoming more widespread -- I expect we're very close to a world where training a LoRA on a one-off task like "review every HN post from the last 3 years and flag any of them that contain informed speculation about the architecture of GPT-4" will be easy, cheap and routine.
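For a sense of how lightweight the training side already is, here is a minimal sketch of configuring a LoRA finetune with Hugging Face PEFT; the model name, rank, and target modules are arbitrary examples, not anything from the linked paper:

```python
# Minimal LoRA finetuning setup with Hugging Face PEFT.
# Model name, rank, and target modules are arbitrary example choices.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension of the adapter
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a tiny fraction of weights are trained
```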
This is amazing, and will unlock many possibilities. I just recently read the S-LoRA paper, which is related, but it's even better to have a working (and extremely efficient!) implementation.

How hard would it be to adapt your kernels to work with the new-gen quants like AWQ or EXL2?
Am I correct in understanding that LoRA is basically a way to cheaply create "delta" weights that get applied on top of the main large model to produce a specialization? In other words, would this obviate all the vector DB stuff that people are doing?
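Roughly, yes on the "delta" part. A minimal sketch of what that delta is, using the standard LoRA formulation (shapes and scaling here are the generic recipe, not specific to this project):

```python
# The "delta" in LoRA is a low-rank update to a frozen weight matrix:
#   W_effective = W + (alpha / r) * B @ A
# where A is (r x k) and B is (d x r), with r much smaller than d and k.
import numpy as np

d, k, r, alpha = 4096, 4096, 8, 16

W = np.random.randn(d, k)          # frozen base weight (shared by everyone)
A = np.random.randn(r, k) * 0.01   # trained adapter factor
B = np.zeros((d, r))               # trained adapter factor (initialized to zero)

delta = (alpha / r) * (B @ A)      # the cheap, swappable "delta"
W_effective = W + delta            # what the specialized model effectively computes
```

Because only A and B are stored per finetune, an adapter is tiny compared to the base model, which is what makes hot-swapping many of them over one set of base weights practical.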
Good job!
I noticed that you implemented many CUDA kernels yourselves. Just wondering about your considerations or trade-offs between implementing the kernels in pure CUDA vs. building on a compiler like TVM/Triton.
Super cool!

I'm curious if there's a quality argument to be made: imagine needing to finetune k different classifiers.

Before this work, we could train a single multi-label classifier by pooling the training sets and deploy it as one LoRA.

Now we can have k distinct classifiers and not risk them interfering with one another.

Any sense of when, in realistic scenarios, the quality of k distinct LoRAs would be better?
There was word going around that GPT-4 is just 8 different GPT-3s in a trenchcoat, each finetuned on different topics.
If we can now serve 8 finetuned Vicuna 13B variants for the price of running Vicuna once, this is huge!
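A rough sketch of what serving several finetunes over a single shared base looks like with Hugging Face PEFT; the adapter names and paths are placeholders, and the project in the post uses batched multi-LoRA kernels rather than this naive one-at-a-time swap:

```python
# Naive illustration of many adapters sharing one set of base weights.
# Adapter paths/names are placeholders; a real multi-LoRA server batches
# requests across adapters instead of switching the active one per request.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5")

# Load the first adapter, then attach more by name; the 13B base is loaded once.
model = PeftModel.from_pretrained(base, "adapters/finance", adapter_name="finance")
model.load_adapter("adapters/legal", adapter_name="legal")
model.load_adapter("adapters/support", adapter_name="support")

# Route each request to whichever specialization it needs.
model.set_adapter("legal")
# ... generate with the "legal" finetune, then swap to another adapter ...
```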