There has been a lot of interest on HN in fine-tuning open-source LLMs recently (eg. Anyscale's post at <a href="https://news.ycombinator.com/item?id=37090632">https://news.ycombinator.com/item?id=37090632</a>). I've been playing around with fine-tuning models for a couple of years, and wanted to share some insights and practical code. I’ve condensed what I’ve learned into a small set of notebooks at <a href="https://github.com/OpenPipe/OpenPipe/tree/main/examples/classify-recipes">https://github.com/OpenPipe/OpenPipe/tree/main/examples/clas...</a>, covering labeling data, fine-tuning, running efficient inference, and evaluating costs/performance. The 7B model we train here matches GPT-4’s labels 95% of the time on the test set, and for the 5% of cases where they disagree it’s often because the correct answer is genuinely ambiguous.<p>What is fine-tuning? You can think of it as a more-powerful form of prompting, where instead of writing your instructions in text you actually encode them in the weights of the model itself. You do this by training an existing model on example input/output pairs that demonstrate the task you want your fine-tuned model to learn. Fine-tuning can work with as few as 50 examples but I usually try to get 1000+ if possible.<p>Prompting still has some big advantages over fine-tuning. It's way easier/faster to iterate on your instructions than label data and re-train a model. And operationally it's easier to deploy one big model and just adjust its behavior as necessary vs deploying many small fine-tuned models that will likely each get lower utilization.<p>Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a <i>much</i> smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!<p>For example, classifying the 2M recipes at <a href="https://huggingface.co/datasets/corbt/all-recipes" rel="nofollow noreferrer">https://huggingface.co/datasets/corbt/all-recipes</a> with GPT-4 would cost $23k. Even with GPT-3.5 it would cost over $1k. The model we fine-tuned performs similarly to GPT-4 and costs just $19 to run over the entire dataset.<p>Disclaimer: My brother David and I are working on an open-source product called OpenPipe (<a href="https://github.com/openpipe/openpipe">https://github.com/openpipe/openpipe</a>) to help engineers adopt fine-tuning as simply as possible. But none of the information above depends on our startup. The current post is just about sharing information that we’ve learned about fine-tuning. I hope it’s useful!
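To make the "example input/output pairs" bit concrete, here's roughly what a training file can look like. This is a minimal sketch; the field names and JSONL layout are just one common convention for illustration, not necessarily the exact format the notebooks above use:<p><pre><code># Minimal sketch of assembling a fine-tuning dataset (e.g. from GPT-4-labeled examples).
# The "instruction"/"input"/"output" field names are an assumption for illustration.
import json

examples = [
    {
        "instruction": "Classify whether this recipe is vegetarian. Answer 'yes' or 'no'.",
        "input": "Spaghetti with garlic, olive oil, and chili flakes...",
        "output": "yes",
    },
    # ... ideally 1000+ such pairs, labeled by a stronger model or by hand
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")</code></pre>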
For translation jobs, I've experimented with Llama 2 70B (running on Replicate) vs. GPT-3.5.<p>For about 1000 input tokens (and the resulting 1000 output tokens), to my surprise, GPT-3.5 turbo was <i>100x cheaper</i> than Llama 2.<p>Llama 7B wasn't up to the task, FYI, producing very poor translations.<p>I believe OpenAI priced GPT-3.5 aggressively cheap in order to make it a no-brainer to rely on them rather than on other vendors (even open-source models).<p>I'm curious whether others have gotten different results.
Looks really well executed, nice! I'd shared this idea with a few people. GPT and other LLMs don't allow you to use their output to train competing models, but the implication is that it's fine to use their output to train your own internal alternative models. So you can't sell access to the output as an API, but you can use it to replace your GPT API calls.<p>My other thought to extend this is that you could make it seamless. To start, it simply pipes the user's requests to OpenAI or their existing model, so it'd be a drop-in replacement. Then, every so often, it offers the user: "hey, we think at this point there's enough data that a fine-tune might save you approx $x/month based on your current calls, click the button to start the fine-tune and we'll email you once we have the results" - and then the user gets the email "here are the results; based on that we recommend switching, click here to switch to calling your fine-tuned model". Helicone and the other monitoring platforms could also offer something similar. (Side note: I'm working on an "ai infra handbook" aimed at technical people in software orgs looking to deploy unspecified "AI" features and trying to figure out what to do and what resources they'll need - it's a 20+ page Google Doc; if anyone can help me review what I have so far, please let me know and I'll add you.)<p><i>If</i> it's latency/error/speed competitive, and cheaper, and equivalently accurate, then for anyone doing production-scale LLM API usage it'd make sense to use something like this - either the fine-tune is worse so you keep using the regular API, or the fine-tune has parity plus a cost and/or speed advantage, so you switch. (It wouldn't make sense at prototyping scale, because the additional complexity of the switch wouldn't be worth it unless it could save you four, five, or more figures a year in API costs, I'd think.)
> Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!<p>These comparisons are reductive to the point of being misleading. Even with all the optimizations in the ecosystem, it's not trivial to get a fine-tuned 7B-param model running at an acceptable inference latency. Even if you use a GPU such as an A100 for maximum speed, you then have scalability issues, since A100s are scarce. Also, the "50x cheaper" figure assumes 100% utilization of a GPU, which will never happen in production use cases.<p>Quality-wise, a fine-tuned Llama 2 is not necessarily better than ChatGPT. Fine-tuning requires a high-quality dataset, which is not easy to construct. And in my own experience with fine-tuning Llama 2, it took a frustrating amount of work to get outputs qualitatively on par with just using ChatGPT.<p>The value of the ChatGPT API is more dependable scaling and not having to pay for your own infra.
This looks awesome! Tangential question - do you find GPT function calling to work consistently and without error, or do you get errors when using it? By errors I mostly mean incorrect function signatures/types or missing values... but if you see other unpredictable behavior, that would help too.
Can you clarify the 50x cheaper number? Is this for self-hosting, or if you're hosting on OpenPipe?<p>The pricing on OpenPipe says it's $0.0012 to $0.0016 per 1K tokens for Llama 7B. GPT-3.5 pricing is $0.0015 to $0.002, so not that different.<p>I'm assuming the 50x cost reduction is primarily from self-hosting?
I think the cost calculation here does not reflect the scenario most people actually face. In the real world we don't get millions of inputs queued up, waiting for the GPU to churn through them continuously at 100% utilization. We need to make sure users get their responses in time, and if the inputs are spread out evenly over a month, we have to compare the cost of running a GPU for a month against the cost of using the OpenAI API.
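A rough back-of-envelope version of that comparison (every price and traffic number below is an assumption purely for illustration; plug in your own):<p><pre><code># Dedicated GPU vs. pay-per-token API at a modest, spread-out load.
# All numbers here are assumptions for illustration only.
gpu_hourly_cost = 1.50            # assumed $/hr for a rented A10/A100-class GPU
hours_per_month = 730
gpu_monthly_cost = gpu_hourly_cost * hours_per_month  # ~$1,095/mo, busy or idle

requests_per_month = 100_000
tokens_per_request = 2_000        # prompt + completion combined
api_price_per_1k_tokens = 0.0015  # assumed GPT-3.5-class pricing
api_monthly_cost = (requests_per_month * tokens_per_request / 1000) * api_price_per_1k_tokens

print(f"Dedicated GPU (always on): ${gpu_monthly_cost:,.0f}/mo")
print(f"API (pay per token):       ${api_monthly_cost:,.0f}/mo")
# 100k requests/mo -> ~$300 on the API vs. ~$1,095 for an always-on GPU.
# The GPU only wins once it's kept busy enough (or shared/autoscaled).</code></pre>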
Very nice, thanks!<p>Check out what Matt Shumer put together as well: <a href="https://github.com/mshumer/gpt-llm-trainer">https://github.com/mshumer/gpt-llm-trainer</a>.<p>I have used his trainer for auto distillation of GPT-4 into GPT3.5 fine tunes, but plan to do the same for Llama as well.<p>Cheers!
I am a little confused about whether I need fine-tuning or RAG for my use case.
My use case is this: I have some private data (say, 1000 Word documents), and I want a QA capability over those 1000 documents. What is the best approach? Any help is appreciated.
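From what I understand, RAG for this would look something like the sketch below - embed the documents, pull the most relevant ones per question, and stuff them into the prompt. (The model name is just an example, and the documents/question are placeholders.)<p><pre><code># Rough sketch of the retrieval side of RAG over a fixed set of documents.
from sentence_transformers import SentenceTransformer, util

docs = ["...text of document 1...", "...text of document 2..."]  # the 1000 docs
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
doc_embeddings = embedder.encode(docs, convert_to_tensor=True)

question = "What does our refund policy say about digital goods?"
q_embedding = embedder.encode(question, convert_to_tensor=True)

# Pull the top 3 most relevant documents and build the prompt from them
hits = util.semantic_search(q_embedding, doc_embeddings, top_k=3)[0]
context = "\n\n".join(docs[h["corpus_id"]] for h in hits)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# ...then send `prompt` to whatever LLM (GPT-3.5, a fine-tuned Llama, etc.)</code></pre>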
"You do this by training an existing model on example input/output pairs that demonstrate the task you want your fine-tuned model to learn."<p>Are fine-tuning datasets required to be input/output pairs? Or instead, can the fine-tuning be autoregressive (predict the next token throughout this corpus of unlabeled documents)?
What makes sense to fine-tune and what doesn't?<p>You said 50-1000 examples.<p>Do I fine-tune when I have specific Q/A sets, like from real customers, and want to add the right answers to the model?<p>Do I fine-tune facts, or should I use some kind of lookup?<p>Does it make sense to add code and API docs for the current version of something I want better support for? Like, ChatGPT knows Quarkus 2 but not Quarkus 3.
I have a use case like this: I have a Java project I'm developing, and I've been using phind-codellama-34B-q8 with very satisfying results to aid development.<p>Can I train it further on the project source to let the model "understand" the project context more?
I found this tutorial helpful for getting started with fine-tuning: <a href="https://www.youtube.com/watch?v=74NSDMvYZ9Y">https://www.youtube.com/watch?v=74NSDMvYZ9Y</a><p>This guy used gradient.ai, and he has a Google Colab to try it.
If I paid $20 to fine-tune a model to do X, and you paid $20 to fine-tune a model to do Y, is there a way to merge models, aggregating X and Y training, without training from scratch again?
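If both fine-tunes happen to be LoRA adapters on the same base model, would something like this work? A sketch using peft's adapter merging (the adapter paths are made up, and whether the merged behavior is actually good is another question):<p><pre><code># Sketch: combine two LoRA adapters trained on the same base model without retraining.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # shared base
model = PeftModel.from_pretrained(base, "my-adapter-for-X", adapter_name="x")
model.load_adapter("your-adapter-for-Y", adapter_name="y")

# Average the two adapters into a new one and activate it
model.add_weighted_adapter(adapters=["x", "y"], weights=[0.5, 0.5],
                           adapter_name="xy", combination_type="linear")
model.set_adapter("xy")</code></pre>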
This looks very helpful! I'm just starting out in the ML/LLM space and have an opportunity to work on this at $dayjob, bookmarking as this looks like an excellent resource. Thank you!
Thank you for posting this. I had to go look for your HuggingFace data sets to find the labeled variety you produced with GPT-4, but other than that, everything was easy to follow.
To all those on this panel: what is the most comprehensive way for a newbie to learn to fine-tune these models, with or without GPUs?<p>Are there any well-directed courses available?
This looks very interesting, and it looks like GPT-3.5 is subsidized heavily. Given OpenAI's economies of scale, it's going to be difficult for a corporation to justify spending on its own equipment and administration costs. This is where data security and other non-functional requirements will justify training and running your own models.
Thanks for sharing this! I think you're working on something amazing. I will include your links in my newsletter, I think it will help a lot of folks: <a href="https://www.theprompt.io/" rel="nofollow noreferrer">https://www.theprompt.io/</a>
Fine-tuned low-parameter LLMs are superficially good, but the cracks are obvious if you test them on anything that isn't very strictly tied to the training data. IMO GPT-4 is really the first LLM that's broken out of the fake-intelligence quality most LLMs seem to have, though only by a little.
Thanks! When it comes to choosing where to work with these models, which compute platform do you recommend (assuming locally doesn't really make sense with my resources)?
Colab?
AWS StudioLab?<p>Which is your go to?
A 7B model will work for very specific cases, but it will have a hard time drawing parallels between synonyms, so you'll need to be extremely careful in building your fine-tuning samples.
"to replace GPT-3.5/4"<p>Very inflated statement when it comes to GPT4 since it is a MoE model with 8 separate models each an expert in one area, and you can't replace all 8 models with one model trained for $19.<p>I call BS on this claim. Maybe it matches GPT4 in the narrow domain you fine-tune it for, and if that can be done for $19 then for $19*8 you can take OpenAI out of business. That doesn't add up.
This post made me think of human hierarchies. Line-level ICs are cheap because they are specialized and fine-tuned. LeetCode is a way to roughly measure the degree of fine-tuning, even though it doesn't accurately measure how well the fine-tuning fits the job.<p>As you go up the hierarchy, what you want is higher-quality answers to more and more abstract and general questions.<p>AGI, God, CEOs, and figures like Paul Graham, Elon Musk, etc. all answer, to various degrees, the ultimate abstract question of "What is the meaning of <i>gestures wildly at everything</i>"<p>Cost efficiency and commoditization basically increase "how" capacity at the cost of "why" capacity
Can't we have something for the command line that takes the form of<p><pre><code> cat new_data.txt | finetune model.file > new_model.file</code></pre>
Just curious: would it be possible to add a small network trained on a body of study material, like programming books? Freeze the weights of the existing large network, and have the combined network try to predict the books. The existing network knows the language but not the content; the combined network would be trained on the content, and eventually together they'd score better. These "small" added networks might be specific to a certain topic (e.g. learning Python).
Then these small networks could become modular, essentially creating some kind of LoRA networks for LLMs.<p>Maybe start this way from the ground up, so you can get modular units for health, finance, programming, education, writing assistance, philosophy, ethics, etc.
If the modules can be changed, one might be able to reduce their size.
E.g. pick 2 or 3, chain them, and you have an LLM for a specific area of interest.
(reducing running cost)