There has been a lot of interest on HN in fine-tuning open-source LLMs recently (eg. Anyscale's post at <a href="https://news.ycombinator.com/item?id=37090632">https://news.ycombinator.com/item?id=37090632</a>). I've been playing around with fine-tuning models for a couple of years, and wanted to share some insights and practical code. I’ve condensed what I’ve learned into a small set of notebooks at <a href="https://github.com/OpenPipe/OpenPipe/tree/main/examples/classify-recipes">https://github.com/OpenPipe/OpenPipe/tree/main/examples/clas...</a>, covering labeling data, fine-tuning, running efficient inference, and evaluating costs/performance. The 7B model we train here matches GPT-4’s labels 95% of the time on the test set, and for the 5% of cases where they disagree it’s often because the correct answer is genuinely ambiguous.<p>What is fine-tuning? You can think of it as a more-powerful form of prompting, where instead of writing your instructions in text you actually encode them in the weights of the model itself. You do this by training an existing model on example input/output pairs that demonstrate the task you want your fine-tuned model to learn. Fine-tuning can work with as few as 50 examples but I usually try to get 1000+ if possible.<p>Prompting still has some big advantages over fine-tuning. It's way easier/faster to iterate on your instructions than label data and re-train a model. And operationally it's easier to deploy one big model and just adjust its behavior as necessary vs deploying many small fine-tuned models that will likely each get lower utilization.<p>Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a <i>much</i> smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!<p>For example, classifying the 2M recipes at <a href="https://huggingface.co/datasets/corbt/all-recipes" rel="nofollow noreferrer">https://huggingface.co/datasets/corbt/all-recipes</a> with GPT-4 would cost $23k. Even with GPT-3.5 it would cost over $1k. The model we fine-tuned performs similarly to GPT-4 and costs just $19 to run over the entire dataset.<p>Disclaimer: My brother David and I are working on an open-source product called OpenPipe (<a href="https://github.com/openpipe/openpipe">https://github.com/openpipe/openpipe</a>) to help engineers adopt fine-tuning as simply as possible. But none of the information above depends on our startup. The current post is just about sharing information that we’ve learned about fine-tuning. I hope it’s useful!
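To make the "example input/output pairs" bit concrete, here's roughly what a training file can look like. This is a minimal sketch; the field names and JSONL layout are just one common convention for illustration, not necessarily the exact format the notebooks above use:<p><pre><code># Minimal sketch of assembling a fine-tuning dataset (e.g. from GPT-4-labeled examples).
# The "instruction"/"input"/"output" field names are an assumption for illustration.
import json

examples = [
    {
        "instruction": "Classify whether this recipe is vegetarian. Answer 'yes' or 'no'.",
        "input": "Spaghetti with garlic, olive oil, and chili flakes...",
        "output": "yes",
    },
    # ... ideally 1000+ such pairs, labeled by a stronger model or by hand
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")</code></pre>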
For translation jobs, I've experimented with Llama 2 70B (running on Replicate) vs. GPT-3.5.<p>For about 1000 input tokens (and the resulting 1000 output tokens), to my surprise, GPT-3.5 turbo was <i>100x cheaper</i> than Llama 2.<p>Llama 7B wasn't up to the task, FYI, producing very poor translations.<p>I believe OpenAI priced GPT-3.5 aggressively cheap in order to make it a no-brainer to rely on them rather than on other vendors (even open-source models).<p>I'm curious whether others have gotten different results.
Looks really well executed, nice! I'd shared this idea with a few people. GPT and other LLMs don't allow you to use their output to train competing models, but the implication is that it's fine to use their output to train your own internal alternative models. So you can't sell access to the output as an API, but you can use it to replace your GPT API calls.<p>My other thought to extend this is that you could make it seamless. To start, it simply pipes the user's requests to OpenAI or their existing model, so it'd be a drop-in replacement. Then, every so often, it offers the user: "hey, we think at this point there's enough data that a fine-tune might save you approx $x/month based on your current calls, click the button to start the fine-tune and we'll email you once we have the results" - and then the user gets the email "here are the results; based on that we recommend switching, click here to switch to calling your fine-tuned model". Helicone and the other monitoring platforms could also offer something similar. (Side note: I'm working on an "ai infra handbook" aimed at technical people in software orgs looking to deploy unspecified "AI" features and trying to figure out what to do and what resources they'll need - it's a 20+ page Google Doc; if anyone can help me review what I have so far, please let me know and I'll add you.)<p><i>If</i> it's latency/error/speed competitive, and cheaper, and equivalently accurate, then for anyone doing production-scale LLM API usage it'd make sense to use something like this - either the fine-tune is worse so you keep using the regular API, or the fine-tune has parity plus a cost and/or speed advantage, so you switch. (It wouldn't make sense at prototyping scale, because the additional complexity of the switch wouldn't be worth it unless it could save you four, five, or more figures a year in API costs, I'd think.)
> Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good or better!<p>These comparisons are reductive to the point of being misleading. Even with all the optimizations in the ecosystem, it's not trivial to get a fine-tuned 7B-param model running at an acceptable inference latency. Even if you use a GPU such as an A100 for maximum speed, you then have scalability issues, since A100s are scarce. Also, the "50x cheaper" figure assumes 100% utilization of a GPU, which will never happen in production use cases.<p>Quality-wise, a fine-tuned Llama 2 is not necessarily better than ChatGPT. Fine-tuning requires a high-quality dataset, which is not easy to construct. And in my own experience with fine-tuning Llama 2, it took a frustrating amount of work to get outputs qualitatively on par with just using ChatGPT.<p>The value of the ChatGPT API is more dependable scaling and not having to pay for your own infra.
This looks awesome! Tangential question - do you find GPT function calling to work consistently and without error, or do you get errors when using it? By errors I mostly mean incorrect function signatures/types or missing values... but if you see other unpredictable behavior, that would help too.
Can you clarify the 50x cheaper number? Is this for self-hosting, or if you're hosting on OpenPipe?<p>The pricing on OpenPipe says it's $0.0012 to $0.0016 per 1K tokens for Llama 7B. GPT-3.5 pricing is $0.0015 to $0.002, so not that different.<p>I'm assuming the 50x cost reduction is primarily from self-hosting?
I think the cost calculation here does not reflect the scenario most people actually face. In the real world we don't get millions of inputs queued up, waiting for the GPU to churn through them continuously at 100% utilization. We need to make sure users get their responses in time, and if the inputs are spread out evenly over a month, we have to compare the cost of running a GPU for a month against the cost of using the OpenAI API.
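A rough back-of-envelope version of that comparison (every price and traffic number below is an assumption purely for illustration; plug in your own):<p><pre><code># Dedicated GPU vs. pay-per-token API at a modest, spread-out load.
# All numbers here are assumptions for illustration only.
gpu_hourly_cost = 1.50            # assumed $/hr for a rented A10/A100-class GPU
hours_per_month = 730
gpu_monthly_cost = gpu_hourly_cost * hours_per_month  # ~$1,095/mo, busy or idle

requests_per_month = 100_000
tokens_per_request = 2_000        # prompt + completion combined
api_price_per_1k_tokens = 0.0015  # assumed GPT-3.5-class pricing
api_monthly_cost = (requests_per_month * tokens_per_request / 1000) * api_price_per_1k_tokens

print(f"Dedicated GPU (always on): ${gpu_monthly_cost:,.0f}/mo")
print(f"API (pay per token):       ${api_monthly_cost:,.0f}/mo")
# 100k requests/mo -> ~$300 on the API vs. ~$1,095 for an always-on GPU.
# The GPU only wins once it's kept busy enough (or shared/autoscaled).</code></pre>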
Very nice, thanks!<p>Check out what Matt Shumer put together as well: <a href="https://github.com/mshumer/gpt-llm-trainer">https://github.com/mshumer/gpt-llm-trainer</a>.<p>I have used his trainer for auto distillation of GPT-4 into GPT3.5 fine tunes, but plan to do the same for Llama as well.<p>Cheers!
I am a little confused about whether I need fine-tuning or RAG for my use case.
My use case is this: I have some private data (say, 1000 Word documents), and I want a QA capability over those 1000 documents. What is the best approach? Any help is appreciated.
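From what I understand, RAG for this would look something like the sketch below - embed the documents, pull the most relevant ones per question, and stuff them into the prompt. (The model name is just an example, and the documents/question are placeholders.)<p><pre><code># Rough sketch of the retrieval side of RAG over a fixed set of documents.
from sentence_transformers import SentenceTransformer, util

docs = ["...text of document 1...", "...text of document 2..."]  # the 1000 docs
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
doc_embeddings = embedder.encode(docs, convert_to_tensor=True)

question = "What does our refund policy say about digital goods?"
q_embedding = embedder.encode(question, convert_to_tensor=True)

# Pull the top 3 most relevant documents and build the prompt from them
hits = util.semantic_search(q_embedding, doc_embeddings, top_k=3)[0]
context = "\n\n".join(docs[h["corpus_id"]] for h in hits)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# ...then send `prompt` to whatever LLM (GPT-3.5, a fine-tuned Llama, etc.)</code></pre>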
"You do this by training an existing model on example input/output pairs that demonstrate the task you want your fine-tuned model to learn."<p>Are fine-tuning datasets required to be input/output pairs? Or instead, can the fine-tuning be autoregressive (predict the next token throughout this corpus of unlabeled documents)?
What makes sense to fine-tune and what doesn't?<p>You said 50-1000 examples.<p>Do I fine-tune when I have specific Q/A sets, like from real customers, and want to add the right answers to the model?<p>Do I fine-tune facts, or should I use some kind of lookup?<p>Does it make sense to add code and API docs for the current version of something I want better support for? Like, ChatGPT knows Quarkus 2 but not Quarkus 3.
I have a use case like this: I have a Java project I'm developing, and I've been using phind-codellama-34B-q8 with very satisfying results to aid development.<p>Can I train it further on the project source to let the model "understand" the project context more?
I found this tutorial helpful for getting started with fine-tuning: <a href="https://www.youtube.com/watch?v=74NSDMvYZ9Y">https://www.youtube.com/watch?v=74NSDMvYZ9Y</a><p>This guy used gradient.ai, and he has a Google Colab to try it.
If I paid $20 to fine-tune a model to do X, and you paid $20 to fine-tune a model to do Y, is there a way to merge models, aggregating X and Y training, without training from scratch again?
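If both fine-tunes happen to be LoRA adapters on the same base model, would something like this work? A sketch using peft's adapter merging (the adapter paths are made up, and whether the merged behavior is actually good is another question):<p><pre><code># Sketch: combine two LoRA adapters trained on the same base model without retraining.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # shared base
model = PeftModel.from_pretrained(base, "my-adapter-for-X", adapter_name="x")
model.load_adapter("your-adapter-for-Y", adapter_name="y")

# Average the two adapters into a new one and activate it
model.add_weighted_adapter(adapters=["x", "y"], weights=[0.5, 0.5],
                           adapter_name="xy", combination_type="linear")
model.set_adapter("xy")</code></pre>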
This looks very helpful! I'm just starting out in the ML/LLM space and have an opportunity to work on this at $dayjob, bookmarking as this looks like an excellent resource. Thank you!
Thank you for posting this. I had to go look for your HuggingFace data sets to find the labeled variety you produced with GPT-4, but other than that, everything was easy to follow.
To all those on this panel: what is the most comprehensive way for a newbie to learn to fine-tune these models, with or without GPUs?<p>Are there any well-directed courses available?
This looks very interesting, and it looks like GPT-3.5 is subsidized heavily. Given OpenAI's economies of scale, it's going to be difficult for a corporation to justify spending on its own equipment and administration costs. This is where data security and other non-functional requirements will justify training and running your own models.
Thanks for sharing this! I think you're working on something amazing. I will include your links in my newsletter, I think it will help a lot of folks: <a href="https://www.theprompt.io/" rel="nofollow noreferrer">https://www.theprompt.io/</a>
Fine-tuned low-parameter LLMs are superficially good, but the cracks are obvious if you test them on anything that isn't very strictly tied to the training data. IMO GPT-4 is really the first LLM that's broken out of the fake-intelligence quality most LLMs seem to have, though only by a little.
Thanks! When it comes to choosing where to work with these models, which compute platform do you recommend (assuming locally doesn't really make sense with my resources)?
Colab?
AWS StudioLab?<p>Which is your go to?
A 7B model will work for very specific cases, but it will have a hard time drawing parallels between synonyms, so you'll need to be extremely careful in building your fine-tuning samples.
"to replace GPT-3.5/4"<p>Very inflated statement when it comes to GPT4 since it is a MoE model with 8 separate models each an expert in one area, and you can't replace all 8 models with one model trained for $19.<p>I call BS on this claim. Maybe it matches GPT4 in the narrow domain you fine-tune it for, and if that can be done for $19 then for $19*8 you can take OpenAI out of business. That doesn't add up.
This post made me think of human hierarchies. Line-level ICs are cheap because they are specialized and fine-tuned. LeetCode is a way to roughly measure the degree of fine-tuning, even though it doesn't accurately measure how well the fine-tuning fits the job.<p>As you go up the hierarchy, what you want is higher-quality answers to more and more abstract and general questions.<p>AGI, God, CEOs, and figures like Paul Graham, Elon Musk, etc. all answer, to various degrees, the ultimate abstract question of "What is the meaning of <i>gestures wildly at everything</i>"<p>Cost efficiency and commoditization basically increase "how" capacity at the cost of "why" capacity
Can't we have something for the command line that takes the form of<p><pre><code> cat new_data.txt | finetune model.file > new_model.file</code></pre>
Just curious: would it be possible to add a small network trained on a body of study material, like programming books? Freeze the weights of the existing large network, and have the combined network try to predict the books. The existing network knows the language but not the content; the combined network would be trained on the content, and eventually together they'd score better. These "small" added networks might be specific to a certain topic (e.g. learning Python).
Then these small networks could become modular, essentially creating some kind of LoRA networks for LLMs.<p>Maybe start this way from the ground up, so you can get modular units for health, finance, programming, education, writing assistance, philosophy, ethics, etc.
If the modules can be changed, one might be able to reduce their size.
E.g. pick 2 or 3, chain them, and you have an LLM for a specific area of interest.
(reducing running cost)