I agree with the summary. When I first wanted to tackle a hard problem I thought to reach for fine-tuning with lots of input/output pairs, but it wasn't needed.<p>Past few-shot prompting and RAG, you can get past context window limits by breaking a single request into many, each with its own specific context, and then rolling the results up.<p>Claude 2 has a large context window, but if you are actually filling that much of it with prompt examples to cover tricky edge cases, I've found it's better to break things down into multiple steps.<p>And if you can break things up that way, and cost isn't an issue, GPT-4 with lots of few-shot examples and chain-of-thought seems to give me the best results.<p>At least, that's what I found writing a code translator for a language the LLM didn't know. I wrote it up in more detail here:<p><a href="https://earthly.dev/blog/build-transpose/" rel="nofollow noreferrer">https://earthly.dev/blog/build-transpose/</a>
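<p>Roughly, the split-and-roll-up pattern looks like this (just a sketch; the function names are mine, and the chunking and roll-up steps are very domain specific):<p><pre><code>import openai

def translate_chunk(chunk: str, examples: str) -> str:
    # One focused request per chunk, with only the few-shot
    # examples relevant to that chunk.
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You translate code between languages."},
            {"role": "user", "content": f"{examples}\n\nTranslate:\n{chunk}"},
        ],
    )
    return resp.choices[0].message.content

def translate_file(chunks, examples_for):
    # Map: many small requests, each well under the context limit...
    parts = [translate_chunk(c, examples_for(c)) for c in chunks]
    # ...then reduce: roll the pieces back up into one result.
    return "\n".join(parts)
</code></pre>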
I think it's more nuanced. This article, for example, contains results that suggest otherwise if you want to increase quality (a major concern when putting things in production):<p><a href="https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehensive-case-study-for-tailoring-models-to-unique-applications" rel="nofollow noreferrer">https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...</a>
Great post!<p>We've got some additional resources for folks looking to better understand Retrieval Augmented Generation (RAG) and see it in action. In this example we demonstrate a potentially very dangerous hallucination (it has to do with driving) and how to fix it using RAG: <a href="https://www.pinecone.io/learn/retrieval-augmented-generation/" rel="nofollow noreferrer">https://www.pinecone.io/learn/retrieval-augmented-generation...</a><p>If you're curious to try the difference between an LLM without domain-specific context and one using RAG, you can try our live demo here: <a href="https://pinecone-vercel-starter.vercel.app/" rel="nofollow noreferrer">https://pinecone-vercel-starter.vercel.app/</a><p>And if you'd like to fork the demo chatbot above and make your own tweaks (for example, swap in your own company logo and extend it for your purposes), you can find our Vercel template here: <a href="https://github.com/pinecone-io/pinecone-vercel-starter">https://github.com/pinecone-io/pinecone-vercel-starter</a><p>In our opinion, RAG is an effective technique partly because you don't need to be a machine learning expert to implement it in your generative AI applications.
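<p>For a sense of how little machinery is involved, here is a bare-bones retrieval loop (a sketch: the index name, metadata field, and credentials are placeholders, and it assumes an already-populated index):<p><pre><code>import openai
import pinecone

pinecone.init(api_key="...", environment="...")
index = pinecone.Index("my-docs")  # placeholder index name

def answer(question: str) -> str:
    # Embed the question and fetch the closest document chunks.
    emb = openai.Embedding.create(
        model="text-embedding-ada-002", input=[question]
    )["data"][0]["embedding"]
    hits = index.query(vector=emb, top_k=3, include_metadata=True)
    context = "\n".join(m["metadata"]["text"] for m in hits["matches"])
    # Answer grounded in the retrieved context only.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided "
             "context. If the answer isn't there, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQ: {question}"},
        ],
    )
    return resp.choices[0].message.content
</code></pre>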
Completely agree. Inside the hacker filter bubble you may get the impression that fine-tuning is super important and powerful, but in reality, for most use cases, it offers little advantage for _a lot_ of effort.<p>The future likely looks like >90% of developers relying on the best frontier models and using the context window to specialise, and 10% of specialised developers, with the expertise, budget, and time, customising LLMs for very specific use cases where there is no other option.
I tried fine-tuning the 13B LLaMA model to insert the knowledge from my own documents, but my experiments weren't successful. My conclusion is that you need billions of tokens to make an LLM reason over your own dataset. And even if it did acquire those reasoning skills, it probably wouldn't beat GPT-4. And we're not even getting into the costs of self-hosting these LLMs. So why bother? Just use an API from the companies with powerful models and tweak it to fit your own needs.
The need to train or tune a model, in this case an LLM, usually comes down to requirements like grounding or running on the edge or offline, and it will vary by use case.<p>Take log file analysis as an example: training a model may increase its ability to handle outliers by writing regexes that get placed in the indexing pipeline. Here, tuning a prompt isn't going to help much, because the foundation model might have no idea how to parse a given field in a log line no matter how you put it to it in the prompt.<p>Tuning models also serves other purposes, such as removing guardrails introduced by others in the training data, and customizing the self-referential material the model "knows", such as its name, its creators, and the "personality" it presents to the end user.
LLMs are a different beast in the ML world.<p>PaLM fine-tuned for medicine and PaLM fine-tuned for math (Minerva) both perform a good deal worse than GPT-4.<p>A fine-tuned smaller model is by no means guaranteed to beat a larger, more general one (though you may get acceptable performance).<p>And the necessity of fine-tuning itself is frequently called into question with LLMs:<p><a href="https://huggingface.co/papers/2308.00304" rel="nofollow noreferrer">https://huggingface.co/papers/2308.00304</a><p><a href="https://huggingface.co/papers/2308.07921" rel="nofollow noreferrer">https://huggingface.co/papers/2308.07921</a><p><a href="https://arxiv.org/abs/2211.09066" rel="nofollow noreferrer">https://arxiv.org/abs/2211.09066</a>
RAG sucks. Microsoft is the force behind it because they don't allow training on their ChatGPT models.<p>Fine-tuning, even just LoRA on open-source models, is nearly always better than these other approaches.
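<p>For reference, a LoRA run needs very little ceremony with HuggingFace PEFT (a sketch; the model choice and hyperparameters are illustrative, not recommendations):<p><pre><code>from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base
# ...then train `model` with your usual Trainer loop.
</code></pre>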
We're using LLMs (OpenAI) to generate SQL queries to search customer data, and the current approach using the chat API frequently generates queries with the wrong record/column names. I'm exploring fine-tuning to improve accuracy on a per-customer basis, training on each customer's own data. Isn't that a good use case?
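<p>For concreteness, the kind of thing we're doing looks roughly like this (schema, names, and prompt are invented for illustration):<p><pre><code>import openai

SCHEMA = """customers(id, full_name, signup_date)
orders(id, customer_id, total_cents, created_at)"""

def to_sql(question: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # reduce variance in generated queries
        messages=[
            {"role": "system", "content": "Write a single SQL query. "
             f"Use only these tables and columns:\n{SCHEMA}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
</code></pre>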
One of the main problems with LLMs today is that they are gigantic, because they ship the entire compressed memory of their training data with them.<p>Future LLMs are likely to be much smaller, with long-term memory/training knowledge kept outside, as well as a working memory (à la the RAG approach).
- After seeing how merely quantizing a model can make it go berserk, I have very little confidence I could fine-tune an LLM and expect similar performance benchmarks.<p>- A RAG-empowered LLM can tell you where the knowledge used to answer a question came from.
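<p>On that second point, with LangChain for example, provenance is one flag away (a sketch; it assumes "llm" and "vectorstore" are already set up, as in the article):<p><pre><code>from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,  # surface where each answer came from
)
result = qa({"query": "What is our refund policy?"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata)  # e.g. the file or URL behind each chunk
</code></pre>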
Fine-tuning should never be the first step; it's slow, expensive, and nondeterministic. Until you are maxing out the context window, you can just keep layering more information into the prompt.
I wonder if the approach will change when OpenAI releases fine-tuning for the chat models; it depends on how well it works. If they find some way to significantly decrease the amount of training data needed, or someone creates a tool to easily generate lots of training examples (using OpenAI), the advice might change again.<p>What also matters is the size of the context window and how effectively models can follow large amounts of instructions, so new models might shift the advice yet again.
I love how clearly this article is written. The author uses a table with the columns "Initial Motivation for Fine-tuning" and "Why a Base LLM is Sufficient": exactly what you need to learn why "You probably don’t need to fine-tune LLMs", which is precisely the title. Using text to convey something that can be expressed as a table or a chart is just as bad as trying to do math without math notation. Stellar work!
Is there a method for this to be "augmented" and not "replacement"? E.g., in the example from the blog post, "retriever=vectorstore.as_retriever()" would, I believe, return something like "I don't know" if the content is not in the vectorstore.<p>A human might say something like, "I'm not an expert, but X", and I think being able to fall back to the underlying LLM would be similarly useful.
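<p>One way I can imagine doing it (a sketch; the 0.75 threshold is purely illustrative, and score semantics differ by vectorstore, with some returning distances rather than similarities):<p><pre><code>docs_and_scores = vectorstore.similarity_search_with_score(question, k=3)
relevant = [d for d, score in docs_and_scores if score > 0.75]

if relevant:
    context = "\n".join(d.page_content for d in relevant)
    answer = llm.predict(f"Context:\n{context}\n\nQuestion: {question}")
else:
    # Nothing relevant retrieved: fall back to the model's own
    # knowledge, but flag it, like "I'm not an expert, but..."
    answer = "I'm not an expert, but: " + llm.predict(question)
</code></pre>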
For most use cases this article is right on target.<p>I have been self-hosting a model with a 16K context, and there is a lot you can do with 16K or more.<p>There are also great use cases for fine-tuning. For example, if you are writing a chatbot for your company’s products, it might make sense to fine-tune on product data and then RAG in specific customer data when setting up a chat session.
Fine-tuning is such a dangerous phrase because it sounds perfect.<p>“We’ll just <i>fine-tune</i> on our (we think valuable and special) data”<p>Fine-tuning doesn’t enhance the model with new “knowledge”; it teaches it a new, narrowly defined task.<p>One other “cost” to consider: fine-tuning a third-party model means that if the foundation model changes or goes away, that effort and cost has to be repeated.
It is worth noting that all the current models offered by the OpenAI API are already fine-tuned, with supervised learning and reinforcement learning, to follow instructions and to follow them in a certain way.<p>OpenAI removed the base GPT-3.5 model a while ago and never made a base model available for GPT-4.
What are people even doing with fine-tuned LLMs? I can never think of a task the base model can't do natively, or one I'd have enough data to fine-tune for. Just curious.
Besides the three mentioned in the article (stringent accuracy requirements, fast edge inference, an involved style-transfer task), are there other good reasons to fine-tune?
Fine-tuning/training is useful if you want to start using the LLM as a decoder for some sort of multi-modal embedding.<p>If it's just text corpora, it's probably not worth it.