If you look at the source [1] you can see how they solved their "what are the doctors going to do?" problem. It is literally included in one of the prompts now :-)

> Users tend to ask broad, vague questions of the document in order to test that the system is working. We want those queries to work well. For example, a user would ask "what are the doctors going to do?" of a document that is about a junior doctors' strike. Take this into account when generating the questions - in particular, refer to noun phrases by less specific descriptions, so for example instead of "junior doctors", say "doctors" in your questions.

[1]: https://github.com/helixml/helix/blob/main/api/pkg/dataprep/qapairs/qapair_config.yaml
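For illustration, a question-generation prompt along those lines might be assembled like this (a hedged sketch, not the actual Helix prompt; the real wording lives in qapair_config.yaml, and the function and constant names here are made up):

```python
# Sketch of assembling a QA-generation prompt that nudges the model
# toward broad, less specific question phrasing (illustrative only).
SYSTEM_PROMPT = """You generate question/answer pairs from a document.
Users tend to ask broad, vague questions to test that the system works,
so prefer less specific noun phrases: ask about "doctors" rather than
"junior doctors"."""

def build_qa_prompt(document_text: str, num_pairs: int = 10) -> list[dict]:
    """Build a chat-style prompt asking an LLM for QA pairs as JSON."""
    user_prompt = (
        f"Generate {num_pairs} question/answer pairs as a JSON list of "
        f'{{"question": ..., "answer": ...}} objects for this document:\n\n'
        f"{document_text}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
```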
Unsloth's Colab notebooks for fine-tuning Mistral-7B are super easy to use and run fine on just about any Colab instance:

https://github.com/unslothai/unsloth

It's my default now for experimenting and basic training. If I want to get into the weeds, I use axolotl, but 9 times out of 10 it's not really necessary.
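For reference, a minimal Unsloth QLoRA run looks roughly like the sketch below (based on their example notebooks; exact argument names vary between unsloth/trl versions, and `qa_pairs.jsonl` with a `text` field is a hypothetical dataset file):

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit Mistral-7B base model with Unsloth's patched loader.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Hypothetical dataset: one {"text": "..."} training example per line.
dataset = load_dataset("json", data_files="qa_pairs.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```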
I've done fine-tuning too, but the reasons they mention in "Why not just use RAG?" aren't very good.

People way underestimate what RAG can do, even if in general people don't talk about the right things. For example, LlamaIndex spends a lot of time talking about various extractors, which is the easy part. The hard part is deciding what you are actually searching for given a chat context.

RAG is a horrible hack (and the more you understand about it, the more it seems so!) but it does work.

I (and I'm sure everyone else) am experimenting with surgery on an LLM so it takes a vector representation of the docs directly alongside a text input, so you don't have to do the lossy doc vector -> text -> LLM context -> vector round trip. Not sure why no one has shipped this yet though!
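On the "deciding what you are actually searching for" point, the usual workaround is a query-rewriting step before retrieval. A rough sketch, with hypothetical `llm_complete` and `vector_store` helpers standing in for whatever client and index you use:

```python
def rewrite_query(chat_history: list[dict], llm_complete) -> str:
    """Turn the full chat context into a standalone search query, since the
    user's last message alone is often under-specified ("what about the
    second one?")."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in chat_history)
    prompt = (
        "Rewrite the user's latest request as a standalone search query, "
        "resolving pronouns and references from the conversation:\n\n"
        f"{transcript}\n\nSearch query:"
    )
    return llm_complete(prompt).strip()

def answer_with_rag(chat_history, llm_complete, vector_store, k=5):
    query = rewrite_query(chat_history, llm_complete)
    docs = vector_store.search(query, k=k)        # lossy: vectors -> text
    context = "\n\n".join(d.text for d in docs)   # -> back into the prompt
    return llm_complete(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```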
Glad to see that more people outside the big AI labs are figuring out how to do fine-tuning. Some open-source LLM authors also seem to have figured it out.

I think many users get put off because just pushing a button doesn't work, and the whole thing seems like a black box that you don't know how to fix when it breaks.

It turns out that fine-tuning can be debugged, but the methods aren't well documented (yet), e.g. generating Q/A pairs, oversampling them, etc.

When you get it to work, it's powerful - new abilities emerge beyond memorization.

Just like how llama2/claude2/gpt4 learned reasoning by memorizing sentences from Reddit posts :P

Also, I don't get the comparison of RAG vs. fine-tuning in articles like this - why not do both? RAG is easy to set up - it's push-button. Just do it on all models (including fine-tuned models).
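The oversampling trick mentioned above is simple in practice: repeat each generated Q/A pair several times, ideally with paraphrased questions, so a short fine-tune actually learns the facts. A rough sketch, assuming a hypothetical `paraphrase()` helper backed by an LLM; the resulting JSONL is the kind of file a trainer like the Unsloth sketch above would consume:

```python
import json
import random

def oversample_qa_pairs(qa_pairs, copies=5, paraphrase=None):
    """Duplicate each Q/A pair `copies` times, optionally paraphrasing the
    question, so a short LoRA run sees each fact more than once."""
    samples = []
    for qa in qa_pairs:
        for _ in range(copies):
            question = paraphrase(qa["question"]) if paraphrase else qa["question"]
            samples.append({
                "text": f"### Question:\n{question}\n\n### Answer:\n{qa['answer']}"
            })
    random.shuffle(samples)
    return samples

def write_jsonl(samples, path="qa_pairs.jsonl"):
    """Write one JSON object per line, ready for a standard SFT loader."""
    with open(path, "w") as f:
        for s in samples:
            f.write(json.dumps(s) + "\n")
```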
For Helix, I notice that GitHub is listed as a data source, but there's nothing in the docs about this. I'd really love to see what a model trained on my commonly used git repos (which are generally newer than The Stack etc.), and in particular their commit history, could do. Ideally this would make it easier for code completion to have the historical context, as well as the current code, to play with when deciding what to write next.

I often wonder how you'd go about organizing training data for a full historic GitHub repo in a way that makes sense for training (or RAG). The vast majority of the data is previous changes to the repo. I think this would generally mean that it would outweigh the current information and cause problems (i.e. old method names from before refactoring, etc.).

Also, perhaps being able to expand that out to doing the same thing for a bunch of consumers of the library that I'm maintaining would be neat.

Sprinkle in the PR and issue history, docs website, API docs, and Discord history and I think you'd have a helluva model.
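One possible way to start organizing that history (purely a sketch, using plain `git log`; how aggressively to down-weight stale pre-refactor code is left as a knob):

```python
import subprocess

def commit_training_samples(repo_path: str, max_commits: int = 500):
    """Turn recent commit messages and diffs into text samples, newest first,
    so stale pre-refactor code can be down-weighted or dropped."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "-n", str(max_commits), "-p",
         "--format=%x00%H%n%s%n%b"],
        capture_output=True, text=True, check=True,
    ).stdout
    samples = []
    # Each NUL-delimited chunk is one commit: hash, subject, body, then its diff.
    for i, entry in enumerate(e for e in log.split("\x00") if e.strip()):
        samples.append({
            "text": entry.strip(),
            "weight": 1.0 / (1 + i),  # crude recency weighting: newest commits count most
        })
    return samples
```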
Not in love with axolotl, but I appreciate the advantages. This is an interesting approach, but you can also fine-tune easily on providers who wrap axolotl, like Replicate [1] or Modal [2], or, if you want to run the infra yourself, LLM Engine [3].

My only gripe with Helix would be that it's smaller than the above and my org would be peeved about data security. The ability to self-host is cool, but too much can go wrong too quickly with plain Docker ML. Would love to see, for example, a `cog` version of the images that we can deploy distributed with more confidence/bravado.

[1]: https://replicate.com/mistralai/mistral-7b-instruct-v0.2
[2]: https://modal.com
[3]: https://llm-engine.scale.com/
Does fine-tuning it on a set of docs in your "knowledge base" help generalize it so it can answer questions pertaining to new documents that come in (with a "similar" style/structure but different content/facts)?
Interesting article but, IMHO, completely impractical. Teaching the model about specific content is exactly what you should not do. What you should do is teach the model how to retrieve the information effectively, even if it is unsuccessful on the first try.
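In practice that usually means letting the model, or a scoring step, decide whether the first retrieval was good enough and retry with a reformulated query. A toy sketch with hypothetical `search()` and `llm_complete()` helpers:

```python
def retrieve_with_retry(question, search, llm_complete, max_tries=3, min_score=0.5):
    """Retry retrieval with an LLM-reformulated query when the best hit is weak."""
    query = question
    hits = []
    for _ in range(max_tries):
        hits = search(query, k=5)              # returns (score, text) pairs, best first
        if hits and hits[0][0] >= min_score:
            return hits
        query = llm_complete(
            f"The query '{query}' found nothing useful for the question "
            f"'{question}'. Suggest a better search query:"
        ).strip()
    return hits
```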
The tl;dr seems to be: tell an LLM to create pairs of questions and answers based on a document, and fine-tune on that data. Does the model answer questions from the article that weren't generated in advance?
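One way to check that (a sketch, not something from the article): hold out a slice of the generated questions, fine-tune only on the rest, and see whether the model can still answer the held-out ones. The `generate` and `judge` callables below are hypothetical stand-ins for the fine-tuned model and whatever scoring you trust:

```python
import random

def split_holdout(qa_pairs, holdout_frac=0.2, seed=0):
    """Hold out a fraction of generated Q/A pairs for evaluation only."""
    pairs = qa_pairs[:]
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - holdout_frac))
    return pairs[:cut], pairs[cut:]          # (train, held-out eval)

def eval_holdout(generate, held_out, judge):
    """Ask the fine-tuned model each held-out question and score the answers
    with a judge function (string match, embedding similarity, or an LLM)."""
    correct = 0
    for qa in held_out:
        answer = generate(qa["question"])
        correct += judge(answer, qa["answer"])
    return correct / len(held_out)
```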
Fantastic writeup -- thank you so much for sharing your lessons learned along the way! Very valuable resource, and I'll be checking out your product!
I always thought that fine-tuning is more about getting a style rather than memorizing information word for word, or at least the facts. What are the next steps to ensure that it doesn't start pulling info from the base knowledge and references the docs instead?
How long does it usually take to train? 10-15 minutes on what doc size?