If you look at the source [1] you can see how they solved their "what are the doctors going to do?" problem. It is literally included in one of the prompts now :-)

> Users tend to ask broad, vague questions of the document in order to test that the system is working. We want those queries to work well. For example, a user would ask "what are the doctors going to do?" of a document that is about a junior doctors' strike. Take this into account when generating the questions - in particular, refer to noun phrases by less specific descriptions, so for example instead of "junior doctors", say "doctors" in your questions.

[1]: https://github.com/helixml/helix/blob/main/api/pkg/dataprep/qapairs/qapair_config.yaml
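For illustration, a question-generation prompt along those lines might be assembled like this (a hedged sketch, not the actual Helix prompt; the real wording lives in qapair_config.yaml, and the function and constant names here are made up):

```python
# Sketch of assembling a QA-generation prompt that nudges the model
# toward broad, less specific question phrasing (illustrative only).
SYSTEM_PROMPT = """You generate question/answer pairs from a document.
Users tend to ask broad, vague questions to test that the system works,
so prefer less specific noun phrases: ask about "doctors" rather than
"junior doctors"."""

def build_qa_prompt(document_text: str, num_pairs: int = 10) -> list[dict]:
    """Build a chat-style prompt asking an LLM for QA pairs as JSON."""
    user_prompt = (
        f"Generate {num_pairs} question/answer pairs as a JSON list of "
        f'{{"question": ..., "answer": ...}} objects for this document:\n\n'
        f"{document_text}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
```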
Unsloth's Colab notebooks for fine-tuning Mistral-7B are super easy to use and run fine on just about any Colab instance:

https://github.com/unslothai/unsloth

It's my default now for experimenting and basic training. If I want to get into the weeds, I use axolotl, but 9 times out of 10 it's not really necessary.
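For reference, a minimal Unsloth QLoRA run looks roughly like the sketch below (based on their example notebooks; exact argument names vary between unsloth/trl versions, and `qa_pairs.jsonl` with a `text` field is a hypothetical dataset file):

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit Mistral-7B base model with Unsloth's patched loader.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Hypothetical dataset: one {"text": "..."} training example per line.
dataset = load_dataset("json", data_files="qa_pairs.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```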
I've done fine-tuning too, but the reasons they mention in "Why not just use RAG?" aren't very good.

People way underestimate what RAG can do, even if in general people don't talk about the right things. For example, LlamaIndex spends a lot of time talking about various extractors, which is the easy part. The hard part is deciding what you are actually searching for given a chat context.

RAG is a horrible hack (and the more you understand about it, the more it seems so!) but it does work.

I (and I'm sure everyone else) am experimenting with surgery on an LLM so it takes a vector representation of the docs directly alongside a text input, so you don't have to do the lossy doc vector -> text -> LLM context -> vector round trip. Not sure why no one has shipped this yet though!
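On the "deciding what you are actually searching for" point, the usual workaround is a query-rewriting step before retrieval. A rough sketch, with hypothetical `llm_complete` and `vector_store` helpers standing in for whatever client and index you use:

```python
def rewrite_query(chat_history: list[dict], llm_complete) -> str:
    """Turn the full chat context into a standalone search query, since the
    user's last message alone is often under-specified ("what about the
    second one?")."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in chat_history)
    prompt = (
        "Rewrite the user's latest request as a standalone search query, "
        "resolving pronouns and references from the conversation:\n\n"
        f"{transcript}\n\nSearch query:"
    )
    return llm_complete(prompt).strip()

def answer_with_rag(chat_history, llm_complete, vector_store, k=5):
    query = rewrite_query(chat_history, llm_complete)
    docs = vector_store.search(query, k=k)        # lossy: vectors -> text
    context = "\n\n".join(d.text for d in docs)   # -> back into the prompt
    return llm_complete(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```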
Glad to see that more people outside the big AI labs are figuring out how to do fine-tuning. Some open-source LLM authors also seem to have figured it out.

I think many users get put off because just pushing a button doesn't work, and the whole thing seems like a black box that you don't know how to fix when it breaks.

It turns out that fine-tuning can be debugged, but the methods aren't well documented (yet), e.g. generating Q/A pairs, oversampling them, etc.

When you get it to work, it's powerful - new abilities emerge beyond memorization.

Just like how llama2/claude2/gpt4 learned reasoning by memorizing sentences from Reddit posts :P

Also, I don't get the comparison of RAG vs. fine-tuning in articles like this - why not do both? RAG is easy to set up - it's push-button. Just do it on all models (including fine-tuned models).
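The oversampling trick mentioned above is simple in practice: repeat each generated Q/A pair several times, ideally with paraphrased questions, so a short fine-tune actually learns the facts. A rough sketch, assuming a hypothetical `paraphrase()` helper backed by an LLM; the resulting JSONL is the kind of file a trainer like the Unsloth sketch above would consume:

```python
import json
import random

def oversample_qa_pairs(qa_pairs, copies=5, paraphrase=None):
    """Duplicate each Q/A pair `copies` times, optionally paraphrasing the
    question, so a short LoRA run sees each fact more than once."""
    samples = []
    for qa in qa_pairs:
        for _ in range(copies):
            question = paraphrase(qa["question"]) if paraphrase else qa["question"]
            samples.append({
                "text": f"### Question:\n{question}\n\n### Answer:\n{qa['answer']}"
            })
    random.shuffle(samples)
    return samples

def write_jsonl(samples, path="qa_pairs.jsonl"):
    """Write one JSON object per line, ready for a standard SFT loader."""
    with open(path, "w") as f:
        for s in samples:
            f.write(json.dumps(s) + "\n")
```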
For Helix, I notice that GitHub is listed as a data source, but there's nothing in the docs about this. I'd really love to see what a model trained on my commonly used git repos (which are generally newer than The Stack etc.), and in particular their commit history, could do. Ideally this would make it easier for code completion to have the historical context, as well as the current code, to play with when deciding what to write next.

I often wonder how you'd go about organizing training data for a full historic GitHub repo in a way that makes sense for training (or RAG). The vast majority of the data is previous changes to the repo. I think this would generally mean that it would outweigh the current information and cause problems (i.e. old method names from before refactoring, etc.).

Also, perhaps being able to expand that out to doing the same thing for a bunch of consumers of the library that I'm maintaining would be neat.

Sprinkle in the PR and issue history, docs website, API docs, and Discord history and I think you'd have a helluva model.
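One possible way to start organizing that history (purely a sketch, using plain `git log`; how aggressively to down-weight stale pre-refactor code is left as a knob):

```python
import subprocess

def commit_training_samples(repo_path: str, max_commits: int = 500):
    """Turn recent commit messages and diffs into text samples, newest first,
    so stale pre-refactor code can be down-weighted or dropped."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "-n", str(max_commits), "-p",
         "--format=%x00%H%n%s%n%b"],
        capture_output=True, text=True, check=True,
    ).stdout
    samples = []
    # Each NUL-delimited chunk is one commit: hash, subject, body, then its diff.
    for i, entry in enumerate(e for e in log.split("\x00") if e.strip()):
        samples.append({
            "text": entry.strip(),
            "weight": 1.0 / (1 + i),  # crude recency weighting: newest commits count most
        })
    return samples
```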
Not in love with axolotl, but I appreciate the advantages. This is an interesting approach, but you can also fine-tune easily on providers who wrap axolotl, like Replicate [1] or Modal [2], or, if you want to run the infra yourself, LLM Engine [3].

My only gripe with Helix would be that it's smaller than the above and my org would be peeved about data security. The ability to self-host is cool, but too much can go wrong too quickly with plain Docker ML. Would love to see, for example, a `cog` version of the images that we can deploy distributed with more confidence/bravado.

[1]: https://replicate.com/mistralai/mistral-7b-instruct-v0.2
[2]: https://modal.com
[3]: https://llm-engine.scale.com/
Does fine-tuning it on a set of docs in your "knowledge base" help generalize it so it can answer questions pertaining to new documents that come in (with a "similar" style/structure but different content/facts)?
Interesting article but, IMHO, completely impractical. Teaching the model about specific content is exactly what you should not do. What you should do is teach the model how to retrieve the information effectively, even if it is unsuccessful on the first try.
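In practice that usually means letting the model, or a scoring step, decide whether the first retrieval was good enough and retry with a reformulated query. A toy sketch with hypothetical `search()` and `llm_complete()` helpers:

```python
def retrieve_with_retry(question, search, llm_complete, max_tries=3, min_score=0.5):
    """Retry retrieval with an LLM-reformulated query when the best hit is weak."""
    query = question
    hits = []
    for _ in range(max_tries):
        hits = search(query, k=5)              # returns (score, text) pairs, best first
        if hits and hits[0][0] >= min_score:
            return hits
        query = llm_complete(
            f"The query '{query}' found nothing useful for the question "
            f"'{question}'. Suggest a better search query:"
        ).strip()
    return hits
```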
The tl;dr seems to be: tell an LLM to create pairs of questions and answers based on a document, and fine-tune on that data. Does the model answer questions from the article that weren't generated in advance?
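One way to check that (a sketch, not something from the article): hold out a slice of the generated questions, fine-tune only on the rest, and see whether the model can still answer the held-out ones. The `generate` and `judge` callables below are hypothetical stand-ins for the fine-tuned model and whatever scoring you trust:

```python
import random

def split_holdout(qa_pairs, holdout_frac=0.2, seed=0):
    """Hold out a fraction of generated Q/A pairs for evaluation only."""
    pairs = qa_pairs[:]
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - holdout_frac))
    return pairs[:cut], pairs[cut:]          # (train, held-out eval)

def eval_holdout(generate, held_out, judge):
    """Ask the fine-tuned model each held-out question and score the answers
    with a judge function (string match, embedding similarity, or an LLM)."""
    correct = 0
    for qa in held_out:
        answer = generate(qa["question"])
        correct += judge(answer, qa["answer"])
    return correct / len(held_out)
```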
Fantastic writeup -- thank you so much for sharing your lessons learned along the way! Very valuable resource, and I'll be checking out your product!
I always thought that fine-tuning is more about getting a style rather than memorizing information word for word, or at least the facts. What are the next steps to ensure that it doesn't start pulling info from the base knowledge and references the docs instead?
How long does it usually take to train? 10-15 minutes on what doc size?