Just to add to this, I ran through a lot of these topics around fine-tuning Llama 2 on your own dataset (for me it's my own code :P) in a coding live stream a couple of weeks ago. All on a single Colab GPU.

Fine-tuning Llama stream: https://www.youtube.com/watch?v=TYgtG2Th6fI&t=2282s

I have a couple more, including one where I do a QLoRA fine-tuning session and explain the concepts as a self-taught engineer (a software engineer of 8 years moving into ML recently).

QLoRA fine-tuning stream: https://www.youtube.com/watch?v=LitybCiLhSc&t=4584s

Overall I'm trying to break down how I approach a lot of my personal projects and my current AI-driven startup, and to make this information as accessible as possible. I also have a series where I'm fine-tuning a model to be the smallest webdev LLM possible, which people seem to be enjoying. I've only been streaming for about a month, and there's plenty more to come.

Ask me anything about the streams and fine-tuning Llama!
> Additionally, while this wasn’t an issue for GPT, the Llama chat models would often output hundreds of miscellaneous tokens that were unnecessary for the task, further slowing down their inference time (e.g. “Sure! Happy to help…”).

That's the problem I've been facing with Llama 2 as well. It's almost impossible to get it to output just the desired text; it will always add something before and after its response. Does anyone know of a prompting technique that fixes this?
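One workaround that has helped me, as a hedged sketch (the system prompt wording and example task here are purely illustrative): use Llama 2's [INST]/<<SYS>> chat format with a strict system prompt, and pre-seed the start of the assistant turn so there's no room left for a preamble.

    # Hedged sketch: suppressing the chat preamble with a strict system
    # prompt plus a pre-seeded assistant turn. All wording is illustrative.
    system = (
        "You are a text-processing tool. Reply with ONLY the requested "
        "output. No greetings, apologies, or explanations."
    )
    user = "Extract the city from: 'Flights to Paris are delayed.'"

    # Llama 2 chat template; ending the prompt with 'Output:' starts the
    # assistant turn for the model, so generated text can be used verbatim.
    prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST] Output:"

Few-shot examples inside the user turn and a stop sequence (e.g. a newline) help further, though none of this is bulletproof with the chat models.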
I'm really glad to see a post like this come out. I've seen so many discussions online about customizing models, and this post really does cut through the noise.

I really like the evaluation methodology, and it's well written too.
It's weird that LoRA and training with quantization aren't being taken more seriously. They're way cheaper, take less time, and a lot of evidence shows they work well.

I don't think this should be brushed aside as something to try out later.
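For anyone who hasn't tried it, here's roughly what the setup looks like with Hugging Face transformers + peft + bitsandbytes. This is a minimal sketch, and the model name and hyperparameters are illustrative rather than anything from the post:

    # Minimal QLoRA-style setup: 4-bit quantized base model + LoRA adapters.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                   # quantize base weights to 4-bit
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",          # assumes access to the gated repo
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"], # attention projections only
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()       # typically well under 1% trainable

That tiny trainable fraction is exactly where the cost and time savings come from.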
Glad to see the NER-like task performed the best, as I was just about to test something like this for comparison with a fine-tuned BERT model. Any idea about the training costs for this task?
One challenge is that to build a large enough custom dataset, you either need a small army of annotators or a very strong existing model, which in practice means using OpenAI. And using OpenAI to generate training material for another model violates their terms of service.

Has anyone taken them to court over this? Or do we all just decide it's not fair and ignore it?
Disclosure: I work for Anyscale.

This blog post seems to have gotten good attention :) so we definitely plan to add it to Ray Summit: https://raysummit.anyscale.com/agenda

Please comment on this thread if you have ideas about what kind of content you'd like to see more of at Ray Summit.
> ~14 min. for 7B for 1 epoch on 3.5M tokens. ~26 min for 13B for 1 epoch.

> At least 1x g5.16xlarge for the head node and 15x g5.4xlarge for worker nodes, for both 7B and 13B.

For the uninitiated, does anyone have an idea how much this would cost on AWS?
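A rough back-of-the-envelope, assuming us-east-1 on-demand rates (around $4.10/hr for a g5.16xlarge and $1.62/hr for a g5.4xlarge; check current pricing, and spot instances would be considerably cheaper):

    # Rough cluster cost; instance prices are assumed on-demand rates.
    head = 1 * 4.096           # g5.16xlarge head node, $/hr
    workers = 15 * 1.624       # g5.4xlarge worker nodes, $/hr
    rate = head + workers      # ~$28.5/hr for the whole cluster
    print(rate * 14 / 60)      # 7B,  ~14 min -> roughly $6.6
    print(rate * 26 / 60)      # 13B, ~26 min -> roughly $12.3

So on the order of ten dollars per epoch, not counting cluster spin-up time.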
Is it possible to fine-tune Llama 2 locally on an M1 Ultra with 64GB? I'd like to know, or any pointers would be good. Most guides are for the cloud or use Nvidia CUDA on Linux.
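It should be possible in principle via PyTorch's MPS backend, with the caveat that bitsandbytes (the 4-bit part of QLoRA) is CUDA-only, so you'd be limited to plain fp16 LoRA and it will be slow. A hedged sketch, untested on that exact machine and with illustrative hyperparameters:

    # Hedged sketch: fp16 LoRA on Apple Silicon via the MPS backend.
    # bitsandbytes/QLoRA will not work here (CUDA-only).
    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
    ).to(device)
    model = get_peft_model(
        model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")
    )
    # ...then run a standard training loop / Trainer on your dataset.

64GB of unified memory should hold the 7B weights in fp16 plus the LoRA adapters and their optimizer state, but I'd treat this as an experiment rather than a workflow.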