
Fine tune a 70B language model at home

909 points, by jph00, about 1 year ago
Jeremy from Answer.AI here. This is our first project since launching our new R&D lab at the start of this year.

It's the #1 most requested thing I've been hearing from open source model builders: the ability to use multiple GPUs with QLoRA training. So that's why we decided to make it our first project.

Huge thanks to Tim Dettmers for helping us get started on this -- and of course for creating QLoRA in the first place!

Let me know if you have any questions or thoughts.
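For readers who want a concrete picture of what is being combined here, below is a minimal sketch of the ingredients (4-bit NF4 quantization via bitsandbytes, LoRA adapters via peft, FSDP sharding). The model name and hyperparameters are illustrative, and naively composing these pieces is exactly what used to fail; making them work together is the project's contribution, so treat this as a conceptual outline rather than the actual recipe:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Ingredient 1: the frozen base model, loaded in 4-bit NF4 (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # illustrative; any large causal LM
    quantization_config=bnb_config,
)

# Ingredient 2: small trainable LoRA adapters on the attention projections.
lora_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Ingredient 3: shard across GPUs with FSDP (run under torchrun, one process
# per GPU, after init_process_group). Wrapping a quantized model here is the
# step that used to break, and is what this project makes work.
model = FSDP(model)
```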

43 comments

jph00, about 1 year ago
One thing I forgot to mention in the post which I think is kinda cool: at the NeurIPS Efficiency Challenge this year, where Tim Dettmers and I both did keynotes, every single top-ranked entry used QLoRA! The challenge was to create the most accurate model on a single GPU in 24 hours.

I think that is a great example of how important and useful QLoRA is. Maybe we should run a dual-GPU challenge next time now that multi-GPU is working...
llmzero, about 1 year ago
I liked that you link to renting dual 24GB GPUs for $0.60/hour, but how long would it take to fine-tune a 70B model using your system (4 bits for weights)?

If I were a consumer, I would be interested in the final price of fine-tuning: for example, a table with model size, training-set size, cost of training, and expected loss of quality with this technology.

One obvious question: can you apply your technology to the recent (-1,0,1) encoding? I think you will answer that the (-1,0,1) model is not available and you can't try it, but my question is whether, once/if that model is available, answer.ai will be able to use the same technique as in this post to fine-tune a big model on two very small GPUs -- and then I should ask for a new table with a cost/benefit analysis.

Edited: I should add that I find this kind of work very useful for enabling individual users like me to compete in the market for LLM applications. This is great work, along the lines of the book "Zero to One" (not that I like or dislike the author): solving the kind of problem that nobody else is trying to solve.

Edited: Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to chase votes and get some work done, and maybe someday create a new presence on HN.
int_19h, about 1 year ago
This is great, but one thing I really hoped would come sooner is fast training on Metal. As things stand, you can get an M1/M2 Ultra Mac Studio (~800 GB/s memory bandwidth; for comparison, the RTX 4090 is ~1050 GB/s) with 128GB RAM for ~$3500. For large-model inference this is already far more affordable than stacking GPUs while being "fast enough", but training solutions are basically non-existent. I do wonder why; it feels like low-hanging fruit.
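A rough rule of thumb behind these numbers: batch-1 inference is memory-bandwidth-bound, so every generated token streams the model weights through the memory bus once, putting a ceiling of bandwidth divided by model size on tokens per second. A back-of-envelope sketch using the commenter's figures:

```python
# Memory-bound estimate: generating one token reads all weights once, so
# tokens/s is capped at roughly bandwidth / model bytes.
def max_tokens_per_sec(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# A 70B model at 4 bits/weight (0.5 bytes) is ~35 GB of weights.
print(max_tokens_per_sec(800, 70, 0.5))   # M1/M2 Ultra: ~23 tok/s ceiling
print(max_tokens_per_sec(1050, 70, 0.5))  # RTX 4090: ~30 tok/s, if it fit in 24GB
```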
eurekin, about 1 year ago
This might be the most interesting constructive approach in "open source" LLMs I've seen: grounded, reasonable, and inviting replication! I wish academia took that as a standard.

Great job!
jamesblonde, about 1 year ago
This is a fantastic breakthrough for those of us who fine-tune LLMs on limited hardware budgets.

I was curious about the choice of FSDP over DeepSpeed. I have been using Axolotl for fine-tuning, where FSDP has been broken whilst DeepSpeed is rock solid. Why FSDP over DeepSpeed, jph00?
ricopags, about 1 year ago
This is such exciting news! Huge thanks to you for your continued work in making sense of AI.

I wonder if the recent BitNet 1.58 paper [the use of ternary bits in lieu of fp/int] might be an advancement that could further reduce the computation required for inference?
itsgrimetime, about 1 year ago
It would be cool to build an "LLM@home" project like Folding@home or SETI@home (RIP), where tons of folks could donate their GPUs and train something huge and FOSS. I don't know enough about how these models are trained, though. Could the work be chunked up and distributed in that way, then stitched/merged back together?
keeptrying, about 1 year ago
If you're going to be doing stuff like this, I'm damn excited for answer.ai!

It'll be the first time we have someone who knows AI creating leverage for open source.

Way to go!
chasd00, about 1 year ago
What's the best way for people to contribute to open source AI? I can't produce things like this, for many reasons, so how can I and others like me do our part to keep SOTA AI open?
yalok, about 1 year ago
Have you guys looked at using sparsification? Going to high sparsity ratios (say 90% of weights excluded) would probably require true re-training of the foundation model, which could be done once on expensive GPUs -- but fine-tuning such sparse models would hopefully require less RAM.

The trick to getting more benefit from the sparse approach is to use block sparsity (IIRC, Tim Dettmers worked on this as well, a few years ago), but a large block size (say 16x16) would require much longer retraining to recover the lost accuracy...
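To make the block-sparse idea concrete, here is a minimal sketch of magnitude-based tile pruning (the 16x16 block size and ~90% sparsity are the comment's numbers; scoring tiles by L2 norm is just one simple criterion):

```python
import torch

def block_sparsify(w: torch.Tensor, block: int = 16, keep: float = 0.10) -> torch.Tensor:
    """Zero all but the top `keep` fraction of (block x block) tiles by L2 norm."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    # View the matrix as a grid of tiles and score each tile.
    tiles = w.reshape(rows // block, block, cols // block, block)
    scores = tiles.norm(dim=(1, 3))                      # one score per tile
    k = max(1, int(keep * scores.numel()))
    thresh = scores.flatten().topk(k).values.min()
    mask = (scores >= thresh).float()[:, None, :, None]  # broadcast over tiles
    return (tiles * mask).reshape(rows, cols)

w = torch.randn(128, 128)
ws = block_sparsify(w)
print(f"{(ws == 0).float().mean():.0%} of weights zeroed")  # ~90%
```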
iandanforth, about 1 year ago
This is great; however, there were many opportunities to use the word 'nibble' in this post, and they were all missed.
artninja1988, about 1 year ago
So, as I understand it, this is for fine-tuning a pre-existing LLM? So not actually training one from scratch. I guess that would be too much to ask for. Nonetheless, cheers to Jeremy and the gang for the work.
pella, about 1 year ago
> the ability to use multiple GPUs with QLoRA training.

Thorough article!

Question: what's your opinion on:

- How viable will NVIDIA's consumer cards be in the long run?

- Besides https://tinygrad.org, what other cost-effective future alternatives could there be?
curl-up, about 1 year ago
Does anyone have sources, or experience, on fine-tuning primarily to teach the model factual data, especially for later "higher level" question answering?

For example, giving the model a bunch of text (academic papers and such) about 19th-century writers, then asking things like "Who were the main influences on writer X?"

Obviously, simple RAG-like approaches don't work, as such information is rarely available in the text as-is and needs to be "extrapolated" to some extent. Long-context models might work (just dumping everything into the prompt), but are way too expensive for my needs.
buildbot, about 1 year ago
Nice, I tried to use QLoRA+FSDP in the past with litgpt and obviously at that time it did not work. This is very useful!
ericd, about 1 year ago
This is the best news I’ve seen all month. I think one of the great near-term dangers of AI is the bulk of the economic benefit going mainly to relatively few companies. That risk seems substantially reduced if they have to compete with a great variety of models.
jl6, about 1 year ago
Besides being a great result, the quality and clarity of the technical writing here is excellent.
Kelteseth, about 1 year ago
Any plans on supporting AMD? In Germany, the price of a 7900XTX is HALF that of an NV 4090...
tbenst, about 1 year ago
Very interesting, but hard to interpret until the performance numbers / benchmarks are available. I can already fine-tune a 70B language model at home using CPU + RAM, but it would be so slow as to be almost totally impractical (~20x slower than GPU). It would be great to see a comparison to e.g. 8x A100 (available for $32/hr on AWS on-demand) and also CPU + RAM. Presumably it's somewhere in between, but hard to predict where!
delegate, about 1 year ago
Maybe I've missed it in the article, but how long would a full training run take on 2 consumer GPUs (local or rented)? Ballpark: hours, days...?
openquery, about 1 year ago
This article is very well written and super informative. One thing I didn't understand:

> At Answer.AI our north star is making useful AI more accessible. $150,000 to create your own high-quality personalized model definitely doesn't count as accessible!

Renting an A100 on RunPod is ~$1.89/hour. So you'd need ~80,000 A100-hours to train a useful AI model?
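The arithmetic behind that estimate, using the figures as quoted (whether the article's $150,000 maps onto pure A100 rental time is the commenter's assumption, not a claim from the article):

```python
a100_per_hour = 1.89    # USD, RunPod on-demand, per the comment
budget = 150_000        # USD, the figure quoted from the article
print(f"{budget / a100_per_hour:,.0f} A100-hours")  # ~79,365 hours
```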
jncfhnb, about 1 year ago
So... why do people want to fine-tune LLMs at home? It seems very unlikely to provide value.

* You're probably not going to succeed at injecting new knowledge in a way that feels satisfyingly top-of-mind to the bot.

* You're probably not going to create such a meaningfully new style that it warrants a LoRA like in images.

What's an example use case?
staticman2, about 1 year ago
If I wanted to use this software to fine-tune a 70B model on two 3090s to write fiction, what is the maximum sequence length that would be practical? I'm at the dataset-collection stage, but I'm not sure whether to aim for longer or shorter sequence lengths at the moment.
carbocation, about 1 year ago
I wonder whether LoRAs could be useful for U-Net training. Especially thinking of CNN-based U-Net models with pre-trained encoders (but randomly initialized decoders). At least, it seems possible that normal weight updates on the decoder and LoRA training on the encoder could improve efficiency.
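For what that might look like in practice: LoRA for convolutions is commonly implemented as a frozen conv plus a trainable low-rank parallel path. A hypothetical minimal wrapper along those lines (plain PyTorch; it ignores grouped/dilated convs, and the attribute name in the usage comment is made up):

```python
import torch
import torch.nn as nn

class ConvLoRA(nn.Module):
    """Freeze a pre-trained Conv2d and add a trainable rank-r parallel path."""
    def __init__(self, conv: nn.Conv2d, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.conv = conv
        for p in self.conv.parameters():
            p.requires_grad = False            # frozen pre-trained weights
        # Low-rank path: same geometry down to r channels, then a 1x1 back up.
        self.down = nn.Conv2d(conv.in_channels, r, conv.kernel_size,
                              conv.stride, conv.padding, bias=False)
        self.up = nn.Conv2d(r, conv.out_channels, 1, bias=False)
        nn.init.zeros_(self.up.weight)         # start as an exact no-op, like LoRA
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x) + self.scale * self.up(self.down(x))

# Hypothetical usage: wrap encoder convs, train the decoder normally.
# unet.encoder.conv1 = ConvLoRA(unet.encoder.conv1)
```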
Nouser76, about 1 year ago
Is there any framework/system that distributes the work across multiple GPUs on different computers over a network (LAN or WAN)? I'm not concerned much about latency or generation time, but I would love to train or load up huge models and send jobs to run overnight.
samstave, about 1 year ago
OK, this is going to come out as moronic because I don't have the proper vocab to phrase it:

Is it possible to 'array' tokenized workloads across providers of GPU? I want to farm out my 'compute' across [things].

More importantly, can there be a marketplace for GPU resources that I can effectively point my local job at?
jiwidi, about 1 year ago
> home

> two 24GB GPUs.

geez
Tostino, about 1 year ago
Nice, I've been hoping this would be possible for a while. I'll have to do a new fine-tune of Inkbot on top of one of the 70B models.

What are the max context lengths / batch sizes you can train at with this method on 2x24GB? What about 4x24GB?
Havoc, about 1 year ago
This is great.

I don't think local will be competitive in future IN GENERAL... but if I have a specific use case and a specific training dataset, local with specific training will murder the big commercial models.
zerop, about 1 year ago
Question: can I use this to retrain a 70B LLM's weights on my own data? I am using RAG as of now for asking questions about my text, but I always wonder if I could retrain an LLM on my own text. Thoughts?
JoelJacobson, about 1 year ago
Do the two 4090 GPUs need to be on the same machine, or is it possible to use two separate machines, each with its own 4090, linked somehow via e.g. InfiniBand?
pama, about 1 year ago
Thank you for the repo and write-up. What tools (if any) did you use for performance tuning once you achieved the main goal of being able to fine-tune the model?
m3kw9, about 1 year ago
If they can continuously train it, it could be better than a large context, as this is how an AI OS would need to work when you have constant updates to your files.
Tepix, about 1 year ago
Congratulations, a fantastic contribution to open source AI. Why does the website headline say "train" instead of "finetune"?
erwincoumans, about 1 year ago
NVLink + two 3090s gives 48GB relatively easily (it appears as unified memory). I only skimmed the article briefly; was it considered?
hathym, about 1 year ago
Imagine the potential of a Folding@home-inspired project for AI development. What kind of powerful model could a community of gamers and GPU owners create?
seratibp, about 1 year ago
Is it worth it, though, when the base model isn't even smart enough?
chompychop, about 1 year ago
Does this support multimodal language models (e.g., LLaVA)?
lbj, about 1 year ago
Can't believe they didn't name this Qolor
g42gregory, about 1 year ago
This is brilliant. Thank you for doing this!
sieszpak, about 1 year ago
4x 3080???
OOPMan, about 1 year ago
Ah yes, 24GB top-of-the-line GPUs, I happen to have a whole warehouse full.

/s
vouaobrasil, about 1 year ago
It would be great if we were a bit more respectful towards our natural resources. Using so much energy to play with language models is a waste of resources.