I don’t train LLMs from scratch, but I have:<p>3x4090s
1x Tesla A100<p>Lots of fine-tuning, attention visualisation, and evaluation of embeddings and different embedding-generation methods; not just LLMs, though I use them a lot, but deep nets of many kinds<p>Both for my day job (hedge fund) and my hobby project <a href="https://atomictessellator.com" rel="nofollow">https://atomictessellator.com</a><p>It’s summer here in NZ and I have these in servers mounted in a freestanding server rack beside my desk, and it is very hot in here XD
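For anyone curious what the attention-visualisation piece of a workflow like that can look like, here is a minimal sketch using Hugging Face transformers; the model name and the layer/head choice are placeholders, not the poster's actual setup:
<pre><code>
# Minimal attention-visualisation sketch (assumes torch, transformers and matplotlib are installed).
import torch
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # placeholder model, not the commenter's actual one
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

text = "Attention visualisation on a toy sentence."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
attn = out.attentions[-1][0, 0]  # last layer, first head
labels = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

plt.imshow(attn.numpy(), cmap="viridis")
plt.xticks(range(len(labels)), labels, rotation=90)
plt.yticks(range(len(labels)), labels)
plt.colorbar()
plt.tight_layout()
plt.savefig("attention.png")
</code></pre>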
Some people have been fine-tuning Mistral 7B and Phi-2 on their high-end Macs. Unified memory is a hell of a thing. The resulting model here is not spectacular, but as a proof of concept it's pretty exciting what you get in ~3.5 hours on a consumer machine.<p>- Apple M2 Max, 64GB shared RAM<p>- Apple Metal (GPU), 8 threads<p>- 1152 iterations (3 epochs), batch size 6, trained over 3 hours 24 minutes<p><a href="https://www.reddit.com/r/LocalLLaMA/comments/18ujt0n/using_gpus_on_a_mac_m2_max_via_mlx_update_on/" rel="nofollow">https://www.reddit.com/r/LocalLLaMA/comments/18ujt0n/using_g...</a>
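Back-of-the-envelope on those numbers, just to make the scale concrete (pure arithmetic from the figures quoted above; nothing is assumed about the dataset itself):
<pre><code>
# Rough arithmetic from the run described above:
# 1152 iterations, batch size 6, 3 epochs, ~3 h 24 min on an M2 Max.
iterations = 1152
batch_size = 6
epochs = 3
minutes = 3 * 60 + 24

examples_per_epoch = iterations * batch_size / epochs  # ~2304 training examples per epoch
iters_per_minute = iterations / minutes                # ~5.6 iterations per minute
seconds_per_iter = minutes * 60 / iterations           # ~10.6 seconds per batch of 6

print(examples_per_epoch, round(iters_per_minute, 1), round(seconds_per_iter, 1))
</code></pre>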
A self-built machine with dual 4090s, soon to be 3x. Watercooled for quieter operation.<p>Did the math on how much using RunPod per day would cost, and bought this setup instead.<p>Using fully sharded data parallel (FSDP) and bfloat16, I can train a 7B-param model very slowly. But that’s fine for only going 2000 steps!
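For the curious, a minimal sketch of what the FSDP + bfloat16 wrapping can look like in PyTorch. The model choice, hyperparameters, and launch details are my assumptions, not this poster's actual script; you would launch it with torchrun across the GPUs:
<pre><code>
# Skeleton of FSDP + bfloat16 sharded training (assumes a torchrun launch that
# sets LOCAL_RANK etc.; the model name below is a placeholder).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder 7B model

bf16 = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
model = FSDP(model, mixed_precision=bf16, device_id=local_rank)

optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ...then an ordinary training loop (forward, loss.backward(), optim.step()),
# capped at ~2000 steps as described above.
</code></pre>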
I doubt many people are using local setups for serious work.<p>Even fine-tuning Mixtral takes 4x H100 for 4 days, which is a ~$200k server currently.<p>To fully train, not just fine-tune, even a small model, say Llama 2 7B, you need over 128 GiB of VRAM, so you're still in multiple-GPU territory, likely A100s or H100s.<p>This all depends on the settings you use; increase the batch size and you will see even more memory utilization.<p>I believe a lot of people see these models running locally and assume training is similar, but it isn't.
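The 128 GiB figure lines up with the usual mixed-precision Adam accounting of roughly 16 bytes per parameter. A quick back-of-the-envelope version (a rule-of-thumb estimate; it ignores activations, which come on top and grow with batch size):
<pre><code>
# Rough memory estimate for fully training a 7B-parameter model with Adam
# in mixed precision (the common ~16 bytes/param rule of thumb).
params = 7e9

weights_bf16 = 2 * params  # bf16 weights used for forward/backward
grads_bf16   = 2 * params  # bf16 gradients
master_fp32  = 4 * params  # fp32 master copy of the weights
adam_moments = 8 * params  # fp32 first + second Adam moments

total_bytes = weights_bf16 + grads_bf16 + master_fp32 + adam_moments
print(total_bytes / 2**30, "GiB before activations")  # ~104 GiB

# Activations scale with batch size and sequence length, which is why
# bumping the batch size pushes memory use up further still.
</code></pre>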