We have a massive GPU cluster and have developed our own infrastructure to manage it and train massive models.<p>Here's how it works:<p>1. You upload a dataset in the preconfigured format to HuggingFace [1].<p>2. Choose your LLM (e.g. LLaMa 70B, Mistral 7B).<p>3. Place your submission in the queue.<p>4. Wait for it to get trained.<p>5. You then get your trained model on HuggingFace.<p>Why are we doing this?<p>1. We already have experience training big LLMs.<p>2. We can achieve near-perfect infrastructure performance for training.<p>3. Sometimes our GPUs sit idle with nothing to train.<p>So we thought it would be cool to keep our GPU cluster fully utilized and give back to the open-source community (we have already built an end-to-end distributed training framework [2]).<p>This is at an early stage, so you can expect some bugs.<p>Any thoughts, opinions, or ideas are very welcome!<p>[1]: <a href="https://github.com/higgsfield-ai/higgsfield/blob/main/tutorials/README.md">https://github.com/higgsfield-ai/higgsfield/blob/main/tutori...</a><p>[2]: <a href="https://github.com/higgsfield-ai/higgsfield">https://github.com/higgsfield-ai/higgsfield</a>