It was fun to follow the public TinyLlama loss curves in near real-time, though it could also be frustrating, since the loss barely moved down even after an extra trillion tokens: <a href="https://wandb.ai/lance777/lightning_logs/reports/metric-train_loss-23-09-04-23-38-15---Vmlldzo1MzA4MzIw?accessToken=5eu2sndit2mo6eqls8h38sklcgfwt660ek1f2czlgtqjv2c6tida47qm1oty8ik9" rel="nofollow">https://wandb.ai/lance777/lightning_logs/reports/metric-trai...</a> (note the log-scaled X-axis)<p>But it <i>did</i> move down, and that's what's important.<p>There should probably be more aggressive learning rate annealing for models trying to be Chinchilla-optimal, instead of just cosine-with-warmup like every other model nowadays.
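For reference, a minimal sketch of the kind of schedule being contrasted here (the warmup length, peak LR, and floor are made-up illustration values, not TinyLlama's actual hyperparameters):
<pre><code># Cosine-with-warmup schedule, plus a variant that anneals more aggressively.
# All hyperparameters below are hypothetical, for illustration only.
import math

def cosine_with_warmup(step, max_steps, warmup_steps=2000,
                       max_lr=4e-4, min_lr=4e-5):
    # Linear warmup, then cosine decay from max_lr down to min_lr.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def aggressive_anneal(step, max_steps, warmup_steps=2000, max_lr=4e-4):
    # Same shape, but decay all the way to ~0 instead of flooring at max_lr/10.
    return cosine_with_warmup(step, max_steps, warmup_steps, max_lr, min_lr=0.0)

print(cosine_with_warmup(step=500_000, max_steps=500_000))  # ends at min_lr
print(aggressive_anneal(step=500_000, max_steps=500_000))   # ends at ~0.0
</code></pre>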
From the GitHub repo Readme:<p>> we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs<p>I knew the computational power required to train LLMs was absurd, but the figures for the larger networks are just too big to grasp intuitively, so it never really registered. With this one I can actually picture the 16 A100 GPUs sitting in a server room running at full blast for 90 days, which makes it more tangible... And thinking about the larger models is kind of scary<p>Edit: Did the math, and the GPUs alone (at 250W each) consumed around 8.64 MWh, which is in the same ballpark as the annual power consumption of an average US home (10.5 MWh)
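Spelling out that back-of-the-envelope calculation (250 W per GPU is the figure assumed above; it matches the A100 PCIe TDP, while SXM variants are rated higher):
<pre><code># GPU-only energy estimate for the quoted 90-day, 16x A100 run.
gpus = 16
watts_per_gpu = 250          # assumption from the comment above
hours = 90 * 24

energy_mwh = gpus * watts_per_gpu * hours / 1e6   # Wh -> MWh
print(f"{energy_mwh:.2f} MWh")                    # 8.64 MWh
</code></pre>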
I've been using one of the earlier checkpoints for benchmarking a Llama implementation. Completely anecdotally, this one feels at least as good as, or better than, the earlier OpenLLaMA 3B. I wouldn't use either of them for RAG or anything requiring more capability; the point is just that it's competitive as a smaller model, whatever you use those for, and easy to run on CPU at FP16 (i.e. without serious quantization).
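For anyone who wants to try that, a minimal CPU inference sketch with Hugging Face transformers (the prompt is arbitrary; FP16 on CPU works on recent PyTorch builds but can be slow or unsupported for some ops, so float32 is a safer fallback):
<pre><code># Minimal CPU inference sketch; swap torch.float16 for torch.float32 if your
# PyTorch build complains about half precision on CPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "List three uses for a 1.1B parameter language model."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</code></pre>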
Link to model on HF Hub: <a href="https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0" rel="nofollow">https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0</a>
GitHub repo with links to the checkpoints: <a href="https://github.com/jzhang38/TinyLlama">https://github.com/jzhang38/TinyLlama</a>
OP here with a shameless plug: for anyone interested, I'm working on a site called Emergent Mind that surfaces trending AI/ML papers. This TinyLlama paper/repo is trending #1 right now and will likely stay there for a while given how much attention it's getting across social media: <a href="https://www.emergentmind.com/papers/2401.02385" rel="nofollow">https://www.emergentmind.com/papers/2401.02385</a>. Emergent Mind also looks for and links to relevant discussions/resources on Reddit, X, Hacker News, GitHub, and YouTube for every new arXiv AI/ML paper. Feedback welcome!