Hah, funny to see this on HN. It's a relatively old project but one that I continue to love and still work on. I was trying to train a GPT one day and discovered that the available implementations were quite complex, spread across many files, and took way too many kwarg switches for esoteric/rare options that just bloated and complicated the code. But in my head a GPT was a super simple, neat, isotropic model, so I got all worked up and wrote minGPT.

The project went on to have more impact than I originally imagined and made its way into a number of projects and papers. One of those I found only a few days ago, here: https://twitter.com/karpathy/status/1566100736076697600 . What I love about these projects is that the authors often "hack up" minGPT in code directly. They don't configure a comprehensive kwarg monster. I think there's a beauty in that. Very often I wish we had more gists and fewer frameworks: to look at code chunks, understand them completely, tune them to our needs, and re-use them in projects, similar to how bacteria trade little DNA plasmids. minGPT is written for those who want that for their GPT projects. There are plenty of cons to this approach too; ultimately I think there's value in both approaches.

Coming up, the theme of future minGPT development: more examples, and more teeth. It should be possible to demonstrate the training of relatively serious (~few B parameter) models with minGPT on a single multi-GPU node and reproduce some benchmarks around that scale, while never sacrificing its readability.
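For anyone who hasn't opened the repo: the whole point is that the happy path is a handful of lines. Below is a minimal sketch roughly following the usage pattern shown in the minGPT README; the exact config attribute names and model type strings may differ between versions of the repo, and the RandomTokens dataset is made up purely to show the plumbing, not a real task.

    import torch
    from torch.utils.data import Dataset
    from mingpt.model import GPT
    from mingpt.trainer import Trainer

    class RandomTokens(Dataset):
        # placeholder dataset of random token ids, just to show the plumbing;
        # a real project returns (input, target) pairs from its own data
        def __len__(self):
            return 1024
        def __getitem__(self, i):
            x = torch.randint(0, 100, (65,))
            return x[:-1], x[1:]   # shifted-by-one next-token targets

    model_config = GPT.get_default_config()
    model_config.model_type = 'gpt-nano'   # one of the predefined tiny sizes
    model_config.vocab_size = 100
    model_config.block_size = 64
    model = GPT(model_config)

    train_config = Trainer.get_default_config()
    train_config.learning_rate = 5e-4
    train_config.max_iters = 1000
    trainer = Trainer(train_config, model, RandomTokens())
    trainer.run()

That's the appeal: when you want something the config doesn't cover, you edit model.py or trainer.py directly instead of threading yet another kwarg through a framework.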
This is actually a pretty neat, self-contained implementation that can super easily be extended beyond stereotypical natural language models, for example to create world models for video games [1] or robot models that learn to imitate from large, chaotic human demonstration data [2] (disclaimer: I'm an author on the second one). Basically, GPT (or minGPT) models are EXCELLENT sequence modelers, almost to the point where you can throw any sensible sequence data at them and hope to get interesting results, as long as you don't overfit. (A rough sketch of that recipe is below, after the links.)

Even though I have only been working on machine learning for around six years, it's crazy to see how fast the landscape has changed just recently, with diffusion models and transformers. It's not a stretch to say that we might see more major breakthroughs by the end of this decade, and end up in a place we can't even imagine right now!

[1] https://github.com/eloialonso/iris
[2] https://github.com/notmahi/bet
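To make that concrete: the recipe in both of those projects is, very roughly, "discretize whatever signal you have into tokens, then train a decoder-style transformer to predict the next token." Here's a tiny self-contained sketch of that idea in plain PyTorch; this is NOT the actual IRIS or BeT code, and the bin counts, shapes, names (TinySequenceGPT, discretize), and model sizes are invented for illustration.

    import torch
    import torch.nn as nn

    N_BINS = 64     # hypothetical number of discretization bins
    SEQ_LEN = 32    # hypothetical context length

    def discretize(x, low=-1.0, high=1.0, n_bins=N_BINS):
        # map continuous values in [low, high] to integer token ids
        x = x.clamp(low, high)
        return ((x - low) / (high - low) * (n_bins - 1)).long()

    class TinySequenceGPT(nn.Module):
        # decoder-style transformer over token ids, trained by next-token prediction
        def __init__(self, vocab_size, d_model=128, n_layer=4, n_head=4):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Parameter(torch.zeros(1, SEQ_LEN, d_model))
            layer = nn.TransformerEncoderLayer(d_model, n_head, 4 * d_model,
                                               batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layer)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, idx):
            B, T = idx.shape
            x = self.tok_emb(idx) + self.pos_emb[:, :T]
            # causal mask so position t only attends to positions <= t
            mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
            return self.head(self.blocks(x, mask=mask))

    # fake "robot action" trajectories: continuous values, then tokenized
    actions = torch.rand(8, SEQ_LEN) * 2 - 1
    tokens = discretize(actions)                 # (8, SEQ_LEN) integer ids
    model = TinySequenceGPT(vocab_size=N_BINS)
    logits = model(tokens[:, :-1])               # predict the next token at each step
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, N_BINS), tokens[:, 1:].reshape(-1))
    loss.backward()

Swap in game frames, action chunks, or any other tokenizable signal and the rest of the loop stays the same, which is exactly why these models travel so well outside NLP.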
I love your approach and philosophy around programming. If anyone is unaware, Karpathy has a relatively small YouTube channel he started a few weeks ago: https://youtu.be/VMj-3S1tku0
With enough training data and enough GPUs for the training runs, you'll get there! Goes to show that for AI, the code really isn't the important part. AI is and always has been about data and compute.