This is cool, and timely (I wanted a neat repo like that).

I have also been working for the last two weeks on a GPT implementation in C. It eventually turned out to be really slow (without CUDA), but it taught me how much memory management and data management there is when implementing these systems. You are running a loop billions of times, so you need to preallocate the computational graph and related buffers. If anyone wants to check it out, it's a ~1500 LOC single file:

https://github.com/attentionmech/gpt.c/blob/main/gpt.c
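To make the preallocation point concrete, here is a rough sketch of the pattern in numpy (not the gpt.c code itself; the sizes and the single weight matrix are made up for illustration): every buffer the loop touches is allocated once, up front, and reused in place on each step.

```python
import numpy as np

# Toy sizes, purely illustrative.
BATCH, SEQ, D_MODEL = 8, 64, 128

# Allocate everything once; the training loop itself never allocates.
W     = (np.random.randn(D_MODEL, D_MODEL) * 0.02).astype(np.float32)
x     = np.empty((BATCH, SEQ, D_MODEL), dtype=np.float32)
acts  = np.empty((BATCH, SEQ, D_MODEL), dtype=np.float32)
gradW = np.zeros_like(W)

for step in range(100):
    x[:] = np.random.randn(BATCH, SEQ, D_MODEL)  # stand-in for loading a batch
    np.matmul(x, W, out=acts)                    # forward pass writes into a reused buffer
    # a real backward pass would accumulate into gradW in place here
    gradW[:] = 0.0
```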
Neat, I love projects like these.

The next level down is to do it directly in numpy.

And then from there, write a minimal numpy work-alike to support the model above.

You start with a working system built on the most powerful abstractions. Then you iteratively remove abstractions, lowering your solution, and when you get low enough but are still riding on an external abstraction, you rewrite that too, but ONLY to the extent needed to support the layers above you.

Following this pattern, you can bootstrap yourself to full system understanding. It's not unlike the RL+distillation process people go through to learn complex topics.
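To make "do it directly in numpy" concrete, here is a hedged sketch of one of the pieces you would end up writing, a single causal self-attention head, using nothing but numpy (the shapes and names are my own, not taken from the repo):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    T = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (T, T) attention logits
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(mask, -1e9, scores)             # block attention to future positions
    return softmax(scores) @ v                        # (T, d_head)

# tiny usage example
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
Wq, Wk, Wv = (0.1 * rng.standard_normal((16, 16)) for _ in range(3))
print(causal_self_attention(x, Wq, Wk, Wv).shape)  # (8, 16)
```

Writing the minimal numpy work-alike underneath is then mostly a matter of reimplementing the handful of operations this actually uses: matmul, exp, max, sum, and broadcasting.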
Can someone help me understand what I’m looking at here? This repository lets me train a specific model on a specific data set and then test the result? Is that correct?

I am interested in how large and small language models are trained, but as someone with little knowledge of this world I find it hard to cut through the noise to find useful information.

Really I’m looking for an open source project that helps a person gain this knowledge. Something like a Docker container that encapsulates all the dependencies. When training, it would use any available GPU (or tell me why my GPU can't be used and fall back to CPU). Then it would have a simple interface to test the training results. Finally, you could easily pull back the curtain to understand the process in more detail and maybe even adapt it to a different model to experiment.

Does something like that exist?
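On the "use any available GPU, or tell me why it can't be used and fall back to CPU" part: that behaviour is only a few lines in PyTorch. A hedged sketch (not from this repo, and the messages are my own):

```python
import torch

def pick_device() -> torch.device:
    """Prefer a GPU when one is usable; otherwise say why and fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple-silicon GPU backend
        return torch.device("mps")
    if torch.backends.cuda.is_built():
        print("PyTorch has CUDA support, but no usable GPU/driver was found; using CPU.")
    else:
        print("This PyTorch build was compiled without CUDA; using CPU.")
    return torch.device("cpu")

device = pick_device()
model = torch.nn.Linear(8, 1).to(device)  # stand-in for the real model
```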
GitHub has had a bunch of these for years; the best known is from Andrej Karpathy:

https://github.com/karpathy/nanoGPT

Some others have MoE implemented.
Here's a google collab notebook built from this. It takes ~2 hours on A100 GPU if you have collab pro. Might work on free account as well.<p><a href="https://colab.research.google.com/drive/1dklqzK8TDPfbPbyHrk3llXFOOiOhFUeJ?usp=sharing#scrollTo=BEgEJhqeLAgg" rel="nofollow">https://colab.research.google.com/drive/1dklqzK8TDPfbPbyHrk3...</a>
The example story is interesting.

I have made my own implementation from scratch with my own multi-channel tokeniser: each channel gets its own embedding table (sizes 32768, 256, 256, 64, and 4), and the per-channel embeddings are summed together along with the position encoding.

Yet with all of those differences, my stories have Lily as a protagonist often enough that I thought I had a bug somewhere.

Might have to check TinyStories for name distribution.

Most questionable output from mine so far:

"one day, a naughty man and a little boy went to the park place to find some new things."
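For anyone curious what "each channel gets its own embedding table, summed along with the position encoding" might look like in code, here is a hedged PyTorch sketch; the table sizes come from the comment above, but d_model, max_len, and everything else are assumptions of mine:

```python
import torch
import torch.nn as nn

class MultiChannelEmbedding(nn.Module):
    """One embedding table per token channel, summed with a learned position embedding."""

    def __init__(self, d_model=256, max_len=512,
                 channel_sizes=(32768, 256, 256, 64, 4)):
        super().__init__()
        self.channels = nn.ModuleList(nn.Embedding(n, d_model) for n in channel_sizes)
        self.pos = nn.Embedding(max_len, d_model)

    def forward(self, tokens):
        # tokens: (batch, seq_len, num_channels) integer ids, one id per channel
        _, t, _ = tokens.shape
        x = sum(emb(tokens[..., i]) for i, emb in enumerate(self.channels))
        return x + self.pos(torch.arange(t, device=tokens.device))

# tiny usage example with random ids
emb = MultiChannelEmbedding()
ids = torch.stack([torch.randint(0, n, (2, 16)) for n in (32768, 256, 256, 64, 4)], dim=-1)
print(emb(ids).shape)  # torch.Size([2, 16, 256])
```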
It’s interesting that technology so transformative is only a few hundred lines of code (excluding underlying frameworks and such).

How big would you guess state-of-the-art models are, in terms of lines of code?
So, this has nothing to do with "SmolLM" - a set of models (with data, training recipes, etc.) released by HuggingFace?

https://huggingface.co/blog/smollm
I noticed several people mentioned Karpathy already, but I wanted to add that his tiny "Micrograd" project (see the YouTube video and GitHub repo) is a great introduction to neural nets (the multilayer perceptron), which is at the core of [most] machine learning, of course.
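For a flavour of what micrograd boils down to, here is a stripped-down, micrograd-style scalar autograd sketch (my own toy version, not Karpathy's actual API): each Value remembers its parents and a closure that applies the chain rule, and backward() replays those closures in reverse topological order.

```python
import math

class Value:
    """A scalar that records how it was computed so gradients can flow backwards."""

    def __init__(self, data, _parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = _parents
        self._backward = lambda: None  # filled in by the op that created this Value

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort, then apply the chain rule from the output backwards.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# one "neuron": y = tanh(w*x + b), then ask how y changes with respect to w
x, w, b = Value(2.0), Value(-0.5), Value(0.1)
y = (w * x + b).tanh()
y.backward()
print(y.data, w.grad)  # gradient of y with respect to w
```

The real micrograd adds the remaining operators plus a small nn module (Neuron/Layer/MLP) on top of essentially this idea.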
Looks like a rip-off of https://github.com/PraveenRaja42/Tiny-Stories-GPT

without any credit to the above or to the TinyStories paper.