Hey y'all, author here!

Thank you for all the nice and constructive comments!

For clarity, this is ONLY the forward pass of the model. There's no training code, batching, KV cache for efficiency, GPU support, etc.

The goal here was to provide a simple yet complete technical introduction to GPT as an educational tool. Tried to make the first two sections something any programmer can understand, but yeah, beyond that you're gonna need to know some deep learning.

Btw, I tried to make the implementation as hackable as possible. For example, if you change the import from `import numpy as np` to `import jax.numpy as np`, the code becomes end-to-end differentiable:

    def lm_loss(params, inputs, n_head) -> float:
        # the targets are just the inputs shifted one position to the left
        x, y = inputs[:-1], inputs[1:]
        # forward pass -> [seq_len, vocab]
        output = gpt2(x, **params, n_head=n_head)
        # cross-entropy: take what the model assigns to each target token
        loss = np.mean(-np.log(output[np.arange(len(y)), y]))
        return loss

    grads = jax.grad(lm_loss)(params, inputs, n_head)
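And once you have gradients, a training step isn't far off. A minimal sketch of one (very naive) gradient descent update, assuming a learning rate `lr` (not something in the post itself):

    import jax

    # params and grads are pytrees with the same structure, so a parameter
    # update is just an element-wise tree_map (lr is an assumed hyperparameter)
    lr = 1e-4
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)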
You can even support batching with `jax.vmap` (https://jax.readthedocs.io/en/latest/_autosummary/jax.vmap.html):

    # batch over the leading axis of the inputs; params and n_head are shared across the batch
    gpt2_batched = jax.vmap(lambda inputs: gpt2(inputs, **params, n_head=n_head))
    gpt2_batched(batched_inputs)  # [batch, seq_len] -> [batch, seq_len, vocab]
Of course, with JAX you also get built-in GPU and even TPU support (see the sketch below)!

As far as training code and a KV cache for inference efficiency, I leave those as an exercise for the reader lol
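To give a feel for the GPU/TPU part (just a hedged sketch, not something in the post): wrap the forward pass in `jax.jit` and XLA compiles it for whichever backend is available. `n_head` is a plain Python int used to split heads, so it's marked static here:

    import jax

    # compile the forward pass; XLA targets the available backend (CPU/GPU/TPU)
    # n_head drives reshapes/splits, so mark it as a static argument
    gpt2_jitted = jax.jit(gpt2, static_argnames="n_head")
    output = gpt2_jitted(inputs, **params, n_head=n_head)  # same output as the un-jitted call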