Llama from scratch, or how to implement a paper without crying

513 points · by bkitano19 · almost 2 years ago

15 comments

GistNoesis · almost 2 years ago
There is a bug: in this implementation, SwiGLU's beta is a learnable parameter, but in the reference paper the feed-forward network sets beta to a constant, FFN_SwiGLU = Swish_1(...): https://arxiv.org/pdf/2002.05202.pdf (Eq. 6).

In the official llama implementation the constant beta has been removed: https://github.com/facebookresearch/llama/blob/main/llama/model.py#L212

In the blog's training log we observe various lines like "feedforward.1.beta', 0.0", which means that during training beta degenerated to 0, whereas it should be the constant 1.
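A minimal sketch of a SwiGLU feed-forward block with the Swish beta fixed at 1 (via F.silu), as in Eq. 6 of the paper; the class name and layer sizes here are illustrative assumptions, not code from the post or the official repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """FFN_SwiGLU(x) = (Swish_1(x W) * x V) W2, with beta held constant at 1."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim, bias=False)   # gate projection
        self.v = nn.Linear(dim, hidden_dim, bias=False)   # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is Swish with beta = 1; there is no learnable beta anywhere.
        return self.w2(F.silu(self.w(x)) * self.v(x))
```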
spi · almost 2 years ago
Kudos for the work! Stupid comment (not really on the main topic of the blog post, but might be useful anyway for future "toy example" models): in the initial SimpleBrokenModel class [EDIT: and also in SimpleModel], there is actually quite a bit of wasted computation (something like > 66% of all the model's computation!). You are applying, in sequence, the following layers:

- embedding 65 -> 128
- linear 128 -> 128
- ReLU
- linear 128 -> 65

But since there's no non-linearity at all between the first two layers, and they are both linear... the second one is totally useless. This model is effectively a "classical" single-hidden-layer MLP. And in terms of FLOPS, it's wasting 128*128 = 16k operations out of a total of 128*128 + 65*128 = 24k operations.
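To see why that layer adds nothing, here is a small check (a sketch, not code from the post; the 128 -> 128 and 128 -> 65 shapes follow the list above) showing that two stacked linear maps with no non-linearity between them collapse into a single linear map:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
a = nn.Linear(128, 128, bias=False)  # the redundant middle layer
b = nn.Linear(128, 65, bias=False)   # the output projection

# Fold both weight matrices into one: y = (x A^T) B^T = x (B A)^T
merged = nn.Linear(128, 65, bias=False)
with torch.no_grad():
    merged.weight.copy_(b.weight @ a.weight)

x = torch.randn(4, 128)
print(torch.allclose(b(a(x)), merged(x), atol=1e-5))  # True: same function, fewer FLOPS
```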
bravura · almost 2 years ago
Overall, a good sense of fundamental principles demonstrated.

Particularly: "Use .shape religiously. assert and plt.imshow are your friends." Thank you. You should always assert pre- and post-conditions on shape. (Do beartype or typeguard allow you to do this using decorators?)

Some nits:

"Before you even look at the paper, pick a small, simple, and fast model that you've done in the past. Then make a helper function to evaluate the model qualitatively." Don't you mean *quantitatively*? So that you establish a numerical baseline against which you can compare the more advanced method.

"Start by picking apart different components of the paper, and then implementing them one-by-one, training and evaluating as you go." Can you be precise about what you mean here? A lot of work is like: "Okay, we tried 10 changes [for unspecified reasons], some major and some minor, to get our final thing, and here's an ablation study to show how much we lose if we remove each piece." If you would say: "Implement the meat first (the major architectural change fundamental to the work, i.e. the ablation study line-item all the way at the bottom, with no seasoning or spices on it)", then yeah, that's a good place to start. But if you start with a broccoli recipe, switch to a meat recipe, and taste it halfway through cooking before you've even flipped it, you're not going to learn much. This sort of advice is better framed as: "Evaluate each time you make an atomic change to the approach, prioritizing changes in the order that had the most impact in the ablation study, from easiest to hardest, respecting the DAG in which certain changes can be made."
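On the decorator question: as far as I know, jaxtyping-style shape annotations checked by beartype or typeguard can enforce this. A plain-assert version of the pre/post-condition idea might look like the following sketch; the function and shapes are made up for illustration:

```python
import torch

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # Precondition: both inputs are (batch, seq, dim) and agree on every axis.
    assert q.ndim == 3 and k.ndim == 3, (q.shape, k.shape)
    assert q.shape == k.shape, (q.shape, k.shape)
    batch, seq, dim = q.shape

    scores = (q @ k.transpose(-2, -1)) / dim**0.5

    # Postcondition: one score per query/key pair.
    assert scores.shape == (batch, seq, seq), scores.shape
    return scores

print(attention_scores(torch.randn(2, 16, 64), torch.randn(2, 16, 64)).shape)
```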
matroid · almost 2 years ago
What is the guiding principle behind using SwiGLU instead of ReLU? Did the authors decide by simply trying all available non-linearities, or is there a deeper reason?
bkitano19 · almost 2 years ago
edit: bearblog is getting DDoS'd; here's the repo: https://github.com/bkitano/llama-from-scratch
mike_hearn · almost 2 years ago
For AI learners like me, here's an attempt to briefly explain some of the terms and concepts in this blog post, in the rough order they appear.

A token is a unique integer identifier for a piece of text. The simplest tokenization scheme is just Unicode, where one character gets one integer; however, LLMs have a limited number of token IDs available for use (the vocabulary), so a more common approach is to glue characters together into common fragments. This post just uses the subset of ASCII needed by TinyShakespeare.

The "loss function" is just a measure of how similar the model's prediction is to the ground truth. Lower loss = better predictions. Different tasks have different loss functions, e.g. edit distance might be one (but not a good one). During training you compute the loss and will generally visualize it on a chart. Whilst the line is heading downwards your NN is getting better, so you can keep training.

PyTorch is a library for working with neural networks and tensors. A tensor is either a single number (0 dimensions, a scalar), an array of numbers (1 dimension, a vector), or a multi-dimensional array of numbers, where the 2-dimensional case is called a matrix. But a tensor can have any number of dimensions. PyTorch has a relatively large amount of magic going on in it via reflection and other things, so don't expect the code to make much intuitive sense. It's building a computation graph that can later be executed on the GPU (or CPU). The tutorial is easy to read!

A neural network is a set of neurons, each of which has a number called the bias, and connections between them, each of which has an associated weight. Numbers (activations) flow from an input neuron through the connections whilst being adjusted by the weights to arrive at an output neuron; those numbers are then summed and the bias is added before being emitted again to the next layer. The weights and biases are the network parameters and encode its knowledge.

A linear layer is a set of input neurons connected to a set of output neurons, where every input is connected to every output. It's one of the simplest kinds of neural network structure. If you ever saw a diagram of a neural network pre-2010, it probably looked like that. The sizes of the input and output layers can be different.

ReLU is an activation function. It's just Math.max(0, x), i.e. it sets all negative numbers to zero. These are placed on the outputs of a neuron and are one of those weird mathematical hacks where I can't really explain why it's needed, but introducing "kinks" in the function helps the network learn. Exactly what "kinks" work best is an open area of exploration, and later the author will replace ReLU with a newer, more complicated function.

Gradients are kind of numeric diffs computed during training that are used to update the model and make it more accurate.

Batch normalization is a way to process the numbers as they flow through the network, which helps the network learn better.

Positional encodings help the network understand the positions of tokens relative to each other, expressed in the form of a vector.

The `@` infix operator in Python is an alias for the __matmul__ method and is used as a shorthand for matrix multiplication (there are linear algebra courses on YouTube that are quite good if you want to learn this in more detail).

An epoch is a complete training run over the dataset. NNs need to be shown the data many times to fully learn, so you repeat the dataset. A batch is how many of the items in the dataset are fed to the network before updating the parameters. These sorts of numbers are called hyperparameters, because they're things you can fiddle with, but the word "parameters" was already used for weights/biases.

Attention is the magic that makes LLMs work. There are good explanations elsewhere, but briefly, it processes all the input tokens in parallel to compute some intermediate tensors, and those are then used in a second stage to emit a series of output tokens.
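A tiny illustration of two of the pieces above, the `@` matrix-multiplication operator and a linear layer followed by ReLU; the sizes are arbitrary and not taken from the post:

```python
import torch
import torch.nn as nn

# `@` calls __matmul__: a (2x3) matrix times a (3x4) matrix gives a (2x4) matrix.
a = torch.randn(2, 3)
b = torch.randn(3, 4)
print((a @ b).shape)  # torch.Size([2, 4])

# A fully connected (linear) layer with a ReLU activation on its outputs.
layer = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
x = torch.randn(8, 128)        # a batch of 8 input vectors
out = layer(x)
print(out.shape)               # torch.Size([8, 64])
print(bool((out >= 0).all()))  # True: ReLU clamps negative values to zero
```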
boredumb · almost 2 years ago
Seriously great post - one of those that I read and immediately started wishing I had read something like this a few years ago, when it was all still a bit alien to me and was explained in less digestible bits. Regardless, I got a ton out of this. Very well done.
albertzeyer · almost 2 years ago
Whenever there is a working existing implementation of a model (and maybe even a checkpoint), the most effective way to be sure your model implementation is correct is to import such an existing checkpoint and compare the model output. If it does not match (which is almost always the case, as you likely got some details wrong), you can systematically go through each of the layers. You will figure out the real differences and learn. Maybe you will even find some oddities in the existing implementation.

This is about the model itself. Training is another aspect, but usually once the hyperparameters are more or less similar, this should be fine if the model is correct.
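One way to do that layer-by-layer comparison is with forward hooks; the sketch below is generic and assumes you have your own reimplementation and a reference implementation loaded from the same checkpoint (the model names are placeholders):

```python
import torch

def capture_outputs(model: torch.nn.Module, x: torch.Tensor) -> dict:
    """Run the model once and record each submodule's first output via forward hooks."""
    outputs, hooks = {}, []
    for name, module in model.named_modules():
        hooks.append(module.register_forward_hook(
            lambda mod, inp, out, name=name: outputs.setdefault(name, out)))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return outputs

# Placeholders: `my_model` and `reference_model` would be built from the same checkpoint.
# mine, ref = capture_outputs(my_model, x), capture_outputs(reference_model, x)
# for name, value in mine.items():
#     if name in ref and isinstance(value, torch.Tensor):
#         print(name, torch.allclose(value, ref[name], atol=1e-5))
```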
fstrazzante · almost 2 years ago
Love it! Great content - both how to read a paper and, of course, the content of this specific paper. I also recommend Karpathy's Makemore series!
garfieldnate · almost 2 years ago
The TL;DR pointers are really great, and the note about asserting the shape of tensors applies to any common linear algebra library out there, as far as I know. When working on complex LA code, it's extremely important to take small steps and code defensively. In my opinion, programming linear algebra in any mainstream language is absolutely atrocious due to the lack of compile-time checking of tensor shapes, which should properly be part of a tensor's type and would make it impossible to compile if you're trying to multiply a 3x4 by a 3x4 without transposing first. It really, really sucks to run a long calculation only to fail on an operation due to mismatched dimensions.

IMO PyTorch tensors should also have their device statically typed; right now you get a run-time error if you try to multiply a tensor in CPU memory by one in GPU memory.
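Both failure modes only surface at run time today; a quick demonstration with arbitrary example tensors:

```python
import torch

a = torch.randn(3, 4)
b = torch.randn(3, 4)

try:
    a @ b  # 3x4 times 3x4: caught only when the multiplication actually runs
except RuntimeError as e:
    print("shape error:", e)

print((a @ b.T).shape)  # torch.Size([3, 3]) once you remember to transpose

if torch.cuda.is_available():
    try:
        a @ b.T.cuda()  # CPU tensor times GPU tensor: also just a run-time error
    except RuntimeError as e:
        print("device error:", e)
```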
quickthrower2 · almost 2 years ago
Llama is one of the nicer papers to read IMO.
zackcodesai · almost 2 years ago
Looks like we DDoS'd the server...
Mrjck · almost 2 years ago
https://news.ycombinator.com/item?id=37059745
forrestthewoods · almost 2 years ago
This is amazing. Thanks for sharing!
Mrjck · almost 2 years ago
I will ravage the world's network security system