Do LLMs not consider the probability distribution over all combinations of tokens up to a certain output length when predicting a sequence? I assumed they did that already.<p>If they don’t, I’m amazed they work as well as they do. Consider 2-bit sequence prediction with the following possible outcomes and associated probabilities:<p><pre><code> 00: p=0.36
01: p=0.04
10: p=0.30
11: p=0.30
</code></pre>
So the most likely 2-bit sequence is 00. But on the basis of predicting the next token (bit) alone, the marginal probabilities are:<p><pre><code> 0: p=0.40
1: p=0.60
</code></pre>
which suggests that 1 is the next bit and leads to a suboptimal starting point for predicting the bit after that: the best full sequence starting with 1 has probability 0.30, versus 0.36 for 00. The error is even more pronounced with longer sequences, as the joint probability distribution becomes harder to factor into marginal distributions (which is what I would expect of any minimal algorithmic description of real-world data).<p>Edit: now that I think about this a bit more, a cool research project that would be simple to carry out might be to modify the cross-entropy loss function to consider <i>only</i> the nth future token in the training text, and then plot LLM performance vs n, on the assumption that all current LLM models effectively use n=1.<p>My hypothesis is that you can mostly bypass the resource blow-up involved in predicting the joint probability distribution over the next 1 through n tokens (which scales as V^n for vocabulary size V) by just predicting the nth token directly, since doing so would implicitly require a better data model (at least for human-generated text; this wouldn’t be the case for all types of data).
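<p>A minimal Python sketch of the 2-bit example above (just replaying the same numbers) makes the mismatch concrete: the greedy next-bit choice and the argmax over full sequences disagree.<p><pre><code> # Joint distribution over 2-bit sequences, from the example above.
joint = {"00": 0.36, "01": 0.04, "10": 0.30, "11": 0.30}

# Marginal distribution of the first bit, which is all a next-token predictor sees.
marginal = {"0": 0.0, "1": 0.0}
for seq, p in joint.items():
    marginal[seq[0]] += p

print(marginal)                         # first-bit marginals: 0 -> 0.40, 1 -> 0.60
print(max(marginal, key=marginal.get))  # '1'  (greedy first bit)
print(max(joint, key=joint.get))        # '00' (most likely full sequence)
</code></pre>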
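<p>The idea of training against <i>only</i> the nth future token is also easy to sketch as a modified loss. A rough sketch, assuming a PyTorch-style model that emits per-position logits; the function name and tensor shapes are illustrative, not from any particular codebase:<p><pre><code> import torch
import torch.nn.functional as F

def nth_token_loss(logits, tokens, n=1):
    # Cross-entropy where position t predicts token t+n.
    # n=1 recovers the standard next-token language-modeling loss.
    # logits: (batch, seq_len, vocab_size); tokens: (batch, seq_len); n >= 1
    pred = logits[:, :-n, :]   # positions that still have an nth-future target
    target = tokens[:, n:]     # the nth future token for each such position
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
</code></pre>
Sweeping n and plotting held-out loss (or downstream performance) against it would be the experiment described above.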