Do LLMs not consider the probability distribution over all combinations of tokens up to a certain output length when predicting a sequence? I assumed they did that already.<p>If they don’t, I’m amazed they work as well as they do. Consider 2-bit sequence prediction with the following possible outcomes and associated probabilities:<p><pre><code> 00: p=0.36
01: p=0.04
10: p=0.30
11: p=0.30
</code></pre>
So the most likely 2-bit sequence is 00. But on the basis of predicting the next token (bit) alone, the marginal probabilities are:<p><pre><code> 0: p=0.40
1: p=0.60
</code></pre>
which suggests that 1 is the next bit and leads to a suboptimal starting point for predicting the bit after that: the best full sequence starting with 1 has probability 0.30, versus 0.36 for 00. The error is even more pronounced with longer sequences, as the joint probability distribution becomes harder to factor into marginal distributions (which is what I would expect of any minimal algorithmic description of real-world data).<p>Edit: now that I think about this a bit more, a cool research project that would be simple to carry out might be to modify the cross-entropy loss function to consider <i>only</i> the nth future token in the training text, and then plot LLM performance vs n, on the assumption that all current LLM models effectively use n=1.<p>My hypothesis is that you can mostly bypass the resource blow-up involved in predicting the joint probability distribution over the next 1 through n tokens (which scales as V^n for vocabulary size V) by just predicting the nth token directly, since doing so would implicitly require a better data model (at least for human-generated text; this wouldn’t be the case for all types of data).
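<p>A minimal Python sketch of the 2-bit example above (just replaying the same numbers) makes the mismatch concrete: the greedy next-bit choice and the argmax over full sequences disagree.<p><pre><code> # Joint distribution over 2-bit sequences, from the example above.
joint = {"00": 0.36, "01": 0.04, "10": 0.30, "11": 0.30}

# Marginal distribution of the first bit, which is all a next-token predictor sees.
marginal = {"0": 0.0, "1": 0.0}
for seq, p in joint.items():
    marginal[seq[0]] += p

print(marginal)                         # first-bit marginals: 0 -> 0.40, 1 -> 0.60
print(max(marginal, key=marginal.get))  # '1'  (greedy first bit)
print(max(joint, key=joint.get))        # '00' (most likely full sequence)
</code></pre>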
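<p>The idea of training against <i>only</i> the nth future token is also easy to sketch as a modified loss. A rough sketch, assuming a PyTorch-style model that emits per-position logits; the function name and tensor shapes are illustrative, not from any particular codebase:<p><pre><code> import torch
import torch.nn.functional as F

def nth_token_loss(logits, tokens, n=1):
    # Cross-entropy where position t predicts token t+n.
    # n=1 recovers the standard next-token language-modeling loss.
    # logits: (batch, seq_len, vocab_size); tokens: (batch, seq_len); n >= 1
    pred = logits[:, :-n, :]   # positions that still have an nth-future target
    target = tokens[:, n:]     # the nth future token for each such position
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
</code></pre>
Sweeping n and plotting held-out loss (or downstream performance) against it would be the experiment described above.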