Am I missing something? From what I understood from Wolfram's description of GPT and "GPT in 60 lines of Python", a GPT model's only memory is the input buffer: 4k tokens for GPT3, more but still limited for GPT4.

To summarize the GPT inference process as I understood it, with GPT3 as the example:

1) The input buffer holds up to 4k tokens, drawn from a vocabulary of about 50k tokens. So the input is a vector of token ids; we can see it as a point in a high-dimensional space.

2) The core neural network is a pure function: for such an input point, it returns an output vector with one entry per vocabulary token. So here, a 50k-element vector, where each entry is the probability that the corresponding token is the next one.

The very important thing here is that the whole neural network is a pure function: same input, same output. With an immensely large, super-fast memory, this function could be implemented as a look-up table from an input point (the buffer) to an output probability vector. No memory, no side effects.

3) The probability vector is fed into a "next token" function. It doesn't just take the highest-probability token (boring result), but uses a "temperature" to randomize a bit while still respecting the output probabilities.

4) The chosen token is appended to the input buffer, keeping the total number of tokens the same (so the oldest tokens fall out). Go back to (1) until a "stop" token is selected at (3). (A rough sketch of this loop in Python is at the end of this comment.)

So in effect, the whole process is a function from a point to a point. A "point" here is the buffer seen as a high-dimensional vector, i.e. a point in a high-dimensional space. Generation is in effect a walk in this "buffer space". Prompting places the model somewhere in this space, with some semantic relation to the prompt's content (that's the magic part). Then generation is a walk through the space, with a purely deterministic part (2) and a bit of randomization (3) to make the trajectory (and its meaning, which is what we care about) more interesting to us.

So if this is correct, there is no point in injecting a lot of data into a GPT model: the output is determined entirely by what fits in the input buffer. Just input the last 4k tokens (for GPT3, more for GPT4) and you're done: everything earlier has effectively disappeared. So here, just input the last 4k tokens of a repo and save some money ;)

To avoid this limitation, one would have to summarize the previous input and make that summary part of the current input buffer. This is what chaining is all about, if I understood correctly (second sketch below). But I don't see chaining here.

Sooo... Am I missing something? Or is the author of this script the one missing something? I don't mind either way, but I'd appreciate some clarification from knowledgeable people ;)

Thanks
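
To make steps (1)-(4) concrete, here is a minimal sketch of the loop as I understand it. The model() function is just a stand-in (a real GPT would produce logits from the transformer, and a tokenizer would map text to ids); the names and constants are mine, the point is only to show the structure: a pure function from buffer to logits, temperature sampling, and a sliding window.

    import numpy as np

    VOCAB_SIZE = 50_000    # ~50k entries in the vocabulary
    CONTEXT_SIZE = 4_096   # the 4k-token input buffer of GPT-3

    def model(buffer):
        # Stand-in for the pure next-token function (step 2): one logit
        # per vocabulary entry, deterministic in the buffer contents.
        rng = np.random.default_rng(abs(hash(tuple(buffer))) % (2**32))
        return rng.normal(size=VOCAB_SIZE)

    def sample_next(logits, temperature=0.8):
        # Step 3: softmax with temperature, then sample instead of argmax.
        z = logits / temperature
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        return int(np.random.default_rng().choice(VOCAB_SIZE, p=probs))

    def generate(prompt_ids, max_new_tokens=20, stop_token=0):
        buffer = list(prompt_ids)[-CONTEXT_SIZE:]   # anything older is simply gone
        for _ in range(max_new_tokens):
            logits = model(buffer)                  # step 2: pure function of the buffer
            nxt = sample_next(logits)               # step 3: temperature sampling
            if nxt == stop_token:
                break
            buffer = (buffer + [nxt])[-CONTEXT_SIZE:]  # step 4: slide the window
        return buffer

    print(generate([11, 42, 7]))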
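
And here is the kind of "chaining" workaround I mean, again only a toy sketch: summarize() is a hypothetical helper (in practice it would itself be another call to the model), and I'm counting characters instead of tokens just to keep the example self-contained.

    def summarize(text):
        # Hypothetical helper: in a real chain this would be a model call
        # ("Summarize the following: ..."); here it just truncates so the
        # sketch runs on its own.
        return text[:500]

    def build_prompt(older_context, latest_chunk, budget_chars=16_000):
        # Rough chaining idea: whatever no longer fits in the window gets
        # replaced by a summary, and only summary + latest chunk are sent.
        if len(older_context) + len(latest_chunk) <= budget_chars:
            return older_context + latest_chunk
        return ("Summary of earlier context: " + summarize(older_context)
                + "\n" + latest_chunk)

    print(build_prompt("a" * 20_000, "the newest part of the repo"))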