I see that lots of folks are working on building LLMs that can handle more context without breaking the bank on GPUs.

Is there a real practical reason for that right now, or is it just something that everybody agrees is obvious without economic justification?
We've had language models with tiny contexts (a word or two, or a dozen characters) for a long time, going back to Markov, Shannon, and Laplace. They were called Markov chains. Almost no one guessed it (although it was known to be theoretically possible in the sense of AI-completeness), but it turns out that making the context longer unlocks, in practice, cognitive capabilities bordering on superhuman. If context length is the main difference between the Markov chains we've used for autocomplete for decades and the models that will beat you at the GRE, the bar exam, or every AP test, then it's natural to be curious what happens when the context gets even longer.
Longer context means more memory, effectively a longer history the LLM can remember. One issue I'm running into: function calling works wonderfully, but the context window gets tight even with 16k tokens. With a bigger context, the sky is the limit.
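To make the squeeze concrete, here's a rough back-of-the-envelope sketch. It assumes an OpenAI-style functions API and uses tiktoken's cl100k_base encoding as an approximation; the tool schema and counts are hypothetical, and the exact overhead varies by model. The point is that every tool definition is serialized into the prompt and competes for the same 16k budget as the conversation, retrieved documents, and the model's own output.

    # Rough sketch of why function calling squeezes a 16k context window.
    # Tool schemas are serialized into the prompt, so their tokens come out
    # of the same budget as everything else.
    import json
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # approximation; varies by model

    # Hypothetical tool definition in the style of an OpenAI-type functions API.
    weather_tool = {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }

    CONTEXT_WINDOW = 16_000
    tools = [weather_tool] * 20  # a realistic app can ship dozens of tools
    schema_tokens = sum(len(enc.encode(json.dumps(t))) for t in tools)

    print(f"tokens spent on tool schemas alone: {schema_tokens}")
    print(f"tokens left for conversation, retrieved docs, and output: "
          f"{CONTEXT_WINDOW - schema_tokens}")

With a few dozen tools plus system prompt and chat history, you can easily burn a quarter of a 16k window before the user has said anything, which is exactly why a bigger context changes what's practical.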