The issue I'm facing with this newer batch of larger models is making longer contexts work. Is there a way to do it on sub-48GB GPUs without falling back to CPU BLAS? If Mistral-123B is already limited to about 60K context on a 24GB GPU (with zero layers offloaded to the GPU and every other app closed), and Llama-405B is somewhere around 2-3x the KV cache size, even an A100 wouldn't be enough to fit 128K tokens of KV.

I used to think that with koboldCPP, GPU VRAM wouldn't matter much when the card is only being used to accelerate prompt processing, but it's turning out to be a real problem, with no affordable card being usable at all.

It's the difference between processing 50K tokens in 30 minutes and waiting 24 hours or more for a single response; between 'barely usable' and 'utterly unusable'.

CPU generation is fine. About half a token per second isn't great, but it's workable. Though more and more I'm tempted to cut responses off and finish them myself when a good idea pops up in one.
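
For anyone who wants to sanity-check those numbers, here's the rough arithmetic I'm going by. It's just a back-of-the-envelope sketch: it assumes an fp16 KV cache and the layer/KV-head counts I believe the published configs use (88 layers / 8 KV heads for the 123B, 126 layers / 8 KV heads for the 405B); correct me if those are off.

```python
# Back-of-the-envelope KV cache sizing -- a rough sketch, not measured numbers.
# Assumes fp16 (2-byte) K/V entries and grouped-query-attention configs that I
# *think* match the released models; swap in the real values if yours differ.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Total KV cache size: K and V tensors for every layer, for every token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

GiB = 1024 ** 3

# Mistral-123B (assumed: 88 layers, 8 KV heads, head_dim 128)
mistral_60k = kv_cache_bytes(88, 8, 128, 60_000)
print(f"123B @ 60K ctx : {mistral_60k / GiB:.1f} GiB")   # ~20 GiB -> fills a 24GB card

# Llama-405B (assumed: 126 layers, 8 KV heads, head_dim 128)
llama_128k = kv_cache_bytes(126, 8, 128, 128_000)
print(f"405B @ 128K ctx: {llama_128k / GiB:.1f} GiB")    # ~60 GiB, before weights or compute buffers
```

With those assumptions the cache alone saturates a 24GB card at 60K, even with zero weight layers offloaded, which is the wall I keep hitting.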