As the discussion of GPT-4 heats up, the absence of details on its technical implementation becomes only more glaring. As an engineer, I have not learned anything applicable from the newest OpenAI publication that I didn't already know yesterday!

I have been investigating issues of LLM training and inference for quite some time, and have developed a number of hypotheses about future SoTA models, which I believe very likely apply to GPT-4.
I'd like to know how it can support a 32k context when all the other models I've seen are 2-4k. Does that mean it has a bigger attention layer, or that it's 4x as many billions of parameters?
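For intuition on why context length and parameter count are mostly decoupled: in a standard Transformer, the attention weight matrices don't depend on sequence length at all; what grows is the n-by-n score matrix, so activation memory and compute scale quadratically with context while attention parameters stay fixed. Here's a rough back-of-the-envelope sketch in Python (the d_model and head count are hypothetical GPT-3-scale values, since GPT-4's actual dimensions are not public):

    def attn_cost(seq_len, d_model=12288, n_heads=96):
        """Rough per-layer self-attention costs at a given context length.

        d_model and n_heads are illustrative GPT-3-scale guesses,
        not GPT-4's real (undisclosed) dimensions.
        """
        # Q, K, V, and output projection weights: these do NOT depend on
        # seq_len, so a longer context adds zero attention parameters.
        params = 4 * d_model * d_model
        # The attention score matrix is seq_len x seq_len per head, so
        # activation memory grows quadratically with context length.
        score_entries = n_heads * seq_len ** 2
        # QK^T and (scores @ V) each cost ~2 * n^2 * d_model multiply-adds.
        flops = 4 * seq_len ** 2 * d_model
        return params, score_entries, flops

    for n in (2048, 32768):
        params, scores, flops = attn_cost(n)
        print(f"ctx={n:6d}  params={params:.2e}  "
              f"score entries={scores:.2e}  flops={flops:.2e}")

So going from 2k to 32k leaves attention parameters unchanged but makes the score matrix roughly 256x bigger, which is why long-context models typically lean on efficient attention variants (sparse, windowed, or memory-optimized kernels) rather than simply more parameters.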