This is what some of the new dedicated AI chips are designed to overcome. https://www.untether.ai/technology explicitly calls out the bottleneck in the von Neumann architecture and uses cells that combine compute and memory in one place. I'm pretty sure https://groq.com/ has a similar concept.

Some interesting things happen when you're memory-bandwidth limited. In particular, adding parallelism doesn't help, and in an LLM it becomes faster to use quantized 16-bit weights that get converted to float32 when used, because the CPU can convert, multiply, and add 16-bit values faster than memory can move 32-bit values to the CPU.
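As a rough illustration of that last trick, here's a minimal C sketch. It assumes the 16-bit format is bfloat16 (just the top 16 bits of a float32, so conversion is a single shift; IEEE fp16 would need a fuller conversion), and all the names are made up for illustration. The point is that the inner loop reads half as many weight bytes per multiply-add, which is exactly what you want when the memory bus, not the ALU, is the bottleneck.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* bfloat16 -> float32 is just a 16-bit left shift into the high bits,
     * so the conversion is nearly free compared to the memory access. */
    static inline float bf16_to_f32(uint16_t w) {
        uint32_t bits = (uint32_t)w << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    /* Dot product of float32 activations against bfloat16 weights.
     * Each iteration moves 2 bytes of weight data instead of 4, halving
     * the memory traffic on the (large, streamed-once) weight matrix. */
    float dot_bf16(const float *x, const uint16_t *w, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++)
            acc += x[i] * bf16_to_f32(w[i]); /* convert, multiply, add */
        return acc;
    }

    int main(void) {
        float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        /* bf16 encodings of 1.0, 0.5, 0.25, 2.0 (top halves of the f32 bits) */
        uint16_t w[4] = {0x3F80, 0x3F00, 0x3E80, 0x4000};
        printf("%f\n", dot_bf16(x, w, 4)); /* 1 + 1 + 0.75 + 8 = 10.75 */
        return 0;
    }

Real implementations do the conversion in SIMD registers a vector at a time, but the arithmetic still finishes long before the next cache line of weights arrives, which is why shrinking the weights wins.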