Groq's inference strategy appears to be "SRAM only": there is no external memory, like GDDR or HBM. Instead, large models are split across networked cards, and the inputs/outputs are pipelined between them (a toy sketch of this pipelining is at the end of this comment).<p>This is a great idea... in theory. But it seems like the implementation (IMO) missed the mark.<p>They are using reticle-sized dies, running at high TDPs, at 1 die per card, with long wires running the interconnect.<p>A recent Microsoft paper proposed a similar strategy, but with much more economical engineering: much smaller, cheaper SRAM-heavy chips tiled across a motherboard, with no need for a power-hungry long-range interconnect and no expensive dies on expensive PCIe cards. The interconnect is physically much shorter and lower power by virtue of being <i>on</i> the motherboard.<p>In other words, I feel that Groq took an interesting inference strategy and ignored a big part of what makes it cool, packaging it like PCIe GPUs instead of tiled accelerators. Combined with the node disadvantage and the compatibility disadvantage, I'm not sure how they can avoid falling into obscurity like Graphcore, which took a very similar SRAM-heavy approach.
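<p>To make "split across cards and pipelined" concrete, here's a toy, purely illustrative sketch of pipeline-parallel inference in Python. The stage count, the dummy per-stage compute, and the thread/queue plumbing are assumptions for the example, not Groq's actual hardware or scheduler; the point is just that each stage holds its own slice of the model, and once the pipe fills, every stage is busy on a different request at the same time.

    # Purely illustrative: pipeline-parallel inference across SRAM-sized "stages".
    # Stage boundaries, sizes, and the threading scheme are assumptions for this
    # example, not a description of Groq's actual hardware or scheduler.
    from queue import Queue
    from threading import Thread

    NUM_STAGES = 4  # e.g. one stage per card/chip holding a slice of the layers

    def make_stage(stage_id):
        # Stand-in for "run this chip's slice of the model on one activation".
        def run(x):
            return [v + stage_id for v in x]  # dummy compute
        return run

    stages = [make_stage(i) for i in range(NUM_STAGES)]
    queues = [Queue() for _ in range(NUM_STAGES + 1)]  # links between stages

    def stage_worker(stage_id):
        while True:
            x = queues[stage_id].get()
            if x is None:                        # shutdown signal
                queues[stage_id + 1].put(None)
                break
            queues[stage_id + 1].put(stages[stage_id](x))

    workers = [Thread(target=stage_worker, args=(i,)) for i in range(NUM_STAGES)]
    for w in workers:
        w.start()

    # Feed a stream of requests; once the pipe is full, all stages run in parallel
    # on different requests -- the "pipelined" part.
    for r in range(16):
        queues[0].put([r] * 8)
    queues[0].put(None)

    for w in workers:
        w.join()
    while not queues[-1].empty():
        out = queues[-1].get()
        if out is not None:
            print(out)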