This is interesting. Groq (the chip company, not Twitter’s ‘Grok’ LLM) is at a similar silicon scale; I’m not sure about the architecture, though. One very interesting thing about Groq that I failed to appreciate when they were originally raising is that the architecture is deterministic.

Why is determinism good for inference? If you’re clever, you can run computations distributed across chips without waiting on synchronization. I can’t tell from their marketing materials, but it’s also possible they went for the brass ring and built something latch-free on the silicon side.

Groq seems to have been able to use their architecture to deliver some insanely high tokens/s numbers; GroqChat is by far the fastest inference API I’ve seen.

All this to say that I’m curious what a Dojo architecture designed around training could do, presuming training was a key use case in the architectural design. Knowing the long-game thinking at Tesla, I imagine it was.
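To make the determinism point concrete, here’s a toy sketch (my own illustration, not Groq’s actual scheduling model): if every unit’s latency is fixed and known at compile time, a producer and consumer can be assigned exact cycles up front, so the runtime needs no locks, polling, or handshakes.

```python
# Toy illustration of static scheduling under deterministic latencies.
# All names and the 3-cycle latency are hypothetical, for illustration only.

PRODUCE_LATENCY = 3  # assumed fixed cycle cost of each producer op

def compile_schedule(num_items):
    """Statically assign the exact cycle at which each value is ready."""
    return {i: (i + 1) * PRODUCE_LATENCY for i in range(num_items)}

def run(num_items):
    schedule = compile_schedule(num_items)
    buffer = {}
    results = []
    for cycle in range(1, num_items * PRODUCE_LATENCY + 1):
        # Producer deposits its value at the statically known ready cycle.
        for i, ready in schedule.items():
            if ready == cycle:
                buffer[i] = i * 2
        # Consumer reads at the same precomputed cycle: no mutex, no
        # polling, because the schedule guarantees the data is there.
        for i, ready in schedule.items():
            if ready == cycle:
                results.append(buffer.pop(i))
    return results

print(run(4))  # [0, 2, 4, 6]
```

On nondeterministic hardware the consumer would need a barrier or a spin-wait here; with deterministic timing, the compiler’s schedule is the synchronization.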
It needs 18,000 amps per wafer and gives off 15 kW of heat?

This feels a little like a “you were so preoccupied with whether or not you could” thing.

Wow.

I can’t imagine how much one of those wafers must cost. I’d love to know.

Hope it accomplishes what they want, because they’ve certainly had to spend a fortune to get to this point.
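For what it’s worth, the two figures pencil out to an ordinary core voltage (my own back-of-envelope arithmetic, assuming power ≈ voltage × current; not a published Tesla number):

```python
# Back-of-envelope check: P = V * I, so 15 kW at 18,000 A implies a core
# voltage around 0.83 V -- normal for modern logic. The current is huge
# because the voltage is so low, not because the power is exotic.
power_w = 15_000    # stated heat dissipation per wafer
current_a = 18_000  # stated supply current per wafer
voltage_v = power_w / current_a
print(f"implied core voltage: {voltage_v:.2f} V")  # ~0.83 V
```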